[GitHub] [spark] HyukjinKwon closed pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps

2020-08-18 Thread GitBox


HyukjinKwon closed pull request #29366:
URL: https://github.com/apache/spark/pull/29366


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29366:
URL: https://github.com/apache/spark/pull/29366#issuecomment-675868616


   Merged to master.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29455:
URL: https://github.com/apache/spark/pull/29455#discussion_r472731516



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##
@@ -83,15 +83,18 @@ trait ShuffleExchangeLike extends Exchange {
 case class ShuffleExchangeExec(
     override val outputPartitioning: Partitioning,
     child: SparkPlan,
-    noUserSpecifiedNumPartition: Boolean = true) extends ShuffleExchangeLike {
+    noUserSpecifiedNumPartition: Boolean = true,
+    isNullAwareAntiJoin: Boolean = false) extends ShuffleExchangeLike {

Review comment:
   This is too hacky; we shouldn't pollute the general shuffle node for one 
specific case.
   
   I think we need to develop a flexible stats-collection framework for AQE 
first. NAAJ is just one use case: it asks AQE to collect the null count.
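
   To make that concrete, here is a minimal, hypothetical sketch (not Spark's actual API; the trait and class names are invented for illustration) of what such a pluggable stats request could look like, with null counting as the NAAJ use case:
   ```
   import org.apache.spark.sql.catalyst.InternalRow

   // Hypothetical sketch only: a generic runtime-stats request that a shuffle stage
   // could evaluate for AQE, instead of hard-coding an isNullAwareAntiJoin flag.
   trait RuntimeStatsRequest[T] {
     def zero: T
     def update(acc: T, row: InternalRow): T
     def merge(a: T, b: T): T
   }

   // NAAJ would then be just one instance: count rows whose single join key is null.
   class NullCountRequest(keyOrdinal: Int) extends RuntimeStatsRequest[Long] {
     override def zero: Long = 0L
     override def update(acc: Long, row: InternalRow): Long =
       if (row.isNullAt(keyOrdinal)) acc + 1 else acc
     override def merge(a: Long, b: Long): Long = a + b
   }
   ```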





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode

2020-08-18 Thread GitBox


LuciferYang commented on a change in pull request #29000:
URL: https://github.com/apache/spark/pull/29000#discussion_r472729801



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala
##
@@ -164,4 +165,48 @@ class PartitionedWriteSuite extends QueryTest with 
SharedSparkSession {
   assert(e.getMessage.contains("Found duplicate column(s) b, b: `b`;"))
 }
   }
+
+  test("SPARK-27194 SPARK-29302: Fix commit collision in dynamic partition overwrite mode") {
+    withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key ->
+      SQLConf.PartitionOverwriteMode.DYNAMIC.toString,
+      SQLConf.FILE_COMMIT_PROTOCOL_CLASS.key ->
+        classOf[PartitionFileExistCommitProtocol].getName) {
+      withTempDir { d =>
+        withTable("t") {
+          sql(
+            s"""
+               | create table t(c1 int, p1 int) using parquet partitioned by (p1)
+               | location '${d.getAbsolutePath}'
+             """.stripMargin)
+
+          val df = Seq((1, 2)).toDF("c1", "p1")
+          df.write
+            .partitionBy("p1")
+            .mode("overwrite")
+            .saveAsTable("t")
+          checkAnswer(sql("select * from t"), df)
+        }
+      }
+    }
+  }
+}
+
+/**
+ * A file commit protocol with a pre-created partition file. When trying to overwrite a
+ * partition dir in dynamic partition overwrite mode, a FileAlreadyExistsException would be
+ * raised without SPARK-31968.
+ */
+private class PartitionFileExistCommitProtocol(
+    jobId: String,
+    path: String,
+    dynamicPartitionOverwrite: Boolean)
+  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) {
+  override def setupJob(jobContext: JobContext): Unit = {
+    super.setupJob(jobContext)
+    val stagingDir = new File(path, s".spark-staging-$jobId")

Review comment:
   @WinkerDu `path` is a URI-format string like `file:/xxx`.
   
   Right now the `conflictTaskFile` path is created as
   ```
   
/xxx/spark-warehouse/org.apache.spark.sql.sources.PartitionedWriteSuite/t/file:/xxx/spark-warehouse/org.apache.spark.sql.sources.PartitionedWriteSuite/t/.spark-staging-${jobId}/p1=2/part-0-${jobId}.c000.snappy.parquet
   ```
   It is wrong.
   
   We should strip the `file:` prefix or convert it to a local filesystem path, so that 
`conflictTaskFile` is created as
   
   ```
   
/xxx/spark-warehouse/org.apache.spark.sql.sources.PartitionedWriteSuite/t/.spark-staging-${jobId}/p1=2/part-0-${jobId}.c000.snappy.parquet
   ```
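
   A minimal sketch of the suggested fix, assuming the staging directory is resolved from the URI's path component (`path` and `jobId` are the commit protocol's constructor parameters quoted above; the rest is illustrative, not the PR's actual code):
   ```
   import java.io.File
   import java.net.URI

   // Sketch: use the path component of "file:/xxx" (or fall back to the raw string if it
   // has no path component) instead of concatenating the URI string onto the table dir.
   val localPath = Option(new URI(path).getPath).getOrElse(path)
   val stagingDir = new File(localPath, s".spark-staging-$jobId")
   ```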





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on

2020-08-18 Thread GitBox


cloud-fan commented on pull request #29455:
URL: https://github.com/apache/spark/pull/29455#issuecomment-675866532


   I think the patent is about a specific way to optimize null aware anti join, 
not the general idea. We are definitely different from Oracle as Spark is 
distributed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-675862619







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675862630







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29469:
URL: https://github.com/apache/spark/pull/29469#issuecomment-675862636







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29469:
URL: https://github.com/apache/spark/pull/29469#issuecomment-675862636







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-675862619







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675862630







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


SparkQA removed a comment on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675852755


   **[Test build #127621 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127621/testReport)**
 for PR 29470 at commit 
[`d9933b7`](https://github.com/apache/spark/commit/d9933b7e4e8fdcb49faed0458559f72d7a313f1c).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


SparkQA commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675862353


   **[Test build #127621 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127621/testReport)**
 for PR 29470 at commit 
[`d9933b7`](https://github.com/apache/spark/commit/d9933b7e4e8fdcb49faed0458559f72d7a313f1c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


SparkQA commented on pull request #29469:
URL: https://github.com/apache/spark/pull/29469#issuecomment-675862342


   **[Test build #127623 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127623/testReport)**
 for PR 29469 at commit 
[`dd3e558`](https://github.com/apache/spark/commit/dd3e558073fb85d96f1ba096c231b04d86274fc4).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


SparkQA commented on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-675862392


   **[Test build #127624 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127624/testReport)**
 for PR 29437 at commit 
[`6f147ee`](https://github.com/apache/spark/commit/6f147ee0ac20af8142a3dc715a9dbc7952f99265).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472718669



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala
##
@@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
     name: String,
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,

Review comment:
   Fixed.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
##
@@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with 
DataSourceRegister {
 paths ++ Option(map.get("path")).toSeq
   }
 
+  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): 
CaseInsensitiveStringMap = {

Review comment:
   Thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brkyvz commented on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


brkyvz commented on pull request #29469:
URL: https://github.com/apache/spark/pull/29469#issuecomment-675861260


   @cloud-fan addressed your comments



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675860612


   BTW your new profile pic looks nice @srowen!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


HyukjinKwon closed pull request #29470:
URL: https://github.com/apache/spark/pull/29470


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675859877


   Merged to master.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675858657







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675858657







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675857879


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127613/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


viirya commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472711665



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala
##
@@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
     name: String,
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,

Review comment:
   We can revert this back like other `FileTable`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


SparkQA commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675858327


   **[Test build #127622 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127622/testReport)**
 for PR 29465 at commit 
[`84846a8`](https://github.com/apache/spark/commit/84846a873658fa609148e208f403d2e010e4114b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675857822







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


HyukjinKwon commented on a change in pull request #29465:
URL: https://github.com/apache/spark/pull/29465#discussion_r472710508



##
File path: .github/workflows/build_and_test.yml
##
@@ -0,0 +1,241 @@
+name: Build and test
+
+on:
+  push:
+branches:
+- branch-2.4
+  pull_request:
+branches:
+- branch-2.4
+
+jobs:
+  # Build: build Spark and run the tests for specified modules.
+  build:
+name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }} (JDK ${{ 
matrix.java }}, ${{ matrix.hadoop }})"
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java:
+  - 1.8
+hadoop:
+  - hadoop2.6
+# TODO(SPARK-32246): We don't test 'streaming-kinesis-asl' for now.
+# Kinesis tests depends on external Amazon kinesis service.
+# Note that the modules below are from sparktestsupport/modules.py.
+modules:
+  - >-
+core, unsafe, kvstore, avro,
+network-common, network-shuffle, repl, launcher,
+examples, sketch, graphx
+  - >-
+catalyst, hive-thriftserver
+  - >-
+streaming, sql-kafka-0-10, streaming-kafka-0-10,
+mllib-local, mllib,
+yarn, mesos, kubernetes, hadoop-cloud, spark-ganglia-lgpl,
+streaming-flume, streaming-flume-sink, streaming-kafka-0-8
+  - >-
+pyspark-sql, pyspark-mllib
+  - >-
+pyspark-core, pyspark-streaming, pyspark-ml
+  - >-
+sparkr
+  - >-
+sql
+# Here, we split Hive and SQL tests into some of slow ones and the 
rest of them.
+included-tags: [""]
+excluded-tags: [""]
+comment: [""]
+include:
+  # Hive tests
+  - modules: hive
+java: 1.8
+hadoop: hadoop2.6
+included-tags: org.apache.spark.tags.SlowHiveTest
+comment: "- slow tests"
+  - modules: hive
+java: 1.8
+hadoop: hadoop2.6
+excluded-tags: org.apache.spark.tags.SlowHiveTest
+comment: "- other tests"
+env:
+  MODULES_TO_TEST: ${{ matrix.modules }}
+  EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
+  INCLUDED_TAGS: ${{ matrix.included-tags }}
+  HADOOP_PROFILE: ${{ matrix.hadoop }}
+  # GitHub Actions' default miniconda to use in pip packaging test.
+  CONDA_PREFIX: /usr/share/miniconda
+  GITHUB_PREV_SHA: ${{ github.event.before }}
+  ARROW_PRE_0_15_IPC_FORMAT: 1
+steps:
+- name: Checkout Spark repository
+  uses: actions/checkout@v2
+  # In order to fetch changed files
+  with:
+fetch-depth: 0
+# Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+- name: Cache Scala, SBT, Maven and Zinc
+  uses: actions/cache@v1
+  with:
+path: build
+key: build-${{ hashFiles('**/pom.xml') }}
+restore-keys: |
+  build-
+- name: Cache Maven local repository
+  uses: actions/cache@v2
+  with:
+path: ~/.m2/repository
+key: ${{ matrix.java }}-${{ matrix.hadoop }}-maven-${{ 
hashFiles('**/pom.xml') }}
+restore-keys: |
+  ${{ matrix.java }}-${{ matrix.hadoop }}-maven-
+- name: Cache Ivy local repository
+  uses: actions/cache@v2
+  with:
+path: ~/.ivy2/cache
+key: ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-${{ 
hashFiles('**/pom.xml') }}-${{ hashFiles('**/plugins.sbt') }}
+restore-keys: |
+  ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-
+- name: Install JDK ${{ matrix.java }}
+  uses: actions/setup-java@v1
+  with:
+java-version: ${{ matrix.java }}
+# PySpark
+- name: Install PyPy3
+  # Note that order of Python installations here matters because default 
python is
+  # overridden.
+  uses: actions/setup-python@v2
+  if: contains(matrix.modules, 'pyspark')
+  with:
+python-version: pypy3
+architecture: x64
+- name: Install Python 3.6
+  uses: actions/setup-python@v2
+  if: contains(matrix.modules, 'pyspark')
+  with:
+python-version: 3.6
+architecture: x64
+- name: Install Python 2.7
+  uses: actions/setup-python@v2
+  # Yarn has a Python specific test too, for example, YarnClusterSuite.
+  if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 
'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 
'sql-'))
+  with:
+python-version: 2.7
+architecture: x64
+- name: Install Python packages (Python 3.6 and PyPy3)
+  if: contains(matrix.modules, 'pyspark')
+  # PyArrow is not supported in PyPy yet, see ARROW-2651.
+  # TODO(SPARK-32247): scipy installation with PyPy fails for an unknown 
reason.
+  run: |
+python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner
+

[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675857822







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


HyukjinKwon commented on a change in pull request #29465:
URL: https://github.com/apache/spark/pull/29465#discussion_r472709812



##
File path: .github/workflows/build_and_test.yml
##
@@ -0,0 +1,241 @@
+name: Build and test
+
+on:
+  push:
+branches:
+- branch-2.4
+  pull_request:
+branches:
+- branch-2.4
+
+jobs:
+  # Build: build Spark and run the tests for specified modules.
+  build:
+name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }} (JDK ${{ 
matrix.java }}, ${{ matrix.hadoop }})"
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java:
+  - 1.8
+hadoop:
+  - hadoop2.6
+# TODO(SPARK-32246): We don't test 'streaming-kinesis-asl' for now.
+# Kinesis tests depends on external Amazon kinesis service.
+# Note that the modules below are from sparktestsupport/modules.py.
+modules:
+  - >-
+core, unsafe, kvstore, avro,
+network-common, network-shuffle, repl, launcher,
+examples, sketch, graphx
+  - >-
+catalyst, hive-thriftserver
+  - >-
+streaming, sql-kafka-0-10, streaming-kafka-0-10,
+mllib-local, mllib,
+yarn, mesos, kubernetes, hadoop-cloud, spark-ganglia-lgpl,
+streaming-flume, streaming-flume-sink, streaming-kafka-0-8
+  - >-
+pyspark-sql, pyspark-mllib
+  - >-
+pyspark-core, pyspark-streaming, pyspark-ml
+  - >-
+sparkr
+  - >-
+sql
+# Here, we split Hive and SQL tests into some of slow ones and the 
rest of them.
+included-tags: [""]
+excluded-tags: [""]
+comment: [""]
+include:
+  # Hive tests
+  - modules: hive
+java: 1.8
+hadoop: hadoop2.6
+included-tags: org.apache.spark.tags.SlowHiveTest
+comment: "- slow tests"
+  - modules: hive
+java: 1.8
+hadoop: hadoop2.6
+excluded-tags: org.apache.spark.tags.SlowHiveTest
+comment: "- other tests"
+env:
+  MODULES_TO_TEST: ${{ matrix.modules }}
+  EXCLUDED_TAGS: ${{ matrix.excluded-tags }}
+  INCLUDED_TAGS: ${{ matrix.included-tags }}
+  HADOOP_PROFILE: ${{ matrix.hadoop }}
+  # GitHub Actions' default miniconda to use in pip packaging test.
+  CONDA_PREFIX: /usr/share/miniconda
+  GITHUB_PREV_SHA: ${{ github.event.before }}
+  ARROW_PRE_0_15_IPC_FORMAT: 1
+steps:
+- name: Checkout Spark repository
+  uses: actions/checkout@v2
+  # In order to fetch changed files
+  with:
+fetch-depth: 0
+# Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+- name: Cache Scala, SBT, Maven and Zinc
+  uses: actions/cache@v1
+  with:
+path: build
+key: build-${{ hashFiles('**/pom.xml') }}
+restore-keys: |
+  build-
+- name: Cache Maven local repository
+  uses: actions/cache@v2
+  with:
+path: ~/.m2/repository
+key: ${{ matrix.java }}-${{ matrix.hadoop }}-maven-${{ 
hashFiles('**/pom.xml') }}
+restore-keys: |
+  ${{ matrix.java }}-${{ matrix.hadoop }}-maven-
+- name: Cache Ivy local repository
+  uses: actions/cache@v2
+  with:
+path: ~/.ivy2/cache
+key: ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-${{ 
hashFiles('**/pom.xml') }}-${{ hashFiles('**/plugins.sbt') }}
+restore-keys: |
+  ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-
+- name: Install JDK ${{ matrix.java }}
+  uses: actions/setup-java@v1
+  with:
+java-version: ${{ matrix.java }}
+# PySpark
+- name: Install PyPy3
+  # Note that order of Python installations here matters because default 
python is
+  # overridden.
+  uses: actions/setup-python@v2
+  if: contains(matrix.modules, 'pyspark')
+  with:
+python-version: pypy3
+architecture: x64
+- name: Install Python 3.6
+  uses: actions/setup-python@v2
+  if: contains(matrix.modules, 'pyspark')
+  with:
+python-version: 3.6
+architecture: x64
+- name: Install Python 2.7
+  uses: actions/setup-python@v2
+  # Yarn has a Python specific test too, for example, YarnClusterSuite.
+  if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 
'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 
'sql-'))
+  with:
+python-version: 2.7
+architecture: x64
+- name: Install Python packages (Python 3.6 and PyPy3)
+  if: contains(matrix.modules, 'pyspark')
+  # PyArrow is not supported in PyPy yet, see ARROW-2651.
+  # TODO(SPARK-32247): scipy installation with PyPy fails for an unknown 
reason.
+  run: |
+python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner
+

[GitHub] [spark] HyukjinKwon commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675857451


   It should be ready to be reviewed or merged now.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4

2020-08-18 Thread GitBox


HyukjinKwon removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675837138







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on

2020-08-18 Thread GitBox


c21 commented on a change in pull request #29455:
URL: https://github.com/apache/spark/pull/29455#discussion_r472706333



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateNullAwareAntiJoin.scala
##
@@ -20,22 +20,50 @@ package org.apache.spark.sql.execution.adaptive
 import 
org.apache.spark.sql.catalyst.planning.ExtractSingleColumnNullAwareAntiJoin
 import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
 import org.apache.spark.sql.execution.joins.EmptyHashedRelationWithAllNullKeys
+import 
org.apache.spark.sql.execution.metric.SQLShuffleWriteMetricsReporter.SHUFFLE_RECORDS_WRITTEN
 
 /**
- * This optimization rule detects and convert a NAAJ to an Empty LocalRelation
- * when buildSide is EmptyHashedRelationWithAllNullKeys.
+ * This optimization rule try to Eliminate NAAJ by following patterns.
+ * 1. Convert a NAAJ to an Empty LocalRelation when buildSide is 
EmptyHashedRelationWithAllNullKeys.
+ * 2. Convert a NAAJ to left plan when buildSide is empty.
  */
 object EliminateNullAwareAntiJoin extends Rule[LogicalPlan] {
 
-  private def canEliminate(plan: LogicalPlan): Boolean = plan match {
+  private def canConvertToEmptyLocalRelation(plan: LogicalPlan): Boolean = 
plan match {
 case LogicalQueryStage(_, stage: BroadcastQueryStageExec) if 
stage.resultOption.get().isDefined
   && stage.broadcast.relationFuture.get().value == 
EmptyHashedRelationWithAllNullKeys => true
+case LogicalQueryStage(_, stage: ShuffleQueryStageExec) if 
stage.resultOption.get().isDefined =>
+  // Equivalent to EmptyHashedRelationWithAllNullKeys
+  val nullPartitionKeyMetrics = 
stage.shuffle.asInstanceOf[ShuffleExchangeExec]

Review comment:
   Just +1 to @agrawaldevesh's comment. I feel that aggregated 
`SQLMetrics` can be approximate (e.g. with failed/speculated tasks), and it 
seems quite dangerous here to depend on the exact value of a `SQLMetric` 
(e.g. > 0 or == 0 below), which can be a correctness concern.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675853113







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675853113







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports

2020-08-18 Thread GitBox


SparkQA commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675852755


   **[Test build #127621 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127621/testReport)**
 for PR 29470 at commit 
[`d9933b7`](https://github.com/apache/spark/commit/d9933b7e4e8fdcb49faed0458559f72d7a313f1c).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29467:
URL: https://github.com/apache/spark/pull/29467#discussion_r472697970



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ObjectSerializerPruningSuite.scala
##
@@ -107,4 +108,34 @@ class ObjectSerializerPruningSuite extends PlanTest {
   comparePlans(optimized, expected)
 }
   }
+
+  test("SPARK-32652: Prune nested serializers: RowEncoder") {
+withSQLConf(SQLConf.SERIALIZER_NESTED_SCHEMA_PRUNING_ENABLED.key -> 
"true") {
+  val testRelation = LocalRelation('i.struct(StructType.fromDDL("a int, b 
string")), 'j.int)
+  val rowEncoder = RowEncoder(new StructType()
+.add("i", new StructType().add("a", "int").add("b", "string"))
+.add("j", "int"))
+  val serializerObject = CatalystSerde.serialize(
+CatalystSerde.deserialize(testRelation)(rowEncoder))(rowEncoder)
+  val query = serializerObject.select($"i.a")
+  val optimized = Optimize.execute(query.analyze)
+
+  val prunedSerializer = serializerObject.serializer.head.transformDown {
+case CreateNamedStruct(children) => CreateNamedStruct(children.take(2))
+  }.transformUp {
+// Aligns null literal in `If` expression to make it resolvable.
+case i @ If(invoke: Invoke, Literal(null, dt), ser) if 
invoke.functionName == "isNullAt" &&
+!dt.sameType(ser.dataType) =>
+  i.copy(trueValue = Literal(null, ser.dataType))
+  }.asInstanceOf[NamedExpression]
+
+  // `name` in `GetStructField` affects `comparePlans`. Maybe we can ignore
+  // `name` in `GetStructField.equals`?

Review comment:
   Note: I just copied this comment from the existing test case above. If 
we can fix it, we should remove comments from both test cases.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472696179



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
##
@@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with 
DataSourceRegister {
 paths ++ Option(map.get("path")).toSeq
   }
 
+  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): 
CaseInsensitiveStringMap = {

Review comment:
   nit:
   ```
   val withoutPath = options.asCaseSensitiveMap().asScala.filterKeys {
     k => !k.equalsIgnoreCase("path") && !k.equalsIgnoreCase("paths")
   }
   new CaseInsensitiveStringMap(withoutPath.asJava)
   ```
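
   For reference, a self-contained usage sketch of that snippet (assuming Scala 2.12, where `filterKeys` returns a `Map`; the option values here are made up for illustration):
   ```
   import scala.collection.JavaConverters._

   import org.apache.spark.sql.util.CaseInsensitiveStringMap

   // Made-up options: one "path" entry that should be dropped, one real reader option.
   val options = new CaseInsensitiveStringMap(
     Map("path" -> "/tmp/data", "header" -> "true").asJava)

   val withoutPath = options.asCaseSensitiveMap().asScala.filterKeys {
     k => !k.equalsIgnoreCase("path") && !k.equalsIgnoreCase("paths")
   }
   val cleaned = new CaseInsensitiveStringMap(withoutPath.asJava)

   assert(!cleaned.containsKey("path"))        // path/paths entries are gone
   assert(cleaned.getBoolean("header", false)) // other options are preserved
   ```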





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472694528



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala
##
@@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
     name: String,
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,

Review comment:
   do we still need to rename it?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29470:
URL: https://github.com/apache/spark/pull/29470#issuecomment-675850875


   Oh actually we should add it into 
`python/docs/source/reference/pyspark.ml.rst` now. I know another missing one. 
Let me push it to here and just merge.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29104: [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29104:
URL: https://github.com/apache/spark/pull/29104#discussion_r472691173



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -903,15 +910,61 @@ private[joins] object LongHashedRelation {
   if (!rowKey.isNullAt(0)) {
 val key = rowKey.getLong(0)
 map.append(key, unsafeRow)
+  } else if (isNullAware) {
+return EmptyHashedRelationWithAllNullKeys
   }
 }
 map.optimize()
 new LongHashedRelation(numFields, map)
   }
 }
 
+/**
+ * Common trait with dummy implementation for NAAJ special HashedRelation
+ * EmptyHashedRelation
+ * EmptyHashedRelationWithAllNullKeys
+ */
+trait NullAwareHashedRelation extends HashedRelation with Externalizable {
+  override def get(key: InternalRow): Iterator[InternalRow] = {
+throw new UnsupportedOperationException
+  }
+
+  override def getValue(key: InternalRow): InternalRow = {
+throw new UnsupportedOperationException
+  }
+
+  override def keyIsUnique: Boolean = true
+
+  override def keys(): Iterator[InternalRow] = {
+throw new UnsupportedOperationException
+  }
+
+  override def close(): Unit = {}
+
+  override def writeExternal(out: ObjectOutput): Unit = {}
+
+  override def readExternal(in: ObjectInput): Unit = {}
+
+  override def estimatedSize: Long = 0
+}
+
+/**
+ * A special HashedRelation indicates it built from a empty 
input:Iterator[InternalRow].
+ */
+object EmptyHashedRelation extends NullAwareHashedRelation {
+  override def asReadOnlyCopy(): EmptyHashedRelation.type = this
+}
+
+/**
+ * A special HashedRelation indicates it built from a non-empty 
input:Iterator[InternalRow],
+ * which contains all null columns key.
+ */
+object EmptyHashedRelationWithAllNullKeys extends NullAwareHashedRelation {
+  override def asReadOnlyCopy(): EmptyHashedRelationWithAllNullKeys.type = this

Review comment:
   probably just remove `Empty` to make it `HashedRelationWithAllNullKeys`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29467:
URL: https://github.com/apache/spark/pull/29467#issuecomment-675849620


   Merged to master and branch-3.0.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder

2020-08-18 Thread GitBox


HyukjinKwon closed pull request #29467:
URL: https://github.com/apache/spark/pull/29467


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder

2020-08-18 Thread GitBox


HyukjinKwon commented on a change in pull request #29467:
URL: https://github.com/apache/spark/pull/29467#discussion_r472689757



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ObjectSerializerPruningSuite.scala
##
@@ -107,4 +108,34 @@ class ObjectSerializerPruningSuite extends PlanTest {
   comparePlans(optimized, expected)
 }
   }
+
+  test("SPARK-32652: Prune nested serializers: RowEncoder") {
+withSQLConf(SQLConf.SERIALIZER_NESTED_SCHEMA_PRUNING_ENABLED.key -> 
"true") {
+  val testRelation = LocalRelation('i.struct(StructType.fromDDL("a int, b 
string")), 'j.int)
+  val rowEncoder = RowEncoder(new StructType()
+.add("i", new StructType().add("a", "int").add("b", "string"))
+.add("j", "int"))
+  val serializerObject = CatalystSerde.serialize(
+CatalystSerde.deserialize(testRelation)(rowEncoder))(rowEncoder)
+  val query = serializerObject.select($"i.a")
+  val optimized = Optimize.execute(query.analyze)
+
+  val prunedSerializer = serializerObject.serializer.head.transformDown {
+case CreateNamedStruct(children) => CreateNamedStruct(children.take(2))
+  }.transformUp {
+// Aligns null literal in `If` expression to make it resolvable.
+case i @ If(invoke: Invoke, Literal(null, dt), ser) if 
invoke.functionName == "isNullAt" &&
+!dt.sameType(ser.dataType) =>
+  i.copy(trueValue = Literal(null, ser.dataType))
+  }.asInstanceOf[NamedExpression]
+
+  // `name` in `GetStructField` affects `comparePlans`. Maybe we can ignore
+  // `name` in `GetStructField.equals`?

Review comment:
   Let me just merge this. In the PR that actually ignores the equality, we 
can remove this comment and also the code clearing the name below.
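
   For context, a minimal sketch of the kind of name-clearing the test currently does (`prunedSerializer` is the expression built in the quoted test; this is illustrative only):
   ```
   import org.apache.spark.sql.catalyst.expressions.GetStructField

   // Sketch: normalize away the optional field name before comparing plans, since
   // GetStructField.equals currently takes `name` into account.
   val withoutNames = prunedSerializer.transform {
     case g: GetStructField => g.copy(name = None)
   }
   ```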





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on

2020-08-18 Thread GitBox


c21 commented on a change in pull request #29455:
URL: https://github.com/apache/spark/pull/29455#discussion_r472680557



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##
@@ -83,15 +83,18 @@ trait ShuffleExchangeLike extends Exchange {
 case class ShuffleExchangeExec(
     override val outputPartitioning: Partitioning,
     child: SparkPlan,
-    noUserSpecifiedNumPartition: Boolean = true) extends ShuffleExchangeLike {
+    noUserSpecifiedNumPartition: Boolean = true,
+    isNullAwareAntiJoin: Boolean = false) extends ShuffleExchangeLike {

Review comment:
   Hi Devesh, I am totally with you on the benchmarking perspective and on 
avoiding customer surprise. I think we are discussing whether this kind of query 
pattern is popular or not. I think renaming the `isNullAwareAntiJoin` parameter 
can alleviate my concern.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-675844857







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-675844857







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC

2020-08-18 Thread GitBox


SparkQA commented on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-675844652


   **[Test build #127620 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127620/testReport)**
 for PR 28953 at commit 
[`13f0dfc`](https://github.com/apache/spark/commit/13f0dfc2078f0933be01338a0fbf4d69113a9f4e).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brkyvz commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


brkyvz commented on a change in pull request #29469:
URL: https://github.com/apache/spark/pull/29469#discussion_r472671268



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/AlreadyPlannedSuite.scala
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.test.SharedSparkSession
+
+class AlreadyPlannedSuite extends SparkPlanTest with SharedSparkSession {
+
+  import testImplicits._
+
+  test("simple execution") {
+val df = spark.range(10)
+val planned = AlreadyPlanned.dataFrame(spark, df.queryExecution.sparkPlan)
+
+checkAnswer(planned, identity, df.toDF().collect())
+  }
+
+  test("planning on top won't work - filter") {
+val df = spark.range(10)
+val planned = AlreadyPlanned.dataFrame(spark, df.queryExecution.sparkPlan)
+
+val e = intercept[AnalysisException] {
+  planned.where('id < 5)

Review comment:
   Yeah, you're right. This would actually prevent the source from adding modifications to the input DataFrame, e.g. attaching hidden columns or such. Let me fix this.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29453:
URL: https://github.com/apache/spark/pull/29453#issuecomment-675841844







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29453:
URL: https://github.com/apache/spark/pull/29453#issuecomment-675841844







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brkyvz commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


brkyvz commented on a change in pull request #29469:
URL: https://github.com/apache/spark/pull/29469#discussion_r472668735



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/AlreadyPlanned.scala
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
+
+/**
+ * A special node that allows skipping query planning all the way to physical 
execution. This node
+ * can be used when a query was planned to the physical level, but we had to 
go back to logical plan
+ * land for some reason (e.g. V1 DataSource write execution). This will allow 
the metrics, and the
+ * query plan to properly appear as part of the query execution.
+ */
+case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
+  override val output: Seq[Attribute] = physicalPlan.output
+  override lazy val resolved: Boolean = false

Review comment:
   That wasn't sufficient. We need to pass in the `executedPlan` for both the spark and executed plans; otherwise one of the test suites fails with:
   ```
   sbt.ForkMain$ForkError: java.lang.UnsupportedOperationException: null
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doProduce(WholeStageCodegenExec.scala:790)
   ```
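   
   For illustration only (not necessarily the exact shape of this PR's change), a minimal sketch of a `QueryExecution` that hands a pre-built physical plan back for both `sparkPlan` and `executedPlan`, so preparations such as whole-stage codegen are not re-applied; the class and parameter names here are hypothetical:
   
   ```
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.execution.{QueryExecution, SparkPlan}
   
   // Hypothetical sketch: reuse an already-prepared physical plan for both the
   // planned and the executed stage instead of re-running planning/preparation.
   class AlreadyPlannedExecution(
       session: SparkSession,
       logical: LogicalPlan,
       prepared: SparkPlan) extends QueryExecution(session, logical) {
     override lazy val sparkPlan: SparkPlan = prepared
     override lazy val executedPlan: SparkPlan = prepared
   }
   ```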





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos

2020-08-18 Thread GitBox


SparkQA commented on pull request #29453:
URL: https://github.com/apache/spark/pull/29453#issuecomment-675841543


   **[Test build #127619 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127619/testReport)**
 for PR 29453 at commit 
[`b99ced4`](https://github.com/apache/spark/commit/b99ced462b8fd105a5a9fefd0ef6dc4e852d3257).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a change in pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos

2020-08-18 Thread GitBox


ulysses-you commented on a change in pull request #29453:
URL: https://github.com/apache/spark/pull/29453#discussion_r472667471



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
##
@@ -3035,7 +3035,10 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
 val msg = intercept[AnalysisException] {
   sql("REFRESH FUNCTION md5")
 }.getMessage
-assert(msg.contains("Cannot refresh builtin function"))
+assert(msg.contains("Cannot refresh built-in function"))
+intercept[NoSuchFunctionException] {

Review comment:
   added.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm

2020-08-18 Thread GitBox


Ngone51 commented on a change in pull request #29452:
URL: https://github.com/apache/spark/pull/29452#discussion_r472663605



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -1051,15 +1049,19 @@ private[spark] class TaskSetManager(
   logDebug("Task length threshold for speculation: " + threshold)
   for (tid <- runningTasksSet) {
 var speculated = checkAndSubmitSpeculatableTask(tid, time, threshold)
-if (!speculated && tidToExecutorKillTimeMapping.contains(tid)) {
-  // Check whether this task will finish before the exectorKillTime 
assuming
-  // it will take medianDuration overall. If this task cannot finish 
within
-  // executorKillInterval, then this task is a candidate for 
speculation
-  val taskEndTimeBasedOnMedianDuration = taskInfos(tid).launchTime + 
medianDuration
-  val canExceedDeadline = tidToExecutorKillTimeMapping(tid) <
-taskEndTimeBasedOnMedianDuration
-  if (canExceedDeadline) {
-speculated = checkAndSubmitSpeculatableTask(tid, time, 0)
+if (!speculated && executorDecommissionKillInterval.nonEmpty) {
+  val taskInfo = taskInfos(tid)
+  val decomState = 
sched.getExecutorDecommissionState(taskInfo.executorId)
+  if (decomState.nonEmpty) {
+// Check whether this task will finish before the exectorKillTime 
assuming

Review comment:
   We don't have `exectorKillTime` anymore after removing 
`tidToExecutorKillTimeMapping`. Shall we reword the comment a little bit?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-18 Thread GitBox


LuciferYang commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-675838627


   Thx~ @HyukjinKwon 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-675838315







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675838241







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-675838315







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675838241







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675838032


   **[Test build #127617 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127617/testReport)**
 for PR 29465 at commit 
[`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] leanken commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on

2020-08-18 Thread GitBox


leanken commented on a change in pull request #29455:
URL: https://github.com/apache/spark/pull/29455#discussion_r472660831



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##
@@ -225,7 +231,8 @@ object ShuffleExchangeExec {
   outputAttributes: Seq[Attribute],
   newPartitioning: Partitioning,
   serializer: Serializer,
-  writeMetrics: Map[String, SQLMetric])
+  writeMetrics: Map[String, SQLMetric],
+  isNullAwareAntiJoin: Boolean = false)

Review comment:
   how about `collectNullPartitionKeyMetrics`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-18 Thread GitBox


SparkQA commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-675838022


   **[Test build #127618 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127618/testReport)**
 for PR 29434 at commit 
[`36a317a`](https://github.com/apache/spark/commit/36a317abca5a861a7b0c27317c80d030020838ed).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-675837861


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675831629


   **[Test build #127615 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127615/testReport)**
 for PR 29465 at commit 
[`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675832423







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675837138


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm

2020-08-18 Thread GitBox


Ngone51 commented on a change in pull request #29452:
URL: https://github.com/apache/spark/pull/29452#discussion_r472657515



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
##
@@ -18,11 +18,22 @@
 package org.apache.spark.scheduler
 
 /**
- * Provides more detail when an executor is being decommissioned.
+ * Message providing more detail when an executor is being decommissioned.
  * @param message Human readable reason for why the decommissioning is 
happening.
  * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in 
other places) is
  * being decommissioned too. Used to infer if the 
shuffle data might
  * be lost even if the external shuffle service is 
enabled.
  */
 private[spark]
 case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: 
Boolean)
+
+/**
+ * State related to decommissioning that is kept by the TaskSchedulerImpl. 
This state is derived
+ * from the info message above but it is kept distinct to allow the state to 
evolve independently
+ * from the message.
+ */
+case class ExecutorDecommissionState(
+message: String,
+// Timestamp in milliseconds when decommissioning was triggered
+tsMillis: Long,

Review comment:
   How about `startTime`?
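   
   A sketch of the quoted case class with that rename applied (illustrative only; the fields not shown in the quoted diff are left out here):
   
   ```
   package org.apache.spark.scheduler
   
   // Illustrative only: the suggested rename applied to the fields shown above.
   private[spark] case class ExecutorDecommissionState(
       message: String,
       // Timestamp in milliseconds when decommissioning was triggered
       startTime: Long)
   ```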





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675836455







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675836455







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm

2020-08-18 Thread GitBox


Ngone51 commented on a change in pull request #29452:
URL: https://github.com/apache/spark/pull/29452#discussion_r472655476



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -1051,15 +1049,19 @@ private[spark] class TaskSetManager(
   logDebug("Task length threshold for speculation: " + threshold)
   for (tid <- runningTasksSet) {
 var speculated = checkAndSubmitSpeculatableTask(tid, time, threshold)
-if (!speculated && tidToExecutorKillTimeMapping.contains(tid)) {
-  // Check whether this task will finish before the exectorKillTime 
assuming
-  // it will take medianDuration overall. If this task cannot finish 
within
-  // executorKillInterval, then this task is a candidate for 
speculation
-  val taskEndTimeBasedOnMedianDuration = taskInfos(tid).launchTime + 
medianDuration
-  val canExceedDeadline = tidToExecutorKillTimeMapping(tid) <
-taskEndTimeBasedOnMedianDuration
-  if (canExceedDeadline) {
-speculated = checkAndSubmitSpeculatableTask(tid, time, 0)
+if (!speculated && executorDecommissionKillInterval.nonEmpty) {

Review comment:
   I see, we don't have `tidToExecutorKillTimeMapping` anymore.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA removed a comment on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675791761


   **[Test build #127607 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127607/testReport)**
 for PR 29460 at commit 
[`511d611`](https://github.com/apache/spark/commit/511d611ad0e307ebe9b8bd06227241774d1ba25f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA commented on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675835877


   **[Test build #127607 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127607/testReport)**
 for PR 29460 at commit 
[`511d611`](https://github.com/apache/spark/commit/511d611ad0e307ebe9b8bd06227241774d1ba25f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675832423







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675832473







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675832415


   **[Test build #127615 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127615/testReport)**
 for PR 29465 at commit 
[`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).
* This patch **fails build dependency tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm

2020-08-18 Thread GitBox


Ngone51 commented on a change in pull request #29452:
URL: https://github.com/apache/spark/pull/29452#discussion_r472653962



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
##
@@ -18,11 +18,21 @@
 package org.apache.spark.scheduler
 
 /**
- * Provides more detail when an executor is being decommissioned.
+ * Message providing more detail when an executor is being decommissioned.
  * @param message Human readable reason for why the decommissioning is 
happening.
  * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in 
other places) is
  * being decommissioned too. Used to infer if the 
shuffle data might
  * be lost even if the external shuffle service is 
enabled.
  */
 private[spark]
 case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: 
Boolean)
+
+/**
+ * State related to decommissioning that is kept by the TaskSchedulerImpl. 
This state is derived
+ * from the info message above but it is kept distinct to allow the state to 
evolve independently
+ * from the message.
+ */
+case class ExecutorDecommissionState(message: String,

Review comment:
   Yes, decoupling +1





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675793629


   **[Test build #127608 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127608/testReport)**
 for PR 29465 at commit 
[`ed33409`](https://github.com/apache/spark/commit/ed334099b0ffd5fffa846e7651f706a659d29f1c).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29468: [SPARK-32653][CORE] Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl

2020-08-18 Thread GitBox


Ngone51 commented on a change in pull request #29468:
URL: https://github.com/apache/spark/pull/29468#discussion_r472645226



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -1062,25 +1062,36 @@ private[spark] class TaskSchedulerImpl(
   }
 
   def getExecutorsAliveOnHost(host: String): Option[Set[String]] = 
synchronized {
-hostToExecutors.get(host).map(_.toSet)
+
hostToExecutors.get(host).map(_.filterNot(isExecutorDecommissioned)).map(_.toSet)
   }
 
   def hasExecutorsAliveOnHost(host: String): Boolean = synchronized {
-hostToExecutors.contains(host)
+  hostToExecutors.get(host)
+.exists(executors => executors.exists(e => 
!isExecutorDecommissioned(e)))
   }
 
   def hasHostAliveOnRack(rack: String): Boolean = synchronized {
-hostsByRack.contains(rack)
+hostsByRack.get(rack)

Review comment:
   `hostToExecutors` is also used to handle executor removal and blacklisting; in other words, a decommissioned executor could also be removed or blacklisted at the same time. So removing the executor from `hostToExecutors` might also affect other places, and I don't want to mix them together and make things complex.
   
   Besides, similar to `executorIdToRunningTaskIds`, a decommissioned executor could still have running tasks on it, so we don't eagerly remove it from `executorIdToRunningTaskIds` either.
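   
   A small standalone sketch of the "filter, don't remove" idea (toy names and data, not code from this PR):
   
   ```
   import scala.collection.mutable
   
   object DecommissionFilterSketch {
     // Decommissioned executors stay in the host -> executors map but are filtered
     // out when asking which executors are alive on a host.
     private val hostToExecutors = mutable.HashMap("host1" -> Set("exec1", "exec2"))
     private val decommissioned = mutable.Set("exec2")
   
     private def isExecutorDecommissioned(execId: String): Boolean =
       decommissioned.contains(execId)
   
     def getExecutorsAliveOnHost(host: String): Option[Set[String]] =
       hostToExecutors.get(host).map(_.filterNot(isExecutorDecommissioned))
   
     def main(args: Array[String]): Unit =
       println(getExecutorsAliveOnHost("host1"))  // Some(Set(exec1))
   }
   ```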





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #29468: [SPARK-32653][CORE] Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl

2020-08-18 Thread GitBox


Ngone51 commented on a change in pull request #29468:
URL: https://github.com/apache/spark/pull/29468#discussion_r472651414



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -1062,25 +1062,36 @@ private[spark] class TaskSchedulerImpl(
   }
 
   def getExecutorsAliveOnHost(host: String): Option[Set[String]] = 
synchronized {
-hostToExecutors.get(host).map(_.toSet)
+
hostToExecutors.get(host).map(_.filterNot(isExecutorDecommissioned)).map(_.toSet)

Review comment:
   Hmm... I think this is more a problem of definition. In this PR, what I'm trying to do is strengthen the definition of the active status for an executor or host: an active executor or host should not be decommissioned at the same time. This function returns the alive executors on a host, so we should filter out decommissioned ones; I think this is quite obvious. And this function is used by multiple tests.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation

2020-08-18 Thread GitBox


gengliangwang closed pull request #29458:
URL: https://github.com/apache/spark/pull/29458


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675832431


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127615/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] leanken commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on

2020-08-18 Thread GitBox


leanken commented on a change in pull request #29455:
URL: https://github.com/apache/spark/pull/29455#discussion_r472649111



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
##
@@ -235,8 +235,13 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   .getOrElse(createJoinWithoutHint())
 
   case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) =>
-Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, 
BuildRight,
-  None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = 
true))
+if (!canBroadcastBySize(j.right, conf) && 
conf.adaptiveExecutionEnabled) {

Review comment:
   > Sorry I don't get why we don't worry about shuffled hash join OOM if 
sub-query is way too big/skewed in certain keys. In general, shuffled hash join 
OOM is a normal issue if people choose shuffled hash join blindly. Could we add 
a check to make sure build size is not too big to be shuffled hash join?
   > 
   > e.g.
   > 
   > ```
   > if (!canBroadcastBySize(j.right, conf) && 
canBuildLocalHashMapBySize(j.right, conf) && conf.adaptiveExecutionEnabled) {
   >   ...
   > }
   > ```
   
   Do you have any contact info so we can communicate offline? Maybe you can reach me on WeChat: "18518298234".
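   
   (For reference, a rough sketch of what the size guard in the quoted suggestion could check, mirroring the existing local-hash-map heuristic; the exact formula here is an assumption, not code from this PR:)
   
   ```
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.internal.SQLConf
   
   // Rough sketch: the build side should be small enough that each partition's share
   // can fit into a local hash map after shuffling.
   def canBuildLocalHashMapBySize(plan: LogicalPlan, conf: SQLConf): Boolean =
     plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
   ```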





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #29472: [SPARK-32655][K8S] Support appId/execId placeholder in K8s `spark.local.dir` conf

2020-08-18 Thread GitBox


holdenk commented on pull request #29472:
URL: https://github.com/apache/spark/pull/29472#issuecomment-675833177


   Yeah, ok, I see the reason for the failure. That's my bad; I should have run the K8s integration tests on the other decom test change PR since it indirectly impacted logging. Let me push a fix.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #29472: [SPARK-32655][K8S] Support appId/execId placeholder in K8s `spark.local.dir` conf

2020-08-18 Thread GitBox


holdenk commented on pull request #29472:
URL: https://github.com/apache/spark/pull/29472#issuecomment-675832017


   Ok thanks for the heads up. I'll look into this and the different PRs around 
test stuff tonight/tomorrow.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675832115


   **[Test build #127608 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127608/testReport)**
 for PR 29465 at commit 
[`ed33409`](https://github.com/apache/spark/commit/ed334099b0ffd5fffa846e7651f706a659d29f1c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675832002







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675832002







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation

2020-08-18 Thread GitBox


gengliangwang commented on pull request #29458:
URL: https://github.com/apache/spark/pull/29458#issuecomment-675832059


   Merging to master



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29458:
URL: https://github.com/apache/spark/pull/29458#issuecomment-675831971







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675831952







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29458:
URL: https://github.com/apache/spark/pull/29458#issuecomment-675831971







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


AmplabJenkins commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675831952







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation

2020-08-18 Thread GitBox


gengliangwang commented on a change in pull request #29458:
URL: https://github.com/apache/spark/pull/29458#discussion_r472644080



##
File path: docs/sql-migration-guide.md
##
@@ -36,6 +36,8 @@ license: |
 
   - In Spark 3.1, NULL elements of structures, arrays and maps are converted 
to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements 
are converted to empty strings. To restore the behavior before Spark 3.1, you 
can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.
 
+  - In Spark 3.1, when `spark.sql.ansi.enabled` is false, Spark always returns 
null if the sum of decimal type column overflows. In Spark 3.0 or earlier, when 
`spark.sql.ansi.enabled` is false, the sum of decimal type column may return 
null or incorrect result, or even fails at runtime (depending on the actual 
query plan execution).
+

Review comment:
   @maropu Thanks





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA commented on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675831629


   **[Test build #127615 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127615/testReport)**
 for PR 29465 at commit 
[`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


SparkQA commented on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675831667


   **[Test build #127616 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127616/testReport)**
 for PR 29460 at commit 
[`d029dba`](https://github.com/apache/spark/commit/d029dba3f72c1d94d9a1b560168c06f4ca21fd6d).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


HyukjinKwon commented on pull request #29460:
URL: https://github.com/apache/spark/pull/29460#issuecomment-675831293


   Should be ready to be reviewed or merged.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29469:
URL: https://github.com/apache/spark/pull/29469#discussion_r472642251



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/AlreadyPlanned.scala
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
+
+/**
+ * A special node that allows skipping query planning all the way to physical 
execution. This node
+ * can be used when a query was planned to the physical level, but we had to 
go back to logical plan
+ * land for some reason (e.g. V1 DataSource write execution). This will allow 
the metrics, and the
+ * query plan to properly appear as part of the query execution.
+ */
+case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
+  override val output: Seq[Attribute] = physicalPlan.output
+  override lazy val resolved: Boolean = false

Review comment:
   OK, I got it now. We want to forbid further operations on the DataFrame. But we don't have to forbid it, right?
   
   How about
   ```
   class AlreadyOptimizedExecution(plan: LogicalPlan) {
 override lazy val analyzed: LogicalPlan = plan
 override lazy val optimizedPlan: LogicalPlan = plan
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well

2020-08-18 Thread GitBox


HyukjinKwon removed a comment on pull request #29465:
URL: https://github.com/apache/spark/pull/29465#issuecomment-675792279


   Now it should pass all tests. I will port back once I get a green



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29469:
URL: https://github.com/apache/spark/pull/29469#discussion_r472642251



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/AlreadyPlanned.scala
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
+
+/**
+ * A special node that allows skipping query planning all the way to physical 
execution. This node
+ * can be used when a query was planned to the physical level, but we had to 
go back to logical plan
+ * land for some reason (e.g. V1 DataSource write execution). This will allow 
the metrics, and the
+ * query plan to properly appear as part of the query execution.
+ */
+case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
+  override val output: Seq[Attribute] = physicalPlan.output
+  override lazy val resolved: Boolean = false

Review comment:
   OK, I got it now. We want to forbid further operations on the DataFrame.
   
   How about
   ```
   class AlreadyOptimizedExecution(plan: LogicalPlan) {
 override lazy val analyzed: LogicalPlan = plan
 override lazy val optimizedPlan: LogicalPlan = plan
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


