[GitHub] [spark] HyukjinKwon closed pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
HyukjinKwon closed pull request #29366: URL: https://github.com/apache/spark/pull/29366 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
HyukjinKwon commented on pull request #29366: URL: https://github.com/apache/spark/pull/29366#issuecomment-675868616 Merged to master.
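The optimization merged in SPARK-32550 follows a common JVM hot-path pattern: in frequently invoked constructors, replacing a functional `map` over indices with a plain index loop avoids lambda, iterator, and boxing overhead. A minimal Java sketch of the general idea (an illustration only, not the actual Spark Scala code; `initWithMap`/`initWithLoop` are hypothetical names):

```java
import java.util.function.IntFunction;
import java.util.stream.IntStream;

public class RowInit {
    // Functional style: allocates stream machinery and boxes on every call.
    static Object[] initWithMap(int n, IntFunction<Object> f) {
        return IntStream.range(0, n).mapToObj(f).toArray();
    }

    // Hot-path style: a plain while loop writes straight into a preallocated array.
    static Object[] initWithLoop(int n, IntFunction<Object> f) {
        Object[] values = new Object[n];
        int i = 0;
        while (i < n) {
            values[i] = f.apply(i);
            i += 1;
        }
        return values;
    }
}
```

Both variants produce the same array; the loop version simply does less allocation per construction, which matters when rows are created millions of times.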
[GitHub] [spark] cloud-fan commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on
cloud-fan commented on a change in pull request #29455: URL: https://github.com/apache/spark/pull/29455#discussion_r472731516 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ## @@ -83,15 +83,18 @@ trait ShuffleExchangeLike extends Exchange { case class ShuffleExchangeExec( override val outputPartitioning: Partitioning, child: SparkPlan, -noUserSpecifiedNumPartition: Boolean = true) extends ShuffleExchangeLike { +noUserSpecifiedNumPartition: Boolean = true, +isNullAwareAntiJoin: Boolean = false) extends ShuffleExchangeLike { Review comment: This is too hacky, we shouldn't pollute the general shuffle node for one specific case. I think we need to develop a flexible stats collection framework for AQE first. NAAJ is just one use case: it asks AQE to collect null count.
[GitHub] [spark] LuciferYang commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode
LuciferYang commented on a change in pull request #29000: URL: https://github.com/apache/spark/pull/29000#discussion_r472729801 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala ## @@ -164,4 +165,48 @@ class PartitionedWriteSuite extends QueryTest with SharedSparkSession { assert(e.getMessage.contains("Found duplicate column(s) b, b: `b`;")) } } + + test("SPARK-27194 SPARK-29302: Fix commit collision in dynamic partition overwrite mode") { +withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> + SQLConf.PartitionOverwriteMode.DYNAMIC.toString, + SQLConf.FILE_COMMIT_PROTOCOL_CLASS.key -> +classOf[PartitionFileExistCommitProtocol].getName) { + withTempDir { d => +withTable("t") { + sql( +s""" + | create table t(c1 int, p1 int) using parquet partitioned by (p1) + | location '${d.getAbsolutePath}' +""".stripMargin) + + val df = Seq((1, 2)).toDF("c1", "p1") + df.write +.partitionBy("p1") +.mode("overwrite") +.saveAsTable("t") + checkAnswer(sql("select * from t"), df) +} + } +} + } +} + +/** + * A file commit protocol with pre-created partition file. when try to overwrite partition dir + * in dynamic partition mode, FileAlreadyExist exception would raise without SPARK-31968 + */ +private class PartitionFileExistCommitProtocol( +jobId: String, +path: String, +dynamicPartitionOverwrite: Boolean) + extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) { + override def setupJob(jobContext: JobContext): Unit = { +super.setupJob(jobContext) +val stagingDir = new File(path, s".spark-staging-$jobId") Review comment: @WinkerDu The `path` string is in URI format, like `file:/xxx`. Currently the path of `conflictTaskFile` is created as ``` /xxx/spark-warehouse/org.apache.spark.sql.sources.PartitionedWriteSuite/t/file:/xxx/spark-warehouse/org.apache.spark.sql.sources.PartitionedWriteSuite/t/.spark-staging-${jobId}/p1=2/part-0-${jobId}.c000.snappy.parquet ``` which is wrong. We should strip the `file:` prefix or convert it to a local filesystem path, so that `conflictTaskFile` is created as ``` /xxx/spark-warehouse/org.apache.spark.sql.sources.PartitionedWriteSuite/t/.spark-staging-${jobId}/p1=2/part-0-${jobId}.c000.snappy.parquet ```
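The fix suggested above — normalizing a `file:`-qualified URI string to a plain local path before joining it with a child path — can be sketched in plain Java. This is a hypothetical helper for illustration (`toLocalPath`/`stagingDir` are invented names; the actual patch would likely use Hadoop's `Path` APIs instead):

```java
import java.io.File;
import java.net.URI;

public class LocalPath {
    // Strip a URI scheme such as "file:" so that new File(parent, child)
    // does not embed "file:/..." as a literal directory name inside the parent.
    static String toLocalPath(String path) {
        URI uri = URI.create(path);
        // No scheme: the string is already a plain filesystem path.
        return uri.getScheme() == null ? path : uri.getPath();
    }

    // Join the normalized warehouse path with the staging directory name.
    static File stagingDir(String warehousePath, String jobId) {
        return new File(toLocalPath(warehousePath), ".spark-staging-" + jobId);
    }
}
```

With the scheme stripped, `stagingDir("file:/tmp/wh/t", "job1")` yields `/tmp/wh/t/.spark-staging-job1` rather than a path with a literal `file:` segment in the middle.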
[GitHub] [spark] cloud-fan commented on pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on
cloud-fan commented on pull request #29455: URL: https://github.com/apache/spark/pull/29455#issuecomment-675866532 I think the patent is about a specific way to optimize null aware anti join, not the general idea. We are definitely different from Oracle as Spark is distributed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
AmplabJenkins removed a comment on pull request #29437: URL: https://github.com/apache/spark/pull/29437#issuecomment-675862619
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
AmplabJenkins removed a comment on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675862630
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
AmplabJenkins removed a comment on pull request #29469: URL: https://github.com/apache/spark/pull/29469#issuecomment-675862636
[GitHub] [spark] AmplabJenkins commented on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
AmplabJenkins commented on pull request #29469: URL: https://github.com/apache/spark/pull/29469#issuecomment-675862636
[GitHub] [spark] AmplabJenkins commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
AmplabJenkins commented on pull request #29437: URL: https://github.com/apache/spark/pull/29437#issuecomment-675862619
[GitHub] [spark] AmplabJenkins commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
AmplabJenkins commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675862630
[GitHub] [spark] SparkQA removed a comment on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
SparkQA removed a comment on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675852755 **[Test build #127621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127621/testReport)** for PR 29470 at commit [`d9933b7`](https://github.com/apache/spark/commit/d9933b7e4e8fdcb49faed0458559f72d7a313f1c).
[GitHub] [spark] SparkQA commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
SparkQA commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675862353 **[Test build #127621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127621/testReport)** for PR 29470 at commit [`d9933b7`](https://github.com/apache/spark/commit/d9933b7e4e8fdcb49faed0458559f72d7a313f1c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
SparkQA commented on pull request #29469: URL: https://github.com/apache/spark/pull/29469#issuecomment-675862342 **[Test build #127623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127623/testReport)** for PR 29469 at commit [`dd3e558`](https://github.com/apache/spark/commit/dd3e558073fb85d96f1ba096c231b04d86274fc4).
[GitHub] [spark] SparkQA commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
SparkQA commented on pull request #29437: URL: https://github.com/apache/spark/pull/29437#issuecomment-675862392 **[Test build #127624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127624/testReport)** for PR 29437 at commit [`6f147ee`](https://github.com/apache/spark/commit/6f147ee0ac20af8142a3dc715a9dbc7952f99265).
[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472718669 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala ## @@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap case class TextTable( name: String, sparkSession: SparkSession, -options: CaseInsensitiveStringMap, +originalOptions: CaseInsensitiveStringMap, Review comment: Fixed. ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala ## @@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister { paths ++ Option(map.get("path")).toSeq } + protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): CaseInsensitiveStringMap = { Review comment: Thanks!
[GitHub] [spark] brkyvz commented on pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
brkyvz commented on pull request #29469: URL: https://github.com/apache/spark/pull/29469#issuecomment-675861260 @cloud-fan addressed your comments
[GitHub] [spark] HyukjinKwon commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
HyukjinKwon commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675860612 BTW your new profile pic looks nice @srowen!
[GitHub] [spark] HyukjinKwon closed pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
HyukjinKwon closed pull request #29470: URL: https://github.com/apache/spark/pull/29470
[GitHub] [spark] HyukjinKwon commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
HyukjinKwon commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675859877 Merged to master.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675858657
[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
AmplabJenkins commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675858657
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675857879 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127613/ Test FAILed.
[GitHub] [spark] viirya commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
viirya commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472711665 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala ## @@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap case class TextTable( name: String, sparkSession: SparkSession, -options: CaseInsensitiveStringMap, +originalOptions: CaseInsensitiveStringMap, Review comment: We can revert this back like other `FileTable`.
[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
SparkQA commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675858327 **[Test build #127622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127622/testReport)** for PR 29465 at commit [`84846a8`](https://github.com/apache/spark/commit/84846a873658fa609148e208f403d2e010e4114b).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675857822
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
HyukjinKwon commented on a change in pull request #29465: URL: https://github.com/apache/spark/pull/29465#discussion_r472710508 ## File path: .github/workflows/build_and_test.yml ## @@ -0,0 +1,241 @@ +name: Build and test + +on: + push: +branches: +- branch-2.4 + pull_request: +branches: +- branch-2.4 + +jobs: + # Build: build Spark and run the tests for specified modules. + build: +name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }} (JDK ${{ matrix.java }}, ${{ matrix.hadoop }})" +runs-on: ubuntu-latest +strategy: + fail-fast: false + matrix: +java: + - 1.8 +hadoop: + - hadoop2.6 +# TODO(SPARK-32246): We don't test 'streaming-kinesis-asl' for now. +# Kinesis tests depends on external Amazon kinesis service. +# Note that the modules below are from sparktestsupport/modules.py. +modules: + - >- +core, unsafe, kvstore, avro, +network-common, network-shuffle, repl, launcher, +examples, sketch, graphx + - >- +catalyst, hive-thriftserver + - >- +streaming, sql-kafka-0-10, streaming-kafka-0-10, +mllib-local, mllib, +yarn, mesos, kubernetes, hadoop-cloud, spark-ganglia-lgpl, +streaming-flume, streaming-flume-sink, streaming-kafka-0-8 + - >- +pyspark-sql, pyspark-mllib + - >- +pyspark-core, pyspark-streaming, pyspark-ml + - >- +sparkr + - >- +sql +# Here, we split Hive and SQL tests into some of slow ones and the rest of them. +included-tags: [""] +excluded-tags: [""] +comment: [""] +include: + # Hive tests + - modules: hive +java: 1.8 +hadoop: hadoop2.6 +included-tags: org.apache.spark.tags.SlowHiveTest +comment: "- slow tests" + - modules: hive +java: 1.8 +hadoop: hadoop2.6 +excluded-tags: org.apache.spark.tags.SlowHiveTest +comment: "- other tests" +env: + MODULES_TO_TEST: ${{ matrix.modules }} + EXCLUDED_TAGS: ${{ matrix.excluded-tags }} + INCLUDED_TAGS: ${{ matrix.included-tags }} + HADOOP_PROFILE: ${{ matrix.hadoop }} + # GitHub Actions' default miniconda to use in pip packaging test. 
+ CONDA_PREFIX: /usr/share/miniconda + GITHUB_PREV_SHA: ${{ github.event.before }} + ARROW_PRE_0_15_IPC_FORMAT: 1 +steps: +- name: Checkout Spark repository + uses: actions/checkout@v2 + # In order to fetch changed files + with: +fetch-depth: 0 +# Cache local repositories. Note that GitHub Actions cache has a 2G limit. +- name: Cache Scala, SBT, Maven and Zinc + uses: actions/cache@v1 + with: +path: build +key: build-${{ hashFiles('**/pom.xml') }} +restore-keys: | + build- +- name: Cache Maven local repository + uses: actions/cache@v2 + with: +path: ~/.m2/repository +key: ${{ matrix.java }}-${{ matrix.hadoop }}-maven-${{ hashFiles('**/pom.xml') }} +restore-keys: | + ${{ matrix.java }}-${{ matrix.hadoop }}-maven- +- name: Cache Ivy local repository + uses: actions/cache@v2 + with: +path: ~/.ivy2/cache +key: ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-${{ hashFiles('**/pom.xml') }}-${{ hashFiles('**/plugins.sbt') }} +restore-keys: | + ${{ matrix.java }}-${{ matrix.hadoop }}-ivy- +- name: Install JDK ${{ matrix.java }} + uses: actions/setup-java@v1 + with: +java-version: ${{ matrix.java }} +# PySpark +- name: Install PyPy3 + # Note that order of Python installations here matters because default python is + # overridden. + uses: actions/setup-python@v2 + if: contains(matrix.modules, 'pyspark') + with: +python-version: pypy3 +architecture: x64 +- name: Install Python 3.6 + uses: actions/setup-python@v2 + if: contains(matrix.modules, 'pyspark') + with: +python-version: 3.6 +architecture: x64 +- name: Install Python 2.7 + uses: actions/setup-python@v2 + # Yarn has a Python specific test too, for example, YarnClusterSuite. + if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')) + with: +python-version: 2.7 +architecture: x64 +- name: Install Python packages (Python 3.6 and PyPy3) + if: contains(matrix.modules, 'pyspark') + # PyArrow is not supported in PyPy yet, see ARROW-2651. 
+ # TODO(SPARK-32247): scipy installation with PyPy fails for an unknown reason. + run: | +python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner +
[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
AmplabJenkins commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675857822
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
HyukjinKwon commented on a change in pull request #29465: URL: https://github.com/apache/spark/pull/29465#discussion_r472709812 ## File path: .github/workflows/build_and_test.yml ## @@ -0,0 +1,241 @@ +name: Build and test + +on: + push: +branches: +- branch-2.4 + pull_request: +branches: +- branch-2.4 + +jobs: + # Build: build Spark and run the tests for specified modules. + build: +name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }} (JDK ${{ matrix.java }}, ${{ matrix.hadoop }})" +runs-on: ubuntu-latest +strategy: + fail-fast: false + matrix: +java: + - 1.8 +hadoop: + - hadoop2.6 +# TODO(SPARK-32246): We don't test 'streaming-kinesis-asl' for now. +# Kinesis tests depends on external Amazon kinesis service. +# Note that the modules below are from sparktestsupport/modules.py. +modules: + - >- +core, unsafe, kvstore, avro, +network-common, network-shuffle, repl, launcher, +examples, sketch, graphx + - >- +catalyst, hive-thriftserver + - >- +streaming, sql-kafka-0-10, streaming-kafka-0-10, +mllib-local, mllib, +yarn, mesos, kubernetes, hadoop-cloud, spark-ganglia-lgpl, +streaming-flume, streaming-flume-sink, streaming-kafka-0-8 + - >- +pyspark-sql, pyspark-mllib + - >- +pyspark-core, pyspark-streaming, pyspark-ml + - >- +sparkr + - >- +sql +# Here, we split Hive and SQL tests into some of slow ones and the rest of them. +included-tags: [""] +excluded-tags: [""] +comment: [""] +include: + # Hive tests + - modules: hive +java: 1.8 +hadoop: hadoop2.6 +included-tags: org.apache.spark.tags.SlowHiveTest +comment: "- slow tests" + - modules: hive +java: 1.8 +hadoop: hadoop2.6 +excluded-tags: org.apache.spark.tags.SlowHiveTest +comment: "- other tests" +env: + MODULES_TO_TEST: ${{ matrix.modules }} + EXCLUDED_TAGS: ${{ matrix.excluded-tags }} + INCLUDED_TAGS: ${{ matrix.included-tags }} + HADOOP_PROFILE: ${{ matrix.hadoop }} + # GitHub Actions' default miniconda to use in pip packaging test. 
+      CONDA_PREFIX: /usr/share/miniconda
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      ARROW_PRE_0_15_IPC_FORMAT: 1
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to fetch changed files
+      with:
+        fetch-depth: 0
+    # Cache local repositories. Note that GitHub Actions cache has a 2G limit.
+    - name: Cache Scala, SBT, Maven and Zinc
+      uses: actions/cache@v1
+      with:
+        path: build
+        key: build-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          build-
+    - name: Cache Maven local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.m2/repository
+        key: ${{ matrix.java }}-${{ matrix.hadoop }}-maven-${{ hashFiles('**/pom.xml') }}
+        restore-keys: |
+          ${{ matrix.java }}-${{ matrix.hadoop }}-maven-
+    - name: Cache Ivy local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.ivy2/cache
+        key: ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-${{ hashFiles('**/pom.xml') }}-${{ hashFiles('**/plugins.sbt') }}
+        restore-keys: |
+          ${{ matrix.java }}-${{ matrix.hadoop }}-ivy-
+    - name: Install JDK ${{ matrix.java }}
+      uses: actions/setup-java@v1
+      with:
+        java-version: ${{ matrix.java }}
+    # PySpark
+    - name: Install PyPy3
+      # Note that order of Python installations here matters because default python is
+      # overridden.
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: pypy3
+        architecture: x64
+    - name: Install Python 3.6
+      uses: actions/setup-python@v2
+      if: contains(matrix.modules, 'pyspark')
+      with:
+        python-version: 3.6
+        architecture: x64
+    - name: Install Python 2.7
+      uses: actions/setup-python@v2
+      # Yarn has a Python specific test too, for example, YarnClusterSuite.
+      if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
+      with:
+        python-version: 2.7
+        architecture: x64
+    - name: Install Python packages (Python 3.6 and PyPy3)
+      if: contains(matrix.modules, 'pyspark')
+      # PyArrow is not supported in PyPy yet, see ARROW-2651.
+      # TODO(SPARK-32247): scipy installation with PyPy fails for an unknown reason.
+      run: |
+        python3.6 -m pip install numpy pyarrow pandas scipy xmlrunner
[GitHub] [spark] HyukjinKwon commented on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
HyukjinKwon commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675857451 It should be ready to be reviewed or merged now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon removed a comment on pull request #29465: [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4
HyukjinKwon removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675837138
[GitHub] [spark] c21 commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on
c21 commented on a change in pull request #29455: URL: https://github.com/apache/spark/pull/29455#discussion_r472706333

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateNullAwareAntiJoin.scala
## @@ -20,22 +20,50 @@ package org.apache.spark.sql.execution.adaptive

 import org.apache.spark.sql.catalyst.planning.ExtractSingleColumnNullAwareAntiJoin
 import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
 import org.apache.spark.sql.execution.joins.EmptyHashedRelationWithAllNullKeys
+import org.apache.spark.sql.execution.metric.SQLShuffleWriteMetricsReporter.SHUFFLE_RECORDS_WRITTEN

 /**
- * This optimization rule detects and convert a NAAJ to an Empty LocalRelation
- * when buildSide is EmptyHashedRelationWithAllNullKeys.
+ * This optimization rule try to Eliminate NAAJ by following patterns.
+ * 1. Convert a NAAJ to an Empty LocalRelation when buildSide is EmptyHashedRelationWithAllNullKeys.
+ * 2. Convert a NAAJ to left plan when buildSide is empty.
  */
 object EliminateNullAwareAntiJoin extends Rule[LogicalPlan] {

-  private def canEliminate(plan: LogicalPlan): Boolean = plan match {
+  private def canConvertToEmptyLocalRelation(plan: LogicalPlan): Boolean = plan match {
     case LogicalQueryStage(_, stage: BroadcastQueryStageExec) if stage.resultOption.get().isDefined
       && stage.broadcast.relationFuture.get().value == EmptyHashedRelationWithAllNullKeys => true
+    case LogicalQueryStage(_, stage: ShuffleQueryStageExec) if stage.resultOption.get().isDefined =>
+      // Equivalent to EmptyHashedRelationWithAllNullKeys
+      val nullPartitionKeyMetrics = stage.shuffle.asInstanceOf[ShuffleExchangeExec]

Review comment: just +1 with @agrawaldevesh's comment. I kind of feel that aggregated `SQLMetrics` can only be approximate (e.g. with failed/speculated tasks), so it seems quite dangerous to depend on the exact value of `SQLMetrics` here (e.g. > 0 or == 0 below), which could become a correctness concern.
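For readers unfamiliar with NAAJ, the special-casing under discussion follows SQL `NOT IN` semantics: a single NULL key on the build side empties the whole result, and an empty build side lets every non-null left row through. A minimal illustrative sketch over plain Python lists (using `None` for SQL NULL; this is a stand-in for the semantics, not Spark's implementation):

```python
def null_aware_anti_join(left_keys, build_keys):
    """Single-column null-aware anti join (NOT IN) over plain lists.

    SQL semantics: if the build side contains a NULL key, `x NOT IN (...)`
    can never be true, so the whole result is empty (the
    EmptyHashedRelationWithAllNullKeys case); a NULL key on the left side
    never qualifies either. An empty build side returns the left plan's
    non-null rows unchanged.
    """
    if any(k is None for k in build_keys):
        return []  # build side has a null key -> empty result
    build = set(build_keys)
    return [k for k in left_keys if k is not None and k not in build]
```

The two early-exit patterns the rule describes fall out directly: an all-null-keys build side yields an empty relation, and an empty build side reduces the join to the left plan.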
[GitHub] [spark] AmplabJenkins commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports
AmplabJenkins commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675853113
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports
AmplabJenkins removed a comment on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675853113
[GitHub] [spark] SparkQA commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports
SparkQA commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675852755 **[Test build #127621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127621/testReport)** for PR 29470 at commit [`d9933b7`](https://github.com/apache/spark/commit/d9933b7e4e8fdcb49faed0458559f72d7a313f1c).
[GitHub] [spark] cloud-fan commented on a change in pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
cloud-fan commented on a change in pull request #29467: URL: https://github.com/apache/spark/pull/29467#discussion_r472697970

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ObjectSerializerPruningSuite.scala
## @@ -107,4 +108,34 @@ class ObjectSerializerPruningSuite extends PlanTest {
       comparePlans(optimized, expected)
     }
   }
+
+  test("SPARK-32652: Prune nested serializers: RowEncoder") {
+    withSQLConf(SQLConf.SERIALIZER_NESTED_SCHEMA_PRUNING_ENABLED.key -> "true") {
+      val testRelation = LocalRelation('i.struct(StructType.fromDDL("a int, b string")), 'j.int)
+      val rowEncoder = RowEncoder(new StructType()
+        .add("i", new StructType().add("a", "int").add("b", "string"))
+        .add("j", "int"))
+      val serializerObject = CatalystSerde.serialize(
+        CatalystSerde.deserialize(testRelation)(rowEncoder))(rowEncoder)
+      val query = serializerObject.select($"i.a")
+      val optimized = Optimize.execute(query.analyze)
+
+      val prunedSerializer = serializerObject.serializer.head.transformDown {
+        case CreateNamedStruct(children) => CreateNamedStruct(children.take(2))
+      }.transformUp {
+        // Aligns null literal in `If` expression to make it resolvable.
+        case i @ If(invoke: Invoke, Literal(null, dt), ser) if invoke.functionName == "isNullAt" &&
+            !dt.sameType(ser.dataType) =>
+          i.copy(trueValue = Literal(null, ser.dataType))
+      }.asInstanceOf[NamedExpression]
+
+      // `name` in `GetStructField` affects `comparePlans`. Maybe we can ignore
+      // `name` in `GetStructField.equals`?

Review comment: Note: I just copied this comment from the existing test case above. If we can fix it, we should remove the comments from both test cases.
[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
cloud-fan commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472696179

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
## @@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister {
     paths ++ Option(map.get("path")).toSeq
   }

+  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): CaseInsensitiveStringMap = {

Review comment: nit:
```
val withoutPath = options.asCaseSensitiveMap().asScala.filterKeys { k =>
  !k.equalsIgnoreCase("path") && !k.equalsIgnoreCase("paths")
}
new CaseInsensitiveStringMap(withoutPath.asJava)
```
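The suggested helper boils down to a case-insensitive key filter over the options map. A minimal standalone sketch of the same idea (a plain dict stands in for Spark's `CaseInsensitiveStringMap`, purely for illustration):

```python
def without_path_options(options):
    """Drop 'path'/'paths' entries regardless of key casing; keep the rest."""
    return {k: v for k, v in options.items()
            if k.lower() not in ("path", "paths")}
```

The point of filtering before schema inference is that a stale `path` entry in the options map would otherwise be picked up again as an input path.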
[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
cloud-fan commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472694528

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala
## @@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
     name: String,
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,

Review comment: do we still need to rename it?
[GitHub] [spark] HyukjinKwon commented on pull request #29470: [MINOR][DOCS] Add KMeansSummary to pyspark clustering.py exports
HyukjinKwon commented on pull request #29470: URL: https://github.com/apache/spark/pull/29470#issuecomment-675850875 Oh actually we should add it into `python/docs/source/reference/pyspark.ml.rst` now. I know another missing one. Let me push it here and just merge.
[GitHub] [spark] cloud-fan commented on a change in pull request #29104: [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize
cloud-fan commented on a change in pull request #29104: URL: https://github.com/apache/spark/pull/29104#discussion_r472691173

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
## @@ -903,15 +910,61 @@ private[joins] object LongHashedRelation {
       if (!rowKey.isNullAt(0)) {
         val key = rowKey.getLong(0)
         map.append(key, unsafeRow)
+      } else if (isNullAware) {
+        return EmptyHashedRelationWithAllNullKeys
       }
     }
     map.optimize()
     new LongHashedRelation(numFields, map)
   }
 }

+/**
+ * Common trait with dummy implementation for NAAJ special HashedRelation
+ * EmptyHashedRelation
+ * EmptyHashedRelationWithAllNullKeys
+ */
+trait NullAwareHashedRelation extends HashedRelation with Externalizable {

+  override def get(key: InternalRow): Iterator[InternalRow] = {
+    throw new UnsupportedOperationException
+  }
+
+  override def getValue(key: InternalRow): InternalRow = {
+    throw new UnsupportedOperationException
+  }
+
+  override def keyIsUnique: Boolean = true
+
+  override def keys(): Iterator[InternalRow] = {
+    throw new UnsupportedOperationException
+  }
+
+  override def close(): Unit = {}
+
+  override def writeExternal(out: ObjectOutput): Unit = {}
+
+  override def readExternal(in: ObjectInput): Unit = {}
+
+  override def estimatedSize: Long = 0
+}
+
+/**
+ * A special HashedRelation indicating it was built from an empty input Iterator[InternalRow].
+ */
+object EmptyHashedRelation extends NullAwareHashedRelation {
+  override def asReadOnlyCopy(): EmptyHashedRelation.type = this
+}
+
+/**
+ * A special HashedRelation indicating it was built from a non-empty input Iterator[InternalRow]
+ * that contains an all-null-columns key.
+ */
+object EmptyHashedRelationWithAllNullKeys extends NullAwareHashedRelation {
+  override def asReadOnlyCopy(): EmptyHashedRelationWithAllNullKeys.type = this

Review comment: probably just remove `Empty` to make it `HashedRelationWithAllNullKeys`?
[GitHub] [spark] HyukjinKwon commented on pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
HyukjinKwon commented on pull request #29467: URL: https://github.com/apache/spark/pull/29467#issuecomment-675849620 Merged to master and branch-3.0.
[GitHub] [spark] HyukjinKwon closed pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
HyukjinKwon closed pull request #29467: URL: https://github.com/apache/spark/pull/29467
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29467: [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
HyukjinKwon commented on a change in pull request #29467: URL: https://github.com/apache/spark/pull/29467#discussion_r472689757

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ObjectSerializerPruningSuite.scala
## @@ -107,4 +108,34 @@ class ObjectSerializerPruningSuite extends PlanTest {
       comparePlans(optimized, expected)
     }
   }
+
+  test("SPARK-32652: Prune nested serializers: RowEncoder") {
+    withSQLConf(SQLConf.SERIALIZER_NESTED_SCHEMA_PRUNING_ENABLED.key -> "true") {
+      val testRelation = LocalRelation('i.struct(StructType.fromDDL("a int, b string")), 'j.int)
+      val rowEncoder = RowEncoder(new StructType()
+        .add("i", new StructType().add("a", "int").add("b", "string"))
+        .add("j", "int"))
+      val serializerObject = CatalystSerde.serialize(
+        CatalystSerde.deserialize(testRelation)(rowEncoder))(rowEncoder)
+      val query = serializerObject.select($"i.a")
+      val optimized = Optimize.execute(query.analyze)
+
+      val prunedSerializer = serializerObject.serializer.head.transformDown {
+        case CreateNamedStruct(children) => CreateNamedStruct(children.take(2))
+      }.transformUp {
+        // Aligns null literal in `If` expression to make it resolvable.
+        case i @ If(invoke: Invoke, Literal(null, dt), ser) if invoke.functionName == "isNullAt" &&
+            !dt.sameType(ser.dataType) =>
+          i.copy(trueValue = Literal(null, ser.dataType))
+      }.asInstanceOf[NamedExpression]
+
+      // `name` in `GetStructField` affects `comparePlans`. Maybe we can ignore
+      // `name` in `GetStructField.equals`?

Review comment: Let me just merge this. In the PR that actually ignores the equality, we can remove this comment and also the code clearing the name below.
[GitHub] [spark] c21 commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on
c21 commented on a change in pull request #29455: URL: https://github.com/apache/spark/pull/29455#discussion_r472680557

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
## @@ -83,15 +83,18 @@ trait ShuffleExchangeLike extends Exchange {
 case class ShuffleExchangeExec(
     override val outputPartitioning: Partitioning,
     child: SparkPlan,
-    noUserSpecifiedNumPartition: Boolean = true) extends ShuffleExchangeLike {
+    noUserSpecifiedNumPartition: Boolean = true,
+    isNullAwareAntiJoin: Boolean = false) extends ShuffleExchangeLike {

Review comment: Hi Devesh, I am totally with you from the benchmarking perspective and on avoiding customer surprise. I think we are discussing whether this kind of query pattern is popular or not. Renaming the `isNullAwareAntiJoin` parameter would alleviate my concern.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins removed a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-675844857
[GitHub] [spark] AmplabJenkins commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
AmplabJenkins commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-675844857
[GitHub] [spark] SparkQA commented on pull request #28953: [SPARK-32013][SQL] Support query execution before reading DataFrame and before/after writing DataFrame over JDBC
SparkQA commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-675844652 **[Test build #127620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127620/testReport)** for PR 28953 at commit [`13f0dfc`](https://github.com/apache/spark/commit/13f0dfc2078f0933be01338a0fbf4d69113a9f4e).
[GitHub] [spark] brkyvz commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
brkyvz commented on a change in pull request #29469: URL: https://github.com/apache/spark/pull/29469#discussion_r472671268

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/AlreadyPlannedSuite.scala
## @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.test.SharedSparkSession
+
+class AlreadyPlannedSuite extends SparkPlanTest with SharedSparkSession {
+
+  import testImplicits._
+
+  test("simple execution") {
+    val df = spark.range(10)
+    val planned = AlreadyPlanned.dataFrame(spark, df.queryExecution.sparkPlan)
+
+    checkAnswer(planned, identity, df.toDF().collect())
+  }
+
+  test("planning on top won't work - filter") {
+    val df = spark.range(10)
+    val planned = AlreadyPlanned.dataFrame(spark, df.queryExecution.sparkPlan)
+
+    val e = intercept[AnalysisException] {
+      planned.where('id < 5)

Review comment: yeah, you're right. This would actually prevent the source from adding modifications to the input DataFrame, e.g. attaching hidden columns or such. Let me fix this.
[GitHub] [spark] AmplabJenkins commented on pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos
AmplabJenkins commented on pull request #29453: URL: https://github.com/apache/spark/pull/29453#issuecomment-675841844
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos
AmplabJenkins removed a comment on pull request #29453: URL: https://github.com/apache/spark/pull/29453#issuecomment-675841844
[GitHub] [spark] brkyvz commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
brkyvz commented on a change in pull request #29469: URL: https://github.com/apache/spark/pull/29469#discussion_r472668735

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/AlreadyPlanned.scala
## @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
+
+/**
+ * A special node that allows skipping query planning all the way to physical execution. This node
+ * can be used when a query was planned to the physical level, but we had to go back to logical plan
+ * land for some reason (e.g. V1 DataSource write execution). This will allow the metrics, and the
+ * query plan to properly appear as part of the query execution.
+ */
+case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
+  override val output: Seq[Attribute] = physicalPlan.output
+  override lazy val resolved: Boolean = false

Review comment: that wasn't sufficient. We need to pass in the `executedPlan` for the Spark and executed plans. Otherwise one of the test suites failed with:
```
sbt.ForkMain$ForkError: java.lang.UnsupportedOperationException: null
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doProduce(WholeStageCodegenExec.scala:790)
```
[GitHub] [spark] SparkQA commented on pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos
SparkQA commented on pull request #29453: URL: https://github.com/apache/spark/pull/29453#issuecomment-675841543 **[Test build #127619 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127619/testReport)** for PR 29453 at commit [`b99ced4`](https://github.com/apache/spark/commit/b99ced462b8fd105a5a9fefd0ef6dc4e852d3257).
[GitHub] [spark] ulysses-you commented on a change in pull request #29453: [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos
ulysses-you commented on a change in pull request #29453: URL: https://github.com/apache/spark/pull/29453#discussion_r472667471

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
## @@ -3035,7 +3035,10 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
     val msg = intercept[AnalysisException] {
       sql("REFRESH FUNCTION md5")
     }.getMessage
-    assert(msg.contains("Cannot refresh builtin function"))
+    assert(msg.contains("Cannot refresh built-in function"))
+    intercept[NoSuchFunctionException] {

Review comment: added.
[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm
Ngone51 commented on a change in pull request #29452: URL: https://github.com/apache/spark/pull/29452#discussion_r472663605

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
## @@ -1051,15 +1049,19 @@
     logDebug("Task length threshold for speculation: " + threshold)
     for (tid <- runningTasksSet) {
       var speculated = checkAndSubmitSpeculatableTask(tid, time, threshold)
-      if (!speculated && tidToExecutorKillTimeMapping.contains(tid)) {
-        // Check whether this task will finish before the exectorKillTime assuming
-        // it will take medianDuration overall. If this task cannot finish within
-        // executorKillInterval, then this task is a candidate for speculation
-        val taskEndTimeBasedOnMedianDuration = taskInfos(tid).launchTime + medianDuration
-        val canExceedDeadline = tidToExecutorKillTimeMapping(tid) <
-          taskEndTimeBasedOnMedianDuration
-        if (canExceedDeadline) {
-          speculated = checkAndSubmitSpeculatableTask(tid, time, 0)
+      if (!speculated && executorDecommissionKillInterval.nonEmpty) {
+        val taskInfo = taskInfos(tid)
+        val decomState = sched.getExecutorDecommissionState(taskInfo.executorId)
+        if (decomState.nonEmpty) {
+          // Check whether this task will finish before the exectorKillTime assuming

Review comment: We don't have `exectorKillTime` anymore after removing `tidToExecutorKillTimeMapping`. Shall we reword the comment a little bit?
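The check being reviewed reduces to a simple deadline comparison: a running task becomes a speculation candidate when its projected finish time (launch time plus the median task duration) falls after the time the decommissioning executor will be killed. A sketch of just that rule, with hypothetical parameter names rather than Spark's actual code:

```python
def can_exceed_deadline(launch_time, median_duration, executor_kill_time):
    """True if a task, assumed to take median_duration overall, cannot
    finish before its executor is killed -- i.e. it should be considered
    for speculation even though it hasn't hit the usual threshold."""
    projected_end_time = launch_time + median_duration
    return executor_kill_time < projected_end_time
```

When this returns true, the scheduler retries `checkAndSubmitSpeculatableTask` with a threshold of 0, forcing a speculative copy on a healthy executor.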
[GitHub] [spark] LuciferYang commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
LuciferYang commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-675838627 Thx~ @HyukjinKwon
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
AmplabJenkins removed a comment on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-675838315
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675838241
[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
AmplabJenkins commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-675838315
[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675838241
[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
SparkQA commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675838032 **[Test build #127617 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127617/testReport)** for PR 29465 at commit [`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).
[GitHub] [spark] leanken commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on
leanken commented on a change in pull request #29455: URL: https://github.com/apache/spark/pull/29455#discussion_r472660831

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala

@@ -225,7 +231,8 @@ object ShuffleExchangeExec {
     outputAttributes: Seq[Attribute],
     newPartitioning: Partitioning,
     serializer: Serializer,
-    writeMetrics: Map[String, SQLMetric])
+    writeMetrics: Map[String, SQLMetric],
+    isNullAwareAntiJoin: Boolean = false)

Review comment: How about `collectNullPartitionKeyMetrics`?
[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
SparkQA commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-675838022 **[Test build #127618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127618/testReport)** for PR 29434 at commit [`36a317a`](https://github.com/apache/spark/commit/36a317abca5a861a7b0c27317c80d030020838ed).
[GitHub] [spark] HyukjinKwon commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
HyukjinKwon commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-675837861 retest this please
[GitHub] [spark] SparkQA removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
SparkQA removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675831629 **[Test build #127615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127615/testReport)** for PR 29465 at commit [`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675832423
[GitHub] [spark] HyukjinKwon commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
HyukjinKwon commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675837138 retest this please
[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm
Ngone51 commented on a change in pull request #29452: URL: https://github.com/apache/spark/pull/29452#discussion_r472657515

## File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala

@@ -18,11 +18,22 @@ package org.apache.spark.scheduler

 /**
- * Provides more detail when an executor is being decommissioned.
+ * Message providing more detail when an executor is being decommissioned.
  * @param message Human readable reason for why the decommissioning is happening.
  * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
  *                             being decommissioned too. Used to infer if the shuffle data might
  *                             be lost even if the external shuffle service is enabled.
  */
 private[spark] case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)
+
+/**
+ * State related to decommissioning that is kept by the TaskSchedulerImpl. This state is derived
+ * from the info message above but it is kept distinct to allow the state to evolve independently
+ * from the message.
+ */
+case class ExecutorDecommissionState(
+    message: String,
+    // Timestamp in milliseconds when decommissioning was triggered
+    tsMillis: Long,

Review comment: How about `startTime`?
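For illustration only, the state class might look as follows with the reviewer's suggested rename (`tsMillis` to `startTime`). This is a sketch of the proposal, not the merged Spark code:

```scala
// Illustrative sketch of ExecutorDecommissionState with the suggested rename.
// Fields mirror the hunk quoted above; `startTime` is the reviewer's proposed name.
case class ExecutorDecommissionState(
    message: String,
    // Timestamp in milliseconds when decommissioning was triggered
    startTime: Long)

object ExecutorDecommissionStateDemo {
  def main(args: Array[String]): Unit = {
    val state = ExecutorDecommissionState("node drained for maintenance", 1597730000000L)
    println(s"${state.message} @ ${state.startTime}")
  }
}
```

The rename keeps the unit implicit in the semantics ("when decommissioning was triggered") rather than encoding it in the field name.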
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
AmplabJenkins removed a comment on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675836455
[GitHub] [spark] AmplabJenkins commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
AmplabJenkins commented on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675836455
[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm
Ngone51 commented on a change in pull request #29452: URL: https://github.com/apache/spark/pull/29452#discussion_r472655476

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala

@@ -1051,15 +1049,19 @@ private[spark] class TaskSetManager(
     logDebug("Task length threshold for speculation: " + threshold)
     for (tid <- runningTasksSet) {
       var speculated = checkAndSubmitSpeculatableTask(tid, time, threshold)
-      if (!speculated && tidToExecutorKillTimeMapping.contains(tid)) {
-        // Check whether this task will finish before the exectorKillTime assuming
-        // it will take medianDuration overall. If this task cannot finish within
-        // executorKillInterval, then this task is a candidate for speculation
-        val taskEndTimeBasedOnMedianDuration = taskInfos(tid).launchTime + medianDuration
-        val canExceedDeadline = tidToExecutorKillTimeMapping(tid) <
-          taskEndTimeBasedOnMedianDuration
-        if (canExceedDeadline) {
-          speculated = checkAndSubmitSpeculatableTask(tid, time, 0)
+      if (!speculated && executorDecommissionKillInterval.nonEmpty) {

Review comment: I see, we don't have `tidToExecutorKillTimeMapping` anymore.
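The deadline check being reworked in the hunks above can be sketched Spark-free. This is an illustrative helper, not the real `TaskSetManager` code; the names are borrowed from the diff. A task is a speculation candidate when its projected end time (launch time plus the median task duration) falls past the executor's kill time:

```scala
// Illustrative, Spark-free sketch of the speculation-deadline check discussed above.
object SpeculationDeadlineSketch {
  // A task projected to finish (launchTime + medianDuration) after the executor's
  // kill time cannot complete on that executor, so it should be speculated.
  def canExceedDeadline(launchTime: Long, medianDuration: Long, executorKillTime: Long): Boolean = {
    val taskEndTimeBasedOnMedianDuration = launchTime + medianDuration
    executorKillTime < taskEndTimeBasedOnMedianDuration
  }
}
```

Called with a threshold of 0 (as in the diff's `checkAndSubmitSpeculatableTask(tid, time, 0)`), such a task is submitted for speculation immediately.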
[GitHub] [spark] SparkQA removed a comment on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
SparkQA removed a comment on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675791761 **[Test build #127607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127607/testReport)** for PR 29460 at commit [`511d611`](https://github.com/apache/spark/commit/511d611ad0e307ebe9b8bd06227241774d1ba25f).
[GitHub] [spark] SparkQA commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
SparkQA commented on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675835877

**[Test build #127607 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127607/testReport)** for PR 29460 at commit [`511d611`](https://github.com/apache/spark/commit/511d611ad0e307ebe9b8bd06227241774d1ba25f).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675832423
[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675832473
[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
SparkQA commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675832415

**[Test build #127615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127615/testReport)** for PR 29465 at commit [`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).

* This patch **fails build dependency tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] Ngone51 commented on a change in pull request #29452: [SPARK-32643][CORE] Consolidate state decommissioning in the TaskSchedulerImpl realm
Ngone51 commented on a change in pull request #29452: URL: https://github.com/apache/spark/pull/29452#discussion_r472653962

## File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala

@@ -18,11 +18,21 @@ package org.apache.spark.scheduler

 /**
- * Provides more detail when an executor is being decommissioned.
+ * Message providing more detail when an executor is being decommissioned.
  * @param message Human readable reason for why the decommissioning is happening.
  * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
  *                             being decommissioned too. Used to infer if the shuffle data might
  *                             be lost even if the external shuffle service is enabled.
  */
 private[spark] case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)
+
+/**
+ * State related to decommissioning that is kept by the TaskSchedulerImpl. This state is derived
+ * from the info message above but it is kept distinct to allow the state to evolve independently
+ * from the message.
+ */
+case class ExecutorDecommissionState(message: String,

Review comment: Yes, decoupling +1
[GitHub] [spark] SparkQA removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
SparkQA removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675793629 **[Test build #127608 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127608/testReport)** for PR 29465 at commit [`ed33409`](https://github.com/apache/spark/commit/ed334099b0ffd5fffa846e7651f706a659d29f1c).
[GitHub] [spark] Ngone51 commented on a change in pull request #29468: [SPARK-32653][CORE] Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl
Ngone51 commented on a change in pull request #29468: URL: https://github.com/apache/spark/pull/29468#discussion_r472645226

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala

@@ -1062,25 +1062,36 @@ private[spark] class TaskSchedulerImpl(
   }

   def getExecutorsAliveOnHost(host: String): Option[Set[String]] = synchronized {
-    hostToExecutors.get(host).map(_.toSet)
+    hostToExecutors.get(host).map(_.filterNot(isExecutorDecommissioned)).map(_.toSet)
   }

   def hasExecutorsAliveOnHost(host: String): Boolean = synchronized {
-    hostToExecutors.contains(host)
+    hostToExecutors.get(host)
+      .exists(executors => executors.exists(e => !isExecutorDecommissioned(e)))
   }

   def hasHostAliveOnRack(rack: String): Boolean = synchronized {
-    hostsByRack.contains(rack)
+    hostsByRack.get(rack)

Review comment: `hostToExecutors` is also used to handle executor removal and blacklisting. In other words, a decommissioned executor could also be removed or blacklisted at the same time, so removing the executor from `hostToExecutors` might affect those other places as well; I don't want to mix them together and make things complex. Besides, similar to `executorIdToRunningTaskIds`, a decommissioned executor could still have running tasks on it, so we don't eagerly remove it from `executorIdToRunningTaskIds` either.
[GitHub] [spark] Ngone51 commented on a change in pull request #29468: [SPARK-32653][CORE] Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl
Ngone51 commented on a change in pull request #29468: URL: https://github.com/apache/spark/pull/29468#discussion_r472651414

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala

@@ -1062,25 +1062,36 @@ private[spark] class TaskSchedulerImpl(
   }

   def getExecutorsAliveOnHost(host: String): Option[Set[String]] = synchronized {
-    hostToExecutors.get(host).map(_.toSet)
+    hostToExecutors.get(host).map(_.filterNot(isExecutorDecommissioned)).map(_.toSet)

Review comment: Hmm, I think this is more a matter of definition. What this PR tries to do is strengthen the definition of "active" for an executor or host: an active executor or host should not be decommissioned at the same time. This function returns the active executors on a host, so we should filter out decommissioned ones, which I think is quite reasonable. And this function is used by multiple tests.
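The "alive means not decommissioned" filtering discussed in the hunks above can be demonstrated with plain collections. This is a Spark-free sketch: the map and the decommissioned set are made-up stand-ins for the scheduler's internal fields, not the real `TaskSchedulerImpl` state:

```scala
// Spark-free sketch of the "alive on host" definition: decommissioned
// executors no longer count as alive. All data here is illustrative.
object AliveOnHostSketch {
  val hostToExecutors: Map[String, Set[String]] =
    Map("host1" -> Set("exec1", "exec2"), "host2" -> Set("exec3"))
  val decommissioned: Set[String] = Set("exec2", "exec3")

  def isExecutorDecommissioned(id: String): Boolean = decommissioned.contains(id)

  // Mirrors the changed getExecutorsAliveOnHost: filter before returning.
  def getExecutorsAliveOnHost(host: String): Option[Set[String]] =
    hostToExecutors.get(host).map(_.filterNot(isExecutorDecommissioned))

  // Mirrors the changed hasExecutorsAliveOnHost: at least one non-decommissioned executor.
  def hasExecutorsAliveOnHost(host: String): Boolean =
    hostToExecutors.get(host).exists(_.exists(e => !isExecutorDecommissioned(e)))
}
```

Note that the executors stay in the map (they may still have running tasks, per the comment above); only the "alive" view filters them out.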
[GitHub] [spark] gengliangwang closed pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation
gengliangwang closed pull request #29458: URL: https://github.com/apache/spark/pull/29458
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675832431 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127615/ Test FAILed.
[GitHub] [spark] leanken commented on a change in pull request #29455: [SPARK-32644][SQL] NAAJ support for ShuffleHashJoin when AQE is on
leanken commented on a change in pull request #29455: URL: https://github.com/apache/spark/pull/29455#discussion_r472649111

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

@@ -235,8 +235,13 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       .getOrElse(createJoinWithoutHint())

     case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) =>
-      Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight,
-        None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true))
+      if (!canBroadcastBySize(j.right, conf) && conf.adaptiveExecutionEnabled) {

Review comment:

> Sorry, I don't get why we don't worry about shuffled hash join OOM if the sub-query is way too big or skewed on certain keys. In general, shuffled hash join OOM is a common issue when people choose shuffled hash join blindly. Could we add a check to make sure the build size is not too big for a shuffled hash join?
>
> e.g.
>
> ```
> if (!canBroadcastBySize(j.right, conf) && canBuildLocalHashMapBySize(j.right, conf) && conf.adaptiveExecutionEnabled) {
>   ...
> }
> ```

Do you have any contact info so we can communicate offline? You can reach me on WeChat: "18518298234"
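The size gate quoted in that suggestion can be sketched as a pure predicate. Names and thresholds here are stand-ins; in Spark the real checks are `canBroadcastBySize` and `canBuildLocalHashMapBySize` evaluated against SQLConf thresholds, not this helper:

```scala
// Spark-free sketch of the proposed gating: take the shuffled null-aware anti
// join only when the build side is too big to broadcast but still small enough
// to hash locally, and AQE is enabled. Threshold values below are illustrative.
object NaajGateSketch {
  def useShuffledNaaj(
      buildSizeBytes: Long,
      broadcastThreshold: Long,
      localHashMapThreshold: Long,
      aqeEnabled: Boolean): Boolean = {
    val canBroadcastBySize = buildSizeBytes <= broadcastThreshold
    val canBuildLocalHashMapBySize = buildSizeBytes <= localHashMapThreshold
    !canBroadcastBySize && canBuildLocalHashMapBySize && aqeEnabled
  }
}
```

The upper bound is the point of the review comment: without `canBuildLocalHashMapBySize`, a huge or skewed build side could OOM the shuffled hash join.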
[GitHub] [spark] holdenk commented on pull request #29472: [SPARK-32655][K8S] Support appId/execId placeholder in K8s `spark.local.dir` conf
holdenk commented on pull request #29472: URL: https://github.com/apache/spark/pull/29472#issuecomment-675833177 Yeah, OK, I see the reason for the failure; that's my bad. I should have run the K8s integration tests on the other decommission test change PR since it indirectly impacted logging. Let me push a fix.
[GitHub] [spark] holdenk commented on pull request #29472: [SPARK-32655][K8S] Support appId/execId placeholder in K8s `spark.local.dir` conf
holdenk commented on pull request #29472: URL: https://github.com/apache/spark/pull/29472#issuecomment-675832017 Ok, thanks for the heads up. I'll look into this and the different PRs around test stuff tonight/tomorrow.
[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
SparkQA commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675832115

**[Test build #127608 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127608/testReport)** for PR 29465 at commit [`ed33409`](https://github.com/apache/spark/commit/ed334099b0ffd5fffa846e7651f706a659d29f1c).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
AmplabJenkins commented on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675832002
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
AmplabJenkins removed a comment on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675832002
[GitHub] [spark] gengliangwang commented on pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation
gengliangwang commented on pull request #29458: URL: https://github.com/apache/spark/pull/29458#issuecomment-675832059 Merging to master
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation
AmplabJenkins removed a comment on pull request #29458: URL: https://github.com/apache/spark/pull/29458#issuecomment-675831971
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675831952
[GitHub] [spark] AmplabJenkins commented on pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation
AmplabJenkins commented on pull request #29458: URL: https://github.com/apache/spark/pull/29458#issuecomment-675831971
[GitHub] [spark] AmplabJenkins commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
AmplabJenkins commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675831952
[GitHub] [spark] gengliangwang commented on a change in pull request #29458: [SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation
gengliangwang commented on a change in pull request #29458: URL: https://github.com/apache/spark/pull/29458#discussion_r472644080

## File path: docs/sql-migration-guide.md

@@ -36,6 +36,8 @@ license: |
 - In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.
+ - In Spark 3.1, when `spark.sql.ansi.enabled` is false, Spark always returns null if the sum of a decimal type column overflows. In Spark 3.0 or earlier, when `spark.sql.ansi.enabled` is false, the sum of a decimal type column may return null or an incorrect result, or even fail at runtime (depending on the actual query plan execution).
+

Review comment: @maropu Thanks
[GitHub] [spark] SparkQA commented on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
SparkQA commented on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675831629 **[Test build #127615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127615/testReport)** for PR 29465 at commit [`1393174`](https://github.com/apache/spark/commit/1393174703bdc7236d2c96ae9efdbaa8f8993f85).
[GitHub] [spark] SparkQA commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
SparkQA commented on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675831667 **[Test build #127616 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127616/testReport)** for PR 29460 at commit [`d029dba`](https://github.com/apache/spark/commit/d029dba3f72c1d94d9a1b560168c06f4ca21fd6d).
[GitHub] [spark] HyukjinKwon commented on pull request #29460: [SPARK-32249][3.0] Run Github Actions builds in other branches as well
HyukjinKwon commented on pull request #29460: URL: https://github.com/apache/spark/pull/29460#issuecomment-675831293 Should be ready to be reviewed or merged.
[GitHub] [spark] cloud-fan commented on a change in pull request #29469: [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
cloud-fan commented on a change in pull request #29469: URL: https://github.com/apache/spark/pull/29469#discussion_r472642251

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/AlreadyPlanned.scala

@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
+
+/**
+ * A special node that allows skipping query planning all the way to physical execution. This node
+ * can be used when a query was planned to the physical level, but we had to go back to logical plan
+ * land for some reason (e.g. V1 DataSource write execution). This will allow the metrics, and the
+ * query plan to properly appear as part of the query execution.
+ */
+case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
+  override val output: Seq[Attribute] = physicalPlan.output
+  override lazy val resolved: Boolean = false

Review comment: OK, I got it now. We want to forbid further operations on the DataFrame.
But we don't have to forbid it, right? How about

```
class AlreadyOptimizedExecution(plan: LogicalPlan) {
  override lazy val analyzed: LogicalPlan = plan
  override lazy val optimizedPlan: LogicalPlan = plan
}
```
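The idea in that suggestion (pin the lazily computed analysis and optimization phases to an already-prepared plan so re-analysis is skipped) can be sketched outside of Spark. This is a hypothetical Python model, not Spark's Scala classes; the class and method names only mirror the ones in the review comment, and plans are plain strings for illustration.

```python
from functools import cached_property

class QueryExecution:
    """Toy model of a query-execution object with lazy phases."""
    def __init__(self, plan):
        self.logical = plan

    @cached_property
    def analyzed(self):
        # Normally this runs the analyzer over the logical plan.
        return f"analyzed({self.logical})"

    @cached_property
    def optimized_plan(self):
        # Normally this runs the optimizer over the analyzed plan.
        return f"optimized({self.analyzed})"

class AlreadyOptimizedExecution(QueryExecution):
    """Both phases return the given plan untouched, analogous to
    overriding `analyzed` and `optimizedPlan` in the suggestion."""
    @cached_property
    def analyzed(self):
        return self.logical

    @cached_property
    def optimized_plan(self):
        return self.logical

print(QueryExecution("scan").optimized_plan)             # optimized(analyzed(scan))
print(AlreadyOptimizedExecution("scan").optimized_plan)  # scan
```

The design trade-off the thread debates is visible here: `AlreadyPlanned` with `resolved = false` makes further operations on the DataFrame fail fast, while the pinned-phases variant simply passes the prepared plan through unchanged.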
[GitHub] [spark] HyukjinKwon removed a comment on pull request #29465: [SPARK-32249][2.4] Run Github Actions builds in other branches as well
HyukjinKwon removed a comment on pull request #29465: URL: https://github.com/apache/spark/pull/29465#issuecomment-675792279 Now it should pass all tests. I will port back once I get a green