[GitHub] [spark] AmplabJenkins removed a comment on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
AmplabJenkins removed a comment on pull request #28840: URL: https://github.com/apache/spark/pull/28840#issuecomment-647291121 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
AmplabJenkins commented on pull request #28840: URL: https://github.com/apache/spark/pull/28840#issuecomment-647291121 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
SparkQA commented on pull request #28840: URL: https://github.com/apache/spark/pull/28840#issuecomment-647290651 **[Test build #124352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124352/testReport)** for PR 28840 at commit [`5d5fe71`](https://github.com/apache/spark/commit/5d5fe71f61d596062545a7e9cc0efd023840307f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] agrawaldevesh commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
agrawaldevesh commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r443318618 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -157,7 +158,8 @@ class BlockManagerMasterEndpoint( context.reply(true) case DecommissionBlockManagers(executorIds) => Review comment: I wanted to confirm my understanding that this PR does not do anything about shuffle files written by executors that have since finished ? That would be the External Shuffle Service's responsibility (if enabled), Right ? And that piece of the puzzle isn't implemented yet ? My question stems from how we handle this scenario: (Consider the standalone scheduler mode here) - Executor wrote some shuffle data and is enabled with external shuffle service - Executor finished normally and nothing is running on the worker. - DecommissionExecutor is not even called because the executor does not exist - Thus the DecommissionBlockManager isn't called and thus no migrations take place. Therefore fetch failures would happen for other tasks that try to read this executor's shuffle data (via the External Shuffle Service). I believe fixing this is not in the scope of this PR. Can you please check this understanding. Thanks. ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -489,6 +491,24 @@ class BlockManagerMasterEndpoint( storageLevel: StorageLevel, memSize: Long, diskSize: Long): Boolean = { +logInfo(s"Updating block info on master ${blockId} for ${blockManagerId}") + +if (blockId.isInternalShuffle) { + blockId match { +case ShuffleIndexBlockId(shuffleId, mapId, _) => + // Don't update the map output on just the index block + logDebug(s"Received shuffle index block update for ${shuffleId} ${mapId}, ignoring.") + return true +case ShuffleDataBlockId(shuffleId: Int, mapId: Long, reduceId: Int) => + logInfo(s"Received shuffle data block update for ${shuffleId} ${mapId}, updating.") Review comment: Should this be a logDebug ? ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala ## @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.storage + +import java.util.concurrent.ExecutorService + +import scala.collection.JavaConverters._ +import scala.collection.mutable +import scala.util.control.NonFatal + +import org.apache.spark._ +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config +import org.apache.spark.shuffle.{MigratableResolver, ShuffleBlockInfo} +import org.apache.spark.storage.BlockManagerMessages.ReplicateBlock +import org.apache.spark.util.ThreadUtils + +/** + * Class to handle block manager decommissioning retries. + * It creates a Thread to retry offloading all RDD cache and Shuffle blocks + */ +private[storage] class BlockManagerDecommissioner( + conf: SparkConf, + bm: BlockManager) extends Logging { + + private val maxReplicationFailuresForDecommission = +conf.get(config.STORAGE_DECOMMISSION_MAX_REPLICATION_FAILURE_PER_BLOCK) + + /** + * This runnable consumes any shuffle blocks in the queue for migration. This part of a + * producer/consumer where the main migration loop updates the queue of blocks to be migrated + * periodically. On migration failure, the current thread will reinsert the block for another + * thread to consume. Each thread migrates blocks to a different particular executor to avoid + * distribute the blocks as quickly as possible without overwhelming any particular executor. + * + * There is no preference for which peer a given block is migrated to. + * This is notable different than the RDD cache block migration (further down in this file) + * which uses the existing priority mechanism for determining where to replicate blocks to. + * Generally speaking cache blocks are less impactful as they normally represent narrow + * transformations and we normally have less cache
[GitHub] [spark] cloud-fan commented on pull request #28780: [SPARK-31952][SQL]Fix incorrect memory spill metric when doing Aggregate
cloud-fan commented on pull request #28780: URL: https://github.com/apache/spark/pull/28780#issuecomment-647288933 > number of in-memory bytes spilled what's the difference between it and `spill (disk)`? IIUC `spill` means we dump data from memory to disk. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
cloud-fan commented on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647286415 repartition by key is another story and should support partition coalesce. If it's not the case now, please create a bug report in the JIRA, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyE
AmplabJenkins removed a comment on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-647285293 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsExc
AmplabJenkins commented on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-647285293 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
SparkQA commented on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647285180 **[Test build #124351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124351/testReport)** for PR 28889 at commit [`62b3e79`](https://github.com/apache/spark/commit/62b3e79a49446ae8836567fbbdce4441a6e5c86a). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] prakharjain09 commented on a change in pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
prakharjain09 commented on a change in pull request #28619: URL: https://github.com/apache/spark/pull/28619#discussion_r443322288 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1842,6 +1842,17 @@ package object config { .timeConf(TimeUnit.MILLISECONDS) .createOptional + private[spark] val EXECUTOR_DECOMMISSION_KILL_INTERVAL = +ConfigBuilder("spark.executor.decommission.killInterval") + .doc("Duration after which a decommissioned executor will be killed forcefully." + +"This config is useful for cloud environments where we know in advance when " + +"an executor is going to go down after decommissioning signal i.e. around 2 mins " + +"in aws spot nodes, 1/2 hrs in spot block nodes etc. This config is currently " + Review comment: @cloud-fan As per my understanding, Worker Decommissioning is getting triggered currently using SIGPWR signal (and not via some message coming from YARN/Kubernetes Cluster manager). So getting this timeout from Spark Cluster Manager might not be possible. We might be able to do this once Spark's Worker Decommissioning logic starts triggering via communication from YARN etc in future. cc @holdenk This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsE
SparkQA removed a comment on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-647229068 **[Test build #124343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124343/testReport)** for PR 26339 at commit [`717d9a5`](https://github.com/apache/spark/commit/717d9a56faf128e4864844cdf6341cb7b8731307). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsException
SparkQA commented on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-647284372 **[Test build #124343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124343/testReport)** for PR 26339 at commit [`717d9a5`](https://github.com/apache/spark/commit/717d9a56faf128e4864844cdf6341cb7b8731307). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
AmplabJenkins commented on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647282993 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
AmplabJenkins removed a comment on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647282993 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28890: [MINOR][SQL] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predic
AmplabJenkins removed a comment on pull request #28890: URL: https://github.com/apache/spark/pull/28890#issuecomment-647282918 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
AmplabJenkins removed a comment on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647268689 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28890: [MINOR][SQL] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate
AmplabJenkins commented on pull request #28890: URL: https://github.com/apache/spark/pull/28890#issuecomment-647282918 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
HyukjinKwon commented on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647282568 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28890: [MINOR][SQL] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate
SparkQA commented on pull request #28890: URL: https://github.com/apache/spark/pull/28890#issuecomment-647282523 **[Test build #124350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124350/testReport)** for PR 28890 at commit [`1f915e4`](https://github.com/apache/spark/commit/1f915e41e8b02e5a714c6341c925da37a10ec8a1). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on pull request #28890: [SQL][MINOR] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate
dilipbiswal commented on pull request #28890: URL: https://github.com/apache/spark/pull/28890#issuecomment-647281805 cc @HyukjinKwon This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle
viirya commented on pull request #28846: URL: https://github.com/apache/spark/pull/28846#issuecomment-647281932 @cloud-fan Thanks for clarifying. The idea sounds worth exploring as we can avoid local shuffle under current parallelism design of independent stages in AQE. I will explore the possibility. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal opened a new pull request #28890: [SQL][MINOR] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate
dilipbiswal opened a new pull request #28890: URL: https://github.com/apache/spark/pull/28890 ### What changes were proposed in this pull request? A minor PR that adds a couple of usage examples for ArrayFilter and ArrayExists that shows how to deal with NULL data. ### Why are the changes needed? Enhances the examples that shows how to filter out null values from an array and also to test if null value exists in an array. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Tested in command shell. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dbtsai commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
dbtsai commented on a change in pull request #27366: URL: https://github.com/apache/spark/pull/27366#discussion_r443316503 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonFiltersSuite.scala ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.json + +import org.apache.spark.sql.catalyst.{StructFilters, StructFiltersSuite} +import org.apache.spark.sql.sources +import org.apache.spark.sql.types.StructType + +class JsonFiltersSuite extends StructFiltersSuite { + def createFilters(filters: Seq[sources.Filter], schema: StructType): StructFilters = { +new JsonFilters(filters, schema) + } Review comment: Do you think it's possible to implement nested situations? Although JSON is not columnar format, but if we are able to push down nested predicate, we will be able to avoid de-serialize unnecessary data. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dbtsai commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
dbtsai commented on a change in pull request #27366: URL: https://github.com/apache/spark/pull/27366#discussion_r443316503 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonFiltersSuite.scala ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.json + +import org.apache.spark.sql.catalyst.{StructFilters, StructFiltersSuite} +import org.apache.spark.sql.sources +import org.apache.spark.sql.types.StructType + +class JsonFiltersSuite extends StructFiltersSuite { + def createFilters(filters: Seq[sources.Filter], schema: StructType): StructFilters = { +new JsonFilters(filters, schema) + } Review comment: Do you think it's possible to extend to nested cases? Although JSON is not columnar format, but if we are able to push down nested predicate, we will be able to avoid de-serialize unnecessary data. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] koertkuipers edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
koertkuipers edited a comment on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612 @cloud-fan so how can i repartition by a column while the number of partitions is set smartly (based on data size) instead of using some user specified number of partitions or hardcoded value? repartitioning a dataframe by columns is fairly typical before writing to a partitioned file sink to avoid too many files per directory. see for example: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L472 in these situations its beneficial to write out the optimal number of files, not a fixed/hardcoded number... and personally for repartition i would expect the optimal number of files to be written if AQE is enabled and i did not specify the number of partitions. thats why i was so confused by the current results. but thats just my opinion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle
cloud-fan commented on pull request #28846: URL: https://github.com/apache/spark/pull/28846#issuecomment-647270331 And as @maryannxue said, you may trigger the large side first and it doesn't make sense to hold off. Ideally we should trigger both sides and cancel the large side if the small side completes very quickly. It will be great if you can explore the cancelation approach. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle
cloud-fan commented on pull request #28846: URL: https://github.com/apache/spark/pull/28846#issuecomment-647269450 > If first stage uses all resources, I think later stage still needs to held off? That's true, but that's an assumption. It's also possible that these 2 jobs indeed run together. > the speed-up of AQE is gained by triggering all stages (not holding off other stage as you said) together, or optimizing join from SMJ to BHJ (if we only consider join case) In the benchmark, the default parallelism takes all the CPU cores. I think the most perf gain should be from shuffle partition coalescing and SMJ -> BHJ. cc @JkSelf That said, by design AQE triggers all independent stages at the same time, to maximize the parallelism. And it's helpful if the resource is sufficient (or auto-scaling). I don't think we should change this design. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
AmplabJenkins commented on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647268689 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
AmplabJenkins removed a comment on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647268426 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
AmplabJenkins commented on pull request #28889: URL: https://github.com/apache/spark/pull/28889#issuecomment-647268426 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] koertkuipers edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
koertkuipers edited a comment on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612 @cloud-fan so how can i repartition by a column while the number of partitions is set smartly (based on data size) instead of using some user specified number of partitions or hardcoded value? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] koertkuipers edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
koertkuipers edited a comment on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612 @cloud-fan so how can i repartition by a column while the number of partitions is set smartly instead of using some hardcoded value? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] williamhyun opened a new pull request #28889: [SPARK-XXX][BUILD] Upgrade to Apache Commons Lang 3.10
williamhyun opened a new pull request #28889: URL: https://github.com/apache/spark/pull/28889 ### What changes were proposed in this pull request? This PR aims to upgrade to Apache Commons Lang 3.10. ### Why are the changes needed? This will bring the latest bug fixes. https://commons.apache.org/proper/commons-lang/release-notes/RELEASE-NOTES-3.10.txt ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] koertkuipers commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
koertkuipers commented on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612 @cloud-fan so how can i repartition by a column without having to specify the number of partitions? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
cloud-fan edited a comment on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647264492 note: `repartition` is a physical operation that users specify how many partitions to have after the shuffle. That's why Spark can't coalesce the partition, as it breaks the user's expectation. Other optimization can still happen, like SMJ to BHJ. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
cloud-fan commented on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647264492 note: `repartition` is a physical operation that users specify how many partitions to have after the shuffle. That's why Spark can't coalesce the partition. Other optimization can still happen, like SMJ to BHJ. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
AmplabJenkins removed a comment on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647264059 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
SparkQA commented on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647263966 **[Test build #124349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124349/testReport)** for PR 28880 at commit [`10b9f7e`](https://github.com/apache/spark/commit/10b9f7e3bbe3685ccd27e2d71ac50e40b17588f9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
AmplabJenkins commented on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647264059 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
SparkQA removed a comment on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647258478 **[Test build #124349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124349/testReport)** for PR 28880 at commit [`10b9f7e`](https://github.com/apache/spark/commit/10b9f7e3bbe3685ccd27e2d71ac50e40b17588f9). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
AmplabJenkins removed a comment on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647258791 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
AmplabJenkins commented on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647258791 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
SparkQA commented on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647258478 **[Test build #124349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124349/testReport)** for PR 28880 at commit [`10b9f7e`](https://github.com/apache/spark/commit/10b9f7e3bbe3685ccd27e2d71ac50e40b17588f9). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
AmplabJenkins removed a comment on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647257913 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
SparkQA commented on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647257837 **[Test build #124346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124346/testReport)** for PR 2 at commit [`23f3dda`](https://github.com/apache/spark/commit/23f3dda8b67c18716042c670ea5caa88eac68af1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
SparkQA removed a comment on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647250427 **[Test build #124346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124346/testReport)** for PR 2 at commit [`23f3dda`](https://github.com/apache/spark/commit/23f3dda8b67c18716042c670ea5caa88eac68af1). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
AmplabJenkins commented on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647257913 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
AmplabJenkins removed a comment on pull request #28869: URL: https://github.com/apache/spark/pull/28869#issuecomment-647254915 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
AmplabJenkins commented on pull request #28869: URL: https://github.com/apache/spark/pull/28869#issuecomment-647254915 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
SparkQA commented on pull request #28869: URL: https://github.com/apache/spark/pull/28869#issuecomment-647254624 **[Test build #124348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124348/testReport)** for PR 28869 at commit [`04a2d44`](https://github.com/apache/spark/commit/04a2d4485021b8be1aa66936e2ca071bd52990a6). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
Ngone51 commented on a change in pull request #28869: URL: https://github.com/apache/spark/pull/28869#discussion_r443302172 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/BaseAggregateExec.scala ## @@ -53,15 +53,31 @@ trait BaseAggregateExec extends UnaryExecNode { // can't bind the `mergeExpressions` with the output of the partial aggregate, as they use // the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Instead, // we shall use `inputAggBufferAttributes` after copy to match the new `mergeExpressions`. - val aggAttrs = aggregateExpressions -// there're exactly four cases needs `inputAggBufferAttributes` from child according to the -// agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final, -// Partial -> PartialMerge, PartialMerge -> PartialMerge. -.filter(a => a.mode == Final || a.mode == PartialMerge).map(_.aggregateFunction) -.flatMap(_.inputAggBufferAttributes) + val aggAttrs = inputAggBufferAttributes child.output.dropRight(aggAttrs.length) ++ aggAttrs } else { child.output } } + + private val inputAggBufferAttributes: Seq[Attribute] = { +aggregateExpressions + // there're exactly four cases needs `inputAggBufferAttributes` from child according to the + // agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final, + // Partial -> PartialMerge, PartialMerge -> PartialMerge. + .filter(a => a.mode == Final || a.mode == PartialMerge) + .flatMap(_.aggregateFunction.inputAggBufferAttributes) + } + + protected val aggregateBufferAttributes: Seq[AttributeReference] = { +aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes) + } + + override def producedAttributes: AttributeSet = +AttributeSet(aggregateAttributes) ++ + AttributeSet(resultExpressions.diff(groupingExpressions).map(_.toAttribute)) ++ +AttributeSet(aggregateBufferAttributes) ++ +// it's not empty when the inputAggBufferAttributes is from the child Aggregate, which contains +// subquery in AggregateFunction. See SPARK-31620 for more details. +AttributeSet(inputAggBufferAttributes.filterNot(child.output.contains)) Review comment: Oh..the comment should actually be: `it's not empty when the inputAggBufferAttributes is not equal to the aggerate buffer attributes of child Aggregate, when the child Aggregate contains the subquery in AggregateFunction.` After SPARK-31620, the inputAggBufferAttributes is not from children but the node itself when there're Final/PartialMerge aggregate expressions, see `inputAttributes()` above. (But please note that agg attributes are still the same between the parent agg node and child agg node when there's no subquery in agg expression) Therefore, in the case of SPARK-31620, we actually use the attributes produced by the node itself. But for other cases, we still use the agg buffer attributes from the children, since `inputAggBufferAttributes` is equal to the agg buffer attributes from the children and so `inputAggBufferAttributes.filterNot(child.output.contains)` is empty. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
Ngone51 commented on a change in pull request #28869: URL: https://github.com/apache/spark/pull/28869#discussion_r443302172 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/BaseAggregateExec.scala ## @@ -53,15 +53,31 @@ trait BaseAggregateExec extends UnaryExecNode { // can't bind the `mergeExpressions` with the output of the partial aggregate, as they use // the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Instead, // we shall use `inputAggBufferAttributes` after copy to match the new `mergeExpressions`. - val aggAttrs = aggregateExpressions -// there're exactly four cases needs `inputAggBufferAttributes` from child according to the -// agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final, -// Partial -> PartialMerge, PartialMerge -> PartialMerge. -.filter(a => a.mode == Final || a.mode == PartialMerge).map(_.aggregateFunction) -.flatMap(_.inputAggBufferAttributes) + val aggAttrs = inputAggBufferAttributes child.output.dropRight(aggAttrs.length) ++ aggAttrs } else { child.output } } + + private val inputAggBufferAttributes: Seq[Attribute] = { +aggregateExpressions + // there're exactly four cases needs `inputAggBufferAttributes` from child according to the + // agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final, + // Partial -> PartialMerge, PartialMerge -> PartialMerge. + .filter(a => a.mode == Final || a.mode == PartialMerge) + .flatMap(_.aggregateFunction.inputAggBufferAttributes) + } + + protected val aggregateBufferAttributes: Seq[AttributeReference] = { +aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes) + } + + override def producedAttributes: AttributeSet = +AttributeSet(aggregateAttributes) ++ + AttributeSet(resultExpressions.diff(groupingExpressions).map(_.toAttribute)) ++ +AttributeSet(aggregateBufferAttributes) ++ +// it's not empty when the inputAggBufferAttributes is from the child Aggregate, which contains +// subquery in AggregateFunction. See SPARK-31620 for more details. +AttributeSet(inputAggBufferAttributes.filterNot(child.output.contains)) Review comment: Oh..the comment should actually be: `it's not empty when the child Aggregate contains the subquery in AggregateFunction.` After SPARK-31620, the inputAggBufferAttributes is not from children but the node itself when there're Final/PartialMerge aggregate expressions, see `inputAttributes()` above. (But please note that agg attributes are still the same between the parent agg node and child agg node when there's no subquery in agg expression) Therefore, in the case of SPARK-31620, we actually use the attributes produced by the node itself. But for other cases, we still use the agg buffer attributes from the children, since `inputAggBufferAttributes` is equal to the agg buffer attributes from the children and so `inputAggBufferAttributes.filterNot(child.output.contains)` is empty. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28883: [SPARK-32042][SQL] Support UTF8String literals
HyukjinKwon commented on pull request #28883: URL: https://github.com/apache/spark/pull/28883#issuecomment-647251277 Seems like that's the codes you wrote .. not in the OSS Spark site. You should convert it to the regular Scala types such as `CatalystTypeConverters.createToScalaConverter`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
AmplabJenkins removed a comment on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647250806 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure
AmplabJenkins removed a comment on pull request #28866: URL: https://github.com/apache/spark/pull/28866#issuecomment-647250808 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
AmplabJenkins commented on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647250806 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure
AmplabJenkins commented on pull request #28866: URL: https://github.com/apache/spark/pull/28866#issuecomment-647250808 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure
SparkQA commented on pull request #28866: URL: https://github.com/apache/spark/pull/28866#issuecomment-647250445 **[Test build #124347 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124347/testReport)** for PR 28866 at commit [`d1742fe`](https://github.com/apache/spark/commit/d1742fe43be3f1f66a557f56ff15299215a59684). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
SparkQA commented on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647250427 **[Test build #124346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124346/testReport)** for PR 2 at commit [`23f3dda`](https://github.com/apache/spark/commit/23f3dda8b67c18716042c670ea5caa88eac68af1). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
yaooqinn commented on pull request #2: URL: https://github.com/apache/spark/pull/2#issuecomment-647249888 cc @dongjoon-hyun, thanks for reviewing this pr in advance This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn opened a new pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
yaooqinn opened a new pull request #2: URL: https://github.com/apache/spark/pull/2 … ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/28870 which ports https://issues.apache.org/jira/browse/HIVE-14817 for spark thrift server. ### Why are the changes needed? Port HIVE-14817 to fix related issues ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing Jenkins This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure
beliefer commented on a change in pull request #28866: URL: https://github.com/apache/spark/pull/28866#discussion_r443300023 ## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ## @@ -405,6 +405,15 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi runEvent(JobCancelled(jobId, None)) } + /** Make task set success and check result. */ + private def checkAnswer( +taskSet: TaskSet, +results: Seq[(TaskEndReason, Any)], Review comment: Let's update `results` to `taskEndInfos` for `complete` too. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure
beliefer commented on a change in pull request #28866: URL: https://github.com/apache/spark/pull/28866#discussion_r443299866 ## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ## @@ -405,6 +405,15 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi runEvent(JobCancelled(jobId, None)) } + /** Make task set success and check result. */ + private def checkAnswer( Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
AmplabJenkins removed a comment on pull request #28876: URL: https://github.com/apache/spark/pull/28876#issuecomment-647248573 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
AmplabJenkins commented on pull request #28876: URL: https://github.com/apache/spark/pull/28876#issuecomment-647248573 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
SparkQA commented on pull request #28876: URL: https://github.com/apache/spark/pull/28876#issuecomment-647248244 **[Test build #124345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124345/testReport)** for PR 28876 at commit [`bc159ab`](https://github.com/apache/spark/commit/bc159ab6d5191a9b7e01bafd7a65d39c03eb419c). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #28883: [SPARK-32042][SQL] Support UTF8String literals
wangyum commented on pull request #28883: URL: https://github.com/apache/spark/pull/28883#issuecomment-647247953 Actually, I use it internal: https://github.com/wangyum/spark/blob/SPARK-27227/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala#L227-L228 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
viirya commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443298986 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.aggregate import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers Review comment: yea. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
HyukjinKwon commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443298784 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.aggregate import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers Review comment: nit: I guess it can be removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
AmplabJenkins removed a comment on pull request #28876: URL: https://github.com/apache/spark/pull/28876#issuecomment-647246339 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
AmplabJenkins commented on pull request #28876: URL: https://github.com/apache/spark/pull/28876#issuecomment-647246334 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static
HyukjinKwon commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647246379 @kiszk too FYI This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static
HyukjinKwon commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647246269 You can create new one. Can I have some feedback from @rednaxelafx before we go ahead? @rednaxelafx, does it looks good enough to you or do you have any guidance on the testing for him? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
viirya commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443298402 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. + NormalizeFloatingNumbers.normalize(e) match { +case ne: NamedExpression => ne +case other => Alias(other, other.toString)() Review comment: Moved to `SparkStrategies`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
viirya commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443298442 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. Review comment: Fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647245944 Seems merged PR cann't reopen. Is there any way ? If not I will send an another pr for this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
SparkQA commented on pull request #28876: URL: https://github.com/apache/spark/pull/28876#issuecomment-647245921 **[Test build #124344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124344/testReport)** for PR 28876 at commit [`e5211c3`](https://github.com/apache/spark/commit/e5211c3538584c2bcb6ebe0569278a01dc313857). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure
beliefer commented on a change in pull request #28866: URL: https://github.com/apache/spark/pull/28866#discussion_r443297998 ## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ## @@ -405,6 +405,15 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi runEvent(JobCancelled(jobId, None)) } + /** Make task set success and check result. */ Review comment: You said right, I will change the comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
cloud-fan commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443297885 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. + NormalizeFloatingNumbers.normalize(e) match { +case ne: NamedExpression => ne +case other => Alias(other, other.toString)() Review comment: It looks weird if we do normalization in both `SparkStrategies` and `AggUtils`, better to put them in one place. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
cloud-fan commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443297250 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. Review comment: +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647244110 test3 ``` val df1 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c1") val df2 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c2") val start = System.currentTimeMillis df1.join(df2).where("c1 like c2").count() // 3 times test // before 159226, 159147, 159587 // after 159641, 160960, 160091 println(System.currentTimeMillis - start) ``` The worst case is that do compare and compile each row. And it seems only little regression. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
HyukjinKwon commented on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647242250 @koertkuipers, are you able to show the exact reproducible steps? Sounds like ideally we should file a separate JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28883: [SPARK-32042][SQL] Support UTF8String literals
HyukjinKwon commented on pull request #28883: URL: https://github.com/apache/spark/pull/28883#issuecomment-647241272 Yeah, it's internal. Let's don't do this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 edited a comment on pull request #28887: [SPARK-32044][SS] Kakfa continuous processing print mislead initial o…
warrenzhu25 edited a comment on pull request #28887: URL: https://github.com/apache/spark/pull/28887#issuecomment-647238451 > Probably, it would be nice to explain the difference between Optional.orElse vs Optional.orElseGet for Java a bit in PR description. It's very confusing in Scala developers as we expect lazy evaluation natively. > > And one more, please rebase the target branch to master. We normally receive patches against master branch. If the problem no longer exists in master branch, we may want to find and port back the commit which fixed the issue previously, instead of picking up newer commit. 1. Updated description to explain more about Optional.orElse vs Optional.orElseGet 2. In master branch, KafkaContinuousReader.scala has been refactored out in e75488718. This file is only in branch-2.4. Any suggestions about how should I do? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
Ngone51 commented on pull request #28869: URL: https://github.com/apache/spark/pull/28869#issuecomment-647240697 > Could you put the explanation into the PR description? Sure This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] koertkuipers commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
koertkuipers commented on pull request #27986: URL: https://github.com/apache/spark/pull/27986#issuecomment-647240428 @wangyum i have `spark.sql.adaptive.coalescePartitions.enabled=true` and the data size is small. how can i see that the step does coalesce? in number of tasks (i always see 2048)? in number of output files (i always see 2048)? i have `spark.sql.shuffle.partitions=2048` and `spark.sql.adaptive.coalescePartitions.initialPartitionNum=2048`. when i do a groupBy instead of repartition than the number of tasks varies with data size (and is less than 2048) and the number of output files varies too (and is less than 2048). with repartition the tasks and output files are always fixed at 2048. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
dongjoon-hyun commented on pull request #28869: URL: https://github.com/apache/spark/pull/28869#issuecomment-647240319 Got it. Thanks for the explanation, @Ngone51 . Could you put the explanation into the PR description? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28885: [SPARK-29375][SPARK-28940][SQL] Whole plan exchange and subquery reuse
HyukjinKwon commented on pull request #28885: URL: https://github.com/apache/spark/pull/28885#issuecomment-647239990 cc @maryannxue too FYI This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved
HyukjinKwon closed pull request #28881: URL: https://github.com/apache/spark/pull/28881 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved
HyukjinKwon commented on pull request #28881: URL: https://github.com/apache/spark/pull/28881#issuecomment-647239682 Closing as a dup This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
HyukjinKwon commented on pull request #28880: URL: https://github.com/apache/spark/pull/28880#issuecomment-647238811 cc @tgravescs FYI This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
Ngone51 commented on pull request #28869: URL: https://github.com/apache/spark/pull/28869#issuecomment-647238903 > Thanks. The fix looks good. Do you think we can have a UT failing on master/branch-3.0 without this patch, @Ngone51 ? Before this patch, for an Aggregate operator, its input attributes will always be equal to or more than(because it refers to its own attributes while it should refer to the attributes from the child) its reference attributes. Therefore, its missing inputs must always be empty. Thus, I think we couldn't have a UT to cover it. After we correct the right references in this PR, the problem is then exposed in the UT of SPARK-31620, since missing inputs are no longer always empty. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28870: [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
dongjoon-hyun commented on pull request #28870: URL: https://github.com/apache/spark/pull/28870#issuecomment-647238927 Thank you! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on pull request #28870: [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
yaooqinn commented on pull request #28870: URL: https://github.com/apache/spark/pull/28870#issuecomment-647238559 OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 commented on pull request #28887: [SPARK-32044][SS] Kakfa continuous processing print mislead initial o…
warrenzhu25 commented on pull request #28887: URL: https://github.com/apache/spark/pull/28887#issuecomment-647238451 > Probably, it would be nice to explain the difference between Optional.orElse vs Optional.orElseGet for Java a bit in PR description. It's very confusing in Scala developers as we expect lazy evaluation natively. > > And one more, please rebase the target branch to master. We normally receive patches against master branch. If the problem no longer exists in master branch, we may want to find and port back the commit which fixed the issue previously, instead of picking up newer commit. 1. Updated description to explain more about Optional.orElse vs Optional.orElseGet 2. In master branch, KafkaContinuousReader.scala is no longer existed. This class only in branch-2.4. Any suggestions about how should I do? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #28616: [SPARK-31798][SHUFFLE][API] Shuffle Writer API changes to return custom map output metadata
dongjoon-hyun closed pull request #28616: URL: https://github.com/apache/spark/pull/28616 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28870: [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
dongjoon-hyun commented on pull request #28870: URL: https://github.com/apache/spark/pull/28870#issuecomment-647237258 If you don't mind, could you make a backporting PR then? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
dongjoon-hyun commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443291919 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. + NormalizeFloatingNumbers.normalize(e) match { +case ne: NamedExpression => ne +case other => Alias(other, other.toString)() Review comment: If we broaden the scope, `SparkStrategies` already is looking at the detail of `functionsWithDistinct` like the following. ```scala val (functionsWithDistinct, functionsWithoutDistinct) = aggregateExpressions.partition(_.isDistinct) if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) { // This is a sanity check. We should not reach here when we have multiple distinct // column sets. Our `RewriteDistinctAggregates` should take care this case. sys.error("You hit a query analyzer bug. Please report your query to " + "Spark user mailing list.") } ``` And the very next line is the same logic block for `groupingExpression`. ```scala // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because // `groupingExpressions` is not extracted during logical phase. val normalizedGroupingExpressions = groupingExpressions.map { e => NormalizeFloatingNumbers.normalize(e) match { case n: NamedExpression => n case other => Alias(other, e.name)(exprId = e.exprId) } } ``` Given the above, I guess what you concerned is only one line source code, `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`. And, the following comment is about the definition of `functionsWithDistinct` which is generated by `SparkStrategies`. So, it's not a leaked detail or hidden from `SparkStrategies`. For me, it seems to be an assumption given by `SparkStrategies`. ``` // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one // DISTINCT aggregate function, all of those functions will have the same column expressions. // For example, it would be valid for functionsWithDistinct to be // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
dongjoon-hyun commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443291919 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. + NormalizeFloatingNumbers.normalize(e) match { +case ne: NamedExpression => ne +case other => Alias(other, other.toString)() Review comment: If we broaden the scope, `SparkStrategies` already is looking at the detail of `functionsWithDistinct` like the following. ```scala val (functionsWithDistinct, functionsWithoutDistinct) = aggregateExpressions.partition(_.isDistinct) if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) { // This is a sanity check. We should not reach here when we have multiple distinct // column sets. Our `RewriteDistinctAggregates` should take care this case. sys.error("You hit a query analyzer bug. Please report your query to " + "Spark user mailing list.") } ``` And the very next line is the same logic block for `groupingExpression`. ```scala // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because // `groupingExpressions` is not extracted during logical phase. val normalizedGroupingExpressions = groupingExpressions.map { e => NormalizeFloatingNumbers.normalize(e) match { case n: NamedExpression => n case other => Alias(other, e.name)(exprId = e.exprId) } } ``` Given the above, I guess what you concerned is only one line source code, `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`. And, the comment is about the definition of `functionsWithDistinct` which is generated from the above. The comment is not a detail hidden from `SparkStrategies`. For me, it seems to be an assumption given by `SparkStrategies`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate
dongjoon-hyun commented on a change in pull request #28876: URL: https://github.com/apache/spark/pull/28876#discussion_r443291919 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ## @@ -144,11 +145,16 @@ object AggUtils { // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children -val namedDistinctExpressions = distinctExpressions.map { - case ne: NamedExpression => ne - case other => Alias(other, other.toString)() +val normalizedNamedDistinctExpressions = distinctExpressions.map { e => + // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because + // `groupingExpressions` is not extracted during logical phase. + NormalizeFloatingNumbers.normalize(e) match { +case ne: NamedExpression => ne +case other => Alias(other, other.toString)() Review comment: If we broaden the scope, `SparkStrategies` already is looking at the detail of `functionsWithDistinct` like the following. ```scala val (functionsWithDistinct, functionsWithoutDistinct) = aggregateExpressions.partition(_.isDistinct) if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) { // This is a sanity check. We should not reach here when we have multiple distinct // column sets. Our `RewriteDistinctAggregates` should take care this case. sys.error("You hit a query analyzer bug. Please report your query to " + "Spark user mailing list.") } ``` And the very next line is the same logic block for `groupingExpression`. ```scala // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because // `groupingExpressions` is not extracted during logical phase. val normalizedGroupingExpressions = groupingExpressions.map { e => NormalizeFloatingNumbers.normalize(e) match { case n: NamedExpression => n case other => Alias(other, e.name)(exprId = e.exprId) } } ``` Given the above, I guess what you concerned is only one line source code, `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`. And, the following comment is about the definition of `functionsWithDistinct` which is generated from the above. The comment is not a detail hidden from `SparkStrategies`. For me, it seems to be an assumption given by `SparkStrategies`. ``` // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one // DISTINCT aggregate function, all of those functions will have the same column expressions. // For example, it would be valid for functionsWithDistinct to be // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is // disallowed because those two distinct aggregates have different column expressions. ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org