Re: [PR] [SPARK-32246][BUILD][INFRA] Add new GitHub Action to run Kinesis tests [spark]
dongjoon-hyun commented on code in PR #43736: URL: https://github.com/apache/spark/pull/43736#discussion_r1395285496 ## .github/workflows/build_and_test.yml: ## @@ -555,6 +555,81 @@ jobs: with: name: test-results-sparkr--${{ inputs.java }}-${{ inputs.hadoop }}-hive2.3 path: "**/target/test-reports/*.xml" + + kinesis-asl: Review Comment: BTW, do we need to add a new pipeline? If this is a small test, we can append this to the existing pipeline.
Re: [PR] [SPARK-32246][BUILD][INFRA] Add new GitHub Action to run Kinesis tests [spark]
dongjoon-hyun commented on code in PR #43736: URL: https://github.com/apache/spark/pull/43736#discussion_r1395282977 ## pom.xml: ## @@ -202,6 +202,7 @@ 4.1.17 14.0.1 3.1.9 +2.2.11 Review Comment: You can spin this off together with https://github.com/apache/spark/pull/43736/files#r1395282370
Re: [PR] [SPARK-32246][BUILD][INFRA] Add new GitHub Action to run Kinesis tests [spark]
dongjoon-hyun commented on code in PR #43736: URL: https://github.com/apache/spark/pull/43736#discussion_r1395282370 ## connector/kinesis-asl/pom.xml: ## @@ -76,6 +76,12 @@ jackson-dataformat-cbor ${fasterxml.jackson.version} + + javax.xml.bind + jaxb-api + ${jaxb-api.version} + test + Review Comment: If we need this for testing, this looks like an independent issue. Could you spin it off from this GitHub Action PR?
Re: [PR] [SPARK-32246][BUILD][INFRA] Add new GitHub Action to run Kinesis tests [spark]
dongjoon-hyun commented on code in PR #43736: URL: https://github.com/apache/spark/pull/43736#discussion_r1395281539 ## .github/workflows/build_and_test.yml: ## @@ -1049,7 +1124,7 @@ jobs: sudo install minikube-linux-amd64 /usr/local/bin/minikube rm minikube-linux-amd64 # Github Action limit cpu:2, memory: 6947MB, limit to 2U6G for better resource statistic - minikube start --cpus 2 --memory 6144 + minikube start --cpus 2 --memory 6144 --force Review Comment: Why do you touch `k8s-integration-tests` in a `kinesis` PR?
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]
LuciferYang commented on PR #43818: URL: https://github.com/apache/spark/pull/43818#issuecomment-1813946114 Thanks @zhengruifeng
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]
LuciferYang closed pull request #43818: [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` URL: https://github.com/apache/spark/pull/43818
Re: [PR] [SPARK-32246][BUILD][INFRA] Add new GitHub Action to run Kinesis tests [spark]
dongjoon-hyun commented on code in PR #43736: URL: https://github.com/apache/spark/pull/43736#discussion_r1395280300 ## .github/workflows/build_and_test.yml: ## @@ -555,6 +555,81 @@ jobs: with: name: test-results-sparkr--${{ inputs.java }}-${{ inputs.hadoop }}-hive2.3 path: "**/target/test-reports/*.xml" + + kinesis-asl: +needs: [precondition, infra-image] +# always run if sparkr == 'true', even infra-image is skip (such as non-master job) +#if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr == 'true' Review Comment: ?
Re: [PR] [SPARK-45948][K8S] Make single-pod spark jobs respect `spark.app.id` [spark]
dongjoon-hyun commented on PR #43833: URL: https://github.com/apache/spark/pull/43833#issuecomment-1813937357 Could you review this when you have some time, please, @LuciferYang?
Re: [PR] [WIP][INFRA] Test PyArrow 14 [spark]
zhengruifeng commented on PR #43829: URL: https://github.com/apache/spark/pull/43829#issuecomment-1813930288 ``` pyarrow 14.0.1 pydantic 2.5.1 pydantic_core 2.14.3 PyGObject 3.36.0 ```
[PR] [SPARK-45948][K8S] Make single-pod spark jobs respect `spark.app.id` [spark]
dongjoon-hyun opened a new pull request, #43833: URL: https://github.com/apache/spark/pull/43833 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45946][SS] Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite [spark]
anishshri-db commented on PR #43832: URL: https://github.com/apache/spark/pull/43832#issuecomment-1813879801 cc - @HeartSaVioR - PTAL, thx
[PR] [SPARK-45946] Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite [spark]
anishshri-db opened a new pull request, #43832: URL: https://github.com/apache/spark/pull/43832 ### What changes were proposed in this pull request? Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite ### Why are the changes needed? Without the change, we were getting this compilation warning ``` [warn] /Users/anish.shrigondekar/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala:854:17: method write in class FileUtils is deprecated [warn] Applicable -Wconf / @nowarn filters for this warning: msg=, cat=deprecation, site=org.apache.spark.sql.execution.streaming.state.RocksDBSuite, origin=org.apache.commons.io.FileUtils.write [warn] FileUtils.write(file2, s"v2\n$json2") [warn] ^ [warn] /Users/anish.shrigondekar/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala:1272:17: method write in class FileUtils is deprecated [warn] Applicable -Wconf / @nowarn filters for this warning: msg=, cat=deprecation, site=org.apache.spark.sql.execution.streaming.state.RocksDBSuite.generateFiles.$anonfun, origin=org.apache.commons.io.FileUtils.write [warn] FileUtils.write(file, "a" * length) [warn] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran test suite ``` 22:47:45.700 WARN org.apache.spark.sql.execution.streaming.state.RocksDBSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) = [info] Run completed in 1 minute, 55 seconds. [info] Total number of tests run: 77 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 77, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 172 s (02:52), completed Nov 15, 2023, 10:47:46 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No
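The non-deprecated commons-io overload takes an explicit charset as a third argument. A minimal sketch of the shape of the fix, assuming `StandardCharsets.UTF_8` as the charset (the PR's actual diff is not quoted in this thread):

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import org.apache.commons.io.FileUtils

object WriteWithCharsetSketch {
  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("rocksdb-suite", ".txt")
    // Deprecated form, relies on the platform default charset:
    //   FileUtils.write(file, "v2\n{}")
    // Non-deprecated overload, with the charset passed explicitly:
    FileUtils.write(file, "v2\n{}", StandardCharsets.UTF_8)
    println(FileUtils.readFileToString(file, StandardCharsets.UTF_8))
  }
}
```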
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1395211436 ## core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.HashMap + +import org.apache.spark.SparkException +import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT +import org.apache.spark.resource.ResourceProfile + +/** + * Class to hold information about a series of resources belonging to an executor. + * A resource could be a GPU, FPGA, etc. And it is used as a temporary + * class to calculate the resources amounts when offering resources to + * the tasks in the [[TaskSchedulerImpl]] + * + * One example is GPUs, where the addresses would be the indices of the GPUs + * + * @param resources The executor available resources and amount. eg, + * Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT, Review Comment: Done
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1395204414 ## core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.HashMap + +import org.apache.spark.SparkException +import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT +import org.apache.spark.resource.ResourceProfile + +/** + * Class to hold information about a series of resources belonging to an executor. + * A resource could be a GPU, FPGA, etc. And it is used as a temporary + * class to calculate the resources amounts when offering resources to + * the tasks in the [[TaskSchedulerImpl]] + * + * One example is GPUs, where the addresses would be the indices of the GPUs + * + * @param resources The executor available resources and amount. eg, + * Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT, + * "1" -> 1.0*RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT, + *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT) + * ) + */ +private[spark] class ExecutorResourcesAmounts( +private val resources: Map[String, Map[String, Long]]) extends Serializable { + + /** + * convert the resources to be mutable HashMap + */ + private val internalResources: Map[String, HashMap[String, Long]] = { +resources.map { case (rName, addressAmounts) => + rName -> HashMap(addressAmounts.toSeq: _*) +} + } + + /** + * The total address count of each resource. Eg, + * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT)) + * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2) + */ + lazy val resourceAmount: Map[String, Int] = internalResources.map { case (rName, addressMap) => Review Comment: Done.
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1395200464 ## core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.HashMap + +import org.apache.spark.SparkException +import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT +import org.apache.spark.resource.ResourceProfile + +/** + * Class to hold information about a series of resources belonging to an executor. + * A resource could be a GPU, FPGA, etc. And it is used as a temporary + * class to calculate the resources amounts when offering resources to + * the tasks in the [[TaskSchedulerImpl]] + * + * One example is GPUs, where the addresses would be the indices of the GPUs + * + * @param resources The executor available resources and amount. eg, + * Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT, + * "1" -> 1.0*RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT, + *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT) + * ) + */ +private[spark] class ExecutorResourcesAmounts( +private val resources: Map[String, Map[String, Long]]) extends Serializable { + + /** + * convert the resources to be mutable HashMap + */ + private val internalResources: Map[String, HashMap[String, Long]] = { +resources.map { case (rName, addressAmounts) => + rName -> HashMap(addressAmounts.toSeq: _*) +} + } + + /** + * The total address count of each resource. Eg, + * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT)) + * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2) + */ + lazy val resourceAmount: Map[String, Int] = internalResources.map { case (rName, addressMap) => +rName -> addressMap.size + } + + /** + * For testing purpose. convert internal resources back to the "fraction" resources.
+ */ + private[spark] def availableResources: Map[String, Map[String, Double]] = { +internalResources.map { case (rName, addressMap) => + rName -> addressMap.map { case (address, amount) => +address -> amount.toDouble / RESOURCE_TOTAL_AMOUNT + }.toMap +} + } + + /** + * Acquire the resource and update the resource + * @param assignedResource the assigned resource information + */ + def acquire(assignedResource: Map[String, Map[String, Long]]): Unit = { +assignedResource.foreach { case (rName, taskResAmounts) => + val availableResourceAmounts = internalResources.getOrElse(rName, +throw new SparkException(s"Try to acquire an address from $rName that doesn't exist")) + taskResAmounts.foreach { case (address, amount) => +val prevInternalTotalAmount = availableResourceAmounts.getOrElse(address, + throw new SparkException(s"Try to acquire an address that doesn't exist. $rName " + +s"address $address doesn't exist.")) + +val left = prevInternalTotalAmount - amount +if (left < 0) { + throw new SparkException(s"The total amount ${left.toDouble / RESOURCE_TOTAL_AMOUNT} " + +s"after acquiring $rName address $address should be >= 0") +} +internalResources(rName)(address) = left + } +} + } + + /** + * Release the assigned resources to the resource pool + * @param assignedResource resource to be released + */ + def release(assignedResource: Map[String, Map[String, Long]]): Unit = { +assignedResource.foreach { case (rName, taskResAmounts) => + val availableResourceAmounts = internalResources.getOrElse(rName, +throw new SparkException(s"Try to release an address from $rName that doesn't exist")) + taskResAmounts.foreach { case (address, amount) => +val prevInternalTotalAmount = availableResourceA
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1395199672 ## core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.HashMap + +import org.apache.spark.SparkException +import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT +import org.apache.spark.resource.ResourceProfile + +/** + * Class to hold information about a series of resources belonging to an executor. + * A resource could be a GPU, FPGA, etc. And it is used as a temporary + * class to calculate the resources amounts when offering resources to + * the tasks in the [[TaskSchedulerImpl]] + * + * One example is GPUs, where the addresses would be the indices of the GPUs + * + * @param resources The executor available resources and amount. eg, + * Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT, + * "1" -> 1.0*RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT, + *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT) + * ) + */ +private[spark] class ExecutorResourcesAmounts( +private val resources: Map[String, Map[String, Long]]) extends Serializable { + + /** + * convert the resources to be mutable HashMap + */ + private val internalResources: Map[String, HashMap[String, Long]] = { +resources.map { case (rName, addressAmounts) => + rName -> HashMap(addressAmounts.toSeq: _*) +} + } + + /** + * The total address count of each resource. Eg, + * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT), + * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT, + * "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT)) + * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2) + */ + lazy val resourceAmount: Map[String, Int] = internalResources.map { case (rName, addressMap) => +rName -> addressMap.size + } + + /** + * For testing purpose. convert internal resources back to the "fraction" resources.
+ */ + private[spark] def availableResources: Map[String, Map[String, Double]] = { +internalResources.map { case (rName, addressMap) => + rName -> addressMap.map { case (address, amount) => +address -> amount.toDouble / RESOURCE_TOTAL_AMOUNT + }.toMap +} + } + + /** + * Acquire the resource and update the resource + * @param assignedResource the assigned resource information + */ + def acquire(assignedResource: Map[String, Map[String, Long]]): Unit = { +assignedResource.foreach { case (rName, taskResAmounts) => + val availableResourceAmounts = internalResources.getOrElse(rName, +throw new SparkException(s"Try to acquire an address from $rName that doesn't exist")) + taskResAmounts.foreach { case (address, amount) => +val prevInternalTotalAmount = availableResourceAmounts.getOrElse(address, + throw new SparkException(s"Try to acquire an address that doesn't exist. $rName " + +s"address $address doesn't exist.")) + +val left = prevInternalTotalAmount - amount +if (left < 0) { + throw new SparkException(s"The total amount ${left.toDouble / RESOURCE_TOTAL_AMOUNT} " + +s"after acquiring $rName address $address should be >= 0") +} +internalResources(rName)(address) = left + } +} + } + + /** + * Release the assigned resources to the resource pool + * @param assignedResource resource to be released + */ + def release(assignedResource: Map[String, Map[String, Long]]): Unit = { +assignedResource.foreach { case (rName, taskResAmounts) => + val availableResourceAmounts = internalResources.getOrElse(rName, +throw new SparkException(s"Try to release an address from $rName that doesn't exist")) + taskResAmounts.foreach { case (address, amount) => +val prevInternalTotalAmount = availableResourceA
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1395199364 ## core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala: ## @@ -20,6 +20,42 @@ package org.apache.spark.resource import scala.collection.mutable import org.apache.spark.SparkException +import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT + +private[spark] object ResourceAmountUtils { + /** + * Using "double" to do the resource calculation may encounter a problem of precision loss. Eg + * + * scala> val taskAmount = 1.0 / 9 + * taskAmount: Double = 0. + * + * scala> var total = 1.0 + * total: Double = 1.0 + * + * scala> for (i <- 1 to 9 ) { + * | if (total >= taskAmount) { + * | total -= taskAmount + * | println(s"assign $taskAmount for task $i, total left: $total") + * | } else { + * | println(s"ERROR Can't assign $taskAmount for task $i, total left: $total") + * | } + * | } + * assign 0. for task 1, total left: 0. + * assign 0. for task 2, total left: 0. + * assign 0. for task 3, total left: 0.6665 + * assign 0. for task 4, total left: 0.5554 + * assign 0. for task 5, total left: 0.44425 + * assign 0. for task 6, total left: 0.33315 + * assign 0. for task 7, total left: 0.22204 + * assign 0. for task 8, total left: 0.11094 + * ERROR Can't assign 0. for task 9, total left: 0.11094 + * + * So we multiply RESOURCE_TOTAL_AMOUNT to convert the double to long to avoid this limitation. + * Double can display up to 16 decimal places, so we set the factor to + * 10, 000, 000, 000, 000, 000L. + */ + final val RESOURCE_TOTAL_AMOUNT: Long = 10000000000000000L Review Comment: Really good suggestion. Done
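The fix described in the quoted comment can be checked standalone: redoing the 1/9 example in `Long` units makes all nine assignments succeed, because integer subtraction is exact. A sketch, with the factor of 10^16 taken from the comment above:

```scala
object FractionPrecisionSketch {
  final val ONE: Long = 10000000000000000L // 10^16, mirroring RESOURCE_TOTAL_AMOUNT

  def main(args: Array[String]): Unit = {
    val taskAmount: Long = (1.0 / 9 * ONE).toLong // 1111111111111111
    var total: Long = ONE
    for (i <- 1 to 9) {
      if (total >= taskAmount) {
        total -= taskAmount
        println(s"assign $taskAmount for task $i, total left: $total")
      } else {
        println(s"ERROR Can't assign $taskAmount for task $i, total left: $total")
      }
    }
    // All nine assignments succeed; 1 unit remains since 10^16 is not divisible by 9.
  }
}
```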
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1395199162 ## core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala: ## @@ -191,7 +191,10 @@ private[spark] class CoarseGrainedExecutorBackend( } else { val taskDesc = TaskDescription.decode(data.value) logInfo("Got assigned task " + taskDesc.taskId) -taskResources.put(taskDesc.taskId, taskDesc.resources) +// Convert resources amounts into ResourceInformation +val resources = taskDesc.resources.map { case (rName, addressesAmounts) => + rName -> new ResourceInformation(rName, addressesAmounts.keys.toSeq.sorted.toArray)} +taskResources.put(taskDesc.taskId, resources) Review Comment: Sounds good. New commits have removed the taskResources.
Re: [PR] [SPARK-45927][PYTHON] Update path handling for Python data source [spark]
allisonwang-db commented on code in PR #43809: URL: https://github.com/apache/spark/pull/43809#discussion_r1395186286 ## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val builder = sparkSession.sharedState.dataSourceManager.lookupDataSource(source) // Unless the legacy path option behavior is enabled, the extraOptions here // should not include "path" or "paths" as keys. -val plan = builder(sparkSession, source, paths, userSpecifiedSchema, extraOptions) +// Add path to the options field. Note currently it only supports a single path. +val optionsWithPath = if (paths.isEmpty) { + extraOptions +} else if (paths.length == 1) { +extraOptions + ("path" -> paths.head) +} else { + throw QueryCompilationErrors.multiplePathsUnsupportedError(source, paths) Review Comment: Yea, let's just follow the DSv2 approach (options['paths'] = JSON-serialized string list) to make the Python data source behave the same as DSv2. I will update this.
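A sketch of the DSv2-style encoding agreed on here, using Jackson for the serialization; the option keys mirror the discussion, while the paths and surrounding code are illustrative rather than the PR's actual implementation:

```scala
import com.fasterxml.jackson.databind.ObjectMapper

object PathsOptionSketch {
  def main(args: Array[String]): Unit = {
    val paths = Seq("/data/2023/part-0", "/data/2023/part-1")
    val mapper = new ObjectMapper()

    // Single path: keep the plain "path" option.
    val single = Map("path" -> paths.head)

    // Multiple paths: one "paths" option holding a JSON string array,
    // which the Python side can decode with json.loads.
    val multiple = Map("paths" -> mapper.writeValueAsString(paths.toArray))

    println(single("path"))    // /data/2023/part-0
    println(multiple("paths")) // ["/data/2023/part-0","/data/2023/part-1"]
  }
}
```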
Re: [PR] [SPARK-45511] Fix state reader suite flakiness by cleaning up resources after each test run [spark]
chaoqin-li1123 commented on PR #43831: URL: https://github.com/apache/spark/pull/43831#issuecomment-1813830603 @HeartSaVioR
[PR] [SPARK-45511] Fix state reader suite flakiness by cleaning up resources after each test run [spark]
chaoqin-li1123 opened a new pull request, #43831: URL: https://github.com/apache/spark/pull/43831 ### What changes were proposed in this pull request? Fix state reader suite flakiness by cleaning up resources after each test. Because all state store instances share the same maintenance task pool, a failed maintenance task from a previous test run may affect future test runs and cause test failures. Clean up StateStore explicitly to unflake the test. ### Why are the changes needed? To unflake the test.
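A minimal sketch of the cleanup pattern the description refers to, assuming `StateStore.stop()` as the hook that unloads providers and stops the shared maintenance pool (the suite and test names here are illustrative):

```scala
import org.apache.spark.sql.execution.streaming.state.StateStore
import org.scalatest.BeforeAndAfterEach
import org.scalatest.funsuite.AnyFunSuite

class StateReaderCleanupSketch extends AnyFunSuite with BeforeAndAfterEach {
  override def afterEach(): Unit = {
    try {
      // Unload state store providers and stop the shared maintenance task pool
      // so a failed maintenance task cannot leak into the next test.
      StateStore.stop()
    } finally {
      super.afterEach()
    }
  }

  test("state reader test runs with a clean slate") {
    // ... exercise the state data source reader here ...
    succeed
  }
}
```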
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
panbingkun commented on PR #37588: URL: https://github.com/apache/spark/pull/37588#issuecomment-1813824544 > thanks, merging to master! Thank you again for your great help! ❤️❤️❤️
Re: [PR] [SPARK-45927][PYTHON] Update path handling for Python data source [spark]
cloud-fan commented on code in PR #43809: URL: https://github.com/apache/spark/pull/43809#discussion_r1395160860 ## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val builder = sparkSession.sharedState.dataSourceManager.lookupDataSource(source) // Unless the legacy path option behavior is enabled, the extraOptions here // should not include "path" or "paths" as keys. -val plan = builder(sparkSession, source, paths, userSpecifiedSchema, extraOptions) +// Add path to the options field. Note currently it only supports a single path. +val optionsWithPath = if (paths.isEmpty) { + extraOptions +} else if (paths.length == 1) { +extraOptions + ("path" -> paths.head) +} else { + throw QueryCompilationErrors.multiplePathsUnsupportedError(source, paths) Review Comment: does it help to add a `paths` option using JSON to hold String[]?
Re: [PR] [SPARK-45927][PYTHON] Update path handling for Python data source [spark]
cloud-fan commented on code in PR #43809: URL: https://github.com/apache/spark/pull/43809#discussion_r1395160402 ## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val builder = sparkSession.sharedState.dataSourceManager.lookupDataSource(source) // Unless the legacy path option behavior is enabled, the extraOptions here // should not include "path" or "paths" as keys. -val plan = builder(sparkSession, source, paths, userSpecifiedSchema, extraOptions) +// Add path to the options field. Note currently it only supports a single path. +val optionsWithPath = if (paths.isEmpty) { + extraOptions +} else if (paths.length == 1) { +extraOptions + ("path" -> paths.head) Review Comment: ```suggestion extraOptions + ("path" -> paths.head) ```
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on PR #37588: URL: https://github.com/apache/spark/pull/37588#issuecomment-1813802035 thanks, merging to master!
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan closed pull request #37588: [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 URL: https://github.com/apache/spark/pull/37588
[PR] [SPARK-45764][PYTHON][DOCS][3.3] Make code block copyable [spark]
panbingkun opened a new pull request, #43830: URL: https://github.com/apache/spark/pull/43830 ### What changes were proposed in this pull request? The PR aims to make code blocks `copyable` in the PySpark docs. Backport of the above to `branch-3.3`. Master branch PR: https://github.com/apache/spark/pull/43799 ### Why are the changes needed? Improving the usability of the PySpark documents. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to easily copy code blocks in the PySpark docs. ### How was this patch tested? - Manually tested. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No.
[PR] [WIP][INFRA] Test PyArrow 14 [spark]
zhengruifeng opened a new pull request, #43829: URL: https://github.com/apache/spark/pull/43829 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-45764][PYTHON][DOCS][3.4] Make code block copyable [spark]
panbingkun opened a new pull request, #43828: URL: https://github.com/apache/spark/pull/43828 ### What changes were proposed in this pull request? The PR aims to make code blocks `copyable` in the PySpark docs. Backport of the above to `branch-3.4`. Master branch PR: https://github.com/apache/spark/pull/43799 ### Why are the changes needed? Improving the usability of the PySpark documents. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to easily copy code blocks in the PySpark docs. ### How was this patch tested? - Manually tested. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-45747][SS] Use prefix key information in state metadata to handle reading state for session window aggregation [spark]
HeartSaVioR closed pull request #43788: [SPARK-45747][SS] Use prefix key information in state metadata to handle reading state for session window aggregation URL: https://github.com/apache/spark/pull/43788
Re: [PR] [SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable [spark]
panbingkun commented on PR #43827: URL: https://github.com/apache/spark/pull/43827#issuecomment-1813732360 I am preparing backports for the other branches: branch-3.3 and branch-3.4.
Re: [PR] [SPARK-45827][SQL] Fix variant parquet reader. [spark]
cloud-fan closed pull request #43825: [SPARK-45827][SQL] Fix variant parquet reader. URL: https://github.com/apache/spark/pull/43825
Re: [PR] [SPARK-45827][SQL] Fix variant parquet reader. [spark]
cloud-fan commented on PR #43825: URL: https://github.com/apache/spark/pull/43825#issuecomment-1813730404 thanks, merging to master!
[PR] [SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable [spark]
panbingkun opened a new pull request, #43827: URL: https://github.com/apache/spark/pull/43827 ### What changes were proposed in this pull request? The PR aims to make code blocks `copyable` in the PySpark docs. This PR backports it to `branch-3.5`. ### Why are the changes needed? Improving the usability of the PySpark documents. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to easily copy code blocks in the PySpark docs. ### How was this patch tested? - Manually tested. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No.
[PR] [SPARK-45945][CONNECT] Add a helper function for `parser` [spark]
zhengruifeng opened a new pull request, #43826: URL: https://github.com/apache/spark/pull/43826 ### What changes were proposed in this pull request? Add a helper function for `parser` ### Why are the changes needed? We don't use any other parser in the planner; this helper is added just for simplification and consistency. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no
Re: [PR] [SPARK-45506][CONNECT] Add ivy URI support to SparkConnect addArtifact [spark]
LuciferYang commented on PR #43354: URL: https://github.com/apache/spark/pull/43354#issuecomment-1813722799 @vsevolodstep-db I found that after moving MavenUtilsSuite.scala to the common-utils module, it cannot pass the test. Do you know why? The current GA does not test this case (this issue will be fixed later), and it can be reproduced locally with `build/sbt "common-utils/test"`. Then: ``` [info] MavenUtilsSuite: [info] - incorrect maven coordinate throws error (8 milliseconds) [info] - create repo resolvers (24 milliseconds) [info] - create additional resolvers (3 milliseconds) :: loading settings :: url = jar:file:/Users/yangjie01/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/apache/ivy/ivy/2.5.1/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml [info] - add dependencies works correctly (35 milliseconds) [info] - excludes works correctly (2 milliseconds) [info] - ivy path works correctly (3 seconds, 759 milliseconds) [info] - search for artifact at local repositories *** FAILED *** (2 seconds, 833 milliseconds) [info] java.lang.RuntimeException: [unresolved dependency: my.great.lib#mylib;0.1: java.text.ParseException: [[Fatal Error] ivy-0.1.xml.original:22:18: XML document structures must start and end within the same entity. in f/SourceCode/git/spark-mine-sbt/target/tmp/ivy-8b860aca-a9c4-4af9-b15a-ac8c6049b773/cache/my.great.lib/mylib/ivy-0.1.xml.original [info] ]] [info] at org.apache.spark.util.MavenUtils$.resolveMavenCoordinates(MavenUtils.scala:459) [info] at org.apache.spark.util.MavenUtilsSuite.$anonfun$new$25(MavenUtilsSuite.scala:173) [info] at org.apache.spark.util.MavenUtilsSuite.$anonfun$new$25$adapted(MavenUtilsSuite.scala:172) [info] at org.apache.spark.util.IvyTestUtils$.withRepository(IvyTestUtils.scala:373) [info] at org.apache.spark.util.MavenUtilsSuite.$anonfun$new$18(MavenUtilsSuite.scala:172) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:333) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) [info] at org.scalatest.Suite.run(Suite.scala:1114) [info] at org.scalatest.Suite.run$(Suite.scala:1096) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) [info] at org.apache.spark.util.MavenUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(MavenUtilsSuite.scala:36) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.util.MavenUtilsSuite.
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]
LuciferYang commented on code in PR #43818: URL: https://github.com/apache/spark/pull/43818#discussion_r1395096661 ## dev/sparktestsupport/modules.py: ## @@ -178,7 +178,7 @@ def __hash__(self): core = Module( name="core", -dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], +dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, utils], Review Comment: done
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]
zhengruifeng commented on code in PR #43818: URL: https://github.com/apache/spark/pull/43818#discussion_r1395096255 ## dev/sparktestsupport/modules.py: ## @@ -113,6 +113,14 @@ def __hash__(self): ], ) +utils = Module( Review Comment: yeah, this is python :)
Re: [PR] [SPARK-45764][PYTHON][DOCS] Make code block copyable [spark]
panbingkun commented on PR #43799: URL: https://github.com/apache/spark/pull/43799#issuecomment-1813716352 > @panbingkun would you mind creating a backporting PR? Actually yeah I think it's an important improvement in docs. Okay, let me do it.
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core` module in `module.py` [spark]
LuciferYang commented on code in PR #43818: URL: https://github.com/apache/spark/pull/43818#discussion_r1395095578 ## dev/sparktestsupport/modules.py: ## @@ -113,6 +113,14 @@ def __hash__(self): ], ) +utils = Module( Review Comment: Moving it is because of https://github.com/apache/spark/assets/1475305/cc40aa65-2c04-4ca6-9a34-3b3da30954c1 (screenshot)
Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core` module in `module.py` [spark]
zhengruifeng commented on code in PR #43818: URL: https://github.com/apache/spark/pull/43818#discussion_r1395092781 ## dev/sparktestsupport/modules.py: ## @@ -178,7 +178,7 @@ def __hash__(self): core = Module( name="core", -dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher], +dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, utils], Review Comment: > utils module is also a direct dependency of unsafe and network-common Let's also add this dependency.
Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]
LuciferYang commented on PR #43796: URL: https://github.com/apache/spark/pull/43796#issuecomment-1813710853 rebased
Re: [PR] [MINOR] Fix some typo [spark]
HyukjinKwon closed pull request #43724: [MINOR] Fix some typo URL: https://github.com/apache/spark/pull/43724
Re: [PR] [SPARK-45922][CONNECT][CLIENT] Minor retries refactoring (follow-up to multiple policies) [spark]
HyukjinKwon commented on PR #43800: URL: https://github.com/apache/spark/pull/43800#issuecomment-1813710341 Mind retriggering https://github.com/cdkrot/apache_spark/actions/runs/6877183050/job/18704368968? I think it might be related.
Re: [PR] [MINOR] Fix some typo [spark]
HyukjinKwon commented on PR #43724: URL: https://github.com/apache/spark/pull/43724#issuecomment-1813710430 Merged to master.
Re: [PR] [SPARK-45562][DOCS] Regenerate `docs/sql-error-conditions.md` and add `42KDF` to `SQLSTATE table` in `error/README.md` [spark]
LuciferYang commented on PR #43817: URL: https://github.com/apache/spark/pull/43817#issuecomment-1813709899 Thanks @dongjoon-hyun @HyukjinKwon @beliefer @sandip-db
Re: [PR] [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow [spark]
HyukjinKwon closed pull request #43810: [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow URL: https://github.com/apache/spark/pull/43810
Re: [PR] [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow [spark]
HyukjinKwon commented on PR #43810: URL: https://github.com/apache/spark/pull/43810#issuecomment-1813707722 Merged to master.
Re: [PR] [SPARK-44488][SQL] Support deserializing long types when creating `Metadata` object from JObject [spark]
HyukjinKwon commented on PR #42083: URL: https://github.com/apache/spark/pull/42083#issuecomment-1813704504 It will be available from 4.0.0 most likely.
Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]
LuciferYang commented on code in PR #43502: URL: https://github.com/apache/spark/pull/43502#discussion_r1395081107 ## common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java: ## @@ -182,23 +193,34 @@ public boolean skip(long n) { @Override public synchronized void close() throws IOException { -db.notifyIteratorClosed(this); +db.notifyIteratorClosed(it); if (!closed) { - it.close(); - closed = true; - next = null; + try { +it.close(); + } finally { +closed = true; +next = null; +cancelResourceClean(); Review Comment: Yes, we have discussed this issue. The reason for not directly calling `this.cleaner.clean()` is that the close path in the `Cleaner` adds a `synchronized (this._db)` block, which differs slightly from the semantics of the original `close()` method. For the full discussion, please refer to this thread: https://github.com/apache/spark/pull/43502#discussion_r1372954706
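As background, here is a minimal Java sketch of the `java.lang.ref.Cleaner` pattern under discussion. The class and member names (`IteratorHolder`, `CleanupTask`, the `dbLock` object) are illustrative assumptions, not Spark's actual code:

```java
import java.lang.ref.Cleaner;

// Sketch: register a cleanup action instead of overriding finalize().
// The action runs when the holder becomes phantom-reachable; note that this
// path takes the db lock, which a plain close() does not.
class IteratorHolder implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Must not reference the holder itself, or it would never become
    // phantom-reachable and the action would never run.
    private static final class CleanupTask implements Runnable {
        private final Object dbLock;
        private final AutoCloseable resource;
        volatile boolean cancelled = false;

        CleanupTask(Object dbLock, AutoCloseable resource) {
            this.dbLock = dbLock;
            this.resource = resource;
        }

        @Override public void run() {
            if (cancelled) return;          // close() already released the resource
            synchronized (dbLock) {         // the extra locking the comment refers to
                try { resource.close(); } catch (Exception ignored) { }
            }
        }
    }

    private final AutoCloseable resource;
    private final CleanupTask task;

    IteratorHolder(Object dbLock, AutoCloseable resource) {
        this.resource = resource;
        this.task = new CleanupTask(dbLock, resource);
        CLEANER.register(this, task);
    }

    @Override public void close() throws Exception {
        // Close directly, without taking dbLock, preserving the original close()
        // semantics; then disarm the Cleaner action (the cancelResourceClean() analogue).
        try {
            resource.close();
        } finally {
            task.cancelled = true;
        }
    }
}
```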
Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]
yaooqinn commented on PR #43746: URL: https://github.com/apache/spark/pull/43746#issuecomment-1813660864 > What do you mean by this, are you saying the Spark on YARN handling of preempted containers is not working properly? Meaning if the container is preempted it should not show up as an executor failure. Are you seeing those preempted containers show up as failed? Or are you saying that yes Spark on YARN doesn't mark preempted as failed?

PREEMPTED is OK, and those cases are not counted by the executor failure tracker. I was wrong about this, sorry for the noise.

> If that is the case then Spark should allow users to turn spark.executor.maxNumFailures off or I assume you could do the same thing by setting it to int.maxvalue.

There are pros and cons to this suggestion, I guess. Disabling the executor failure tracker certainly keeps the app alive, but it also defeats fast failure.

> As implemented this seems very arbitrary and I would think hard for a normal user to set and use this feature.

Aren't most numeric configurations in Spark, and their defaults, somewhat arbitrary?

> I don't understand why this isn't the same as minimum number of executors as that seems more in line - saying you need some minimum number for this application to run and by the way its ok to keep running with this is launching new executors is failing.

The minimum number of executors can be 0.
Re: [PR] [SPARK-45931][PYTHON][DOCS] Refine docstring of mapInPandas [spark]
HyukjinKwon closed pull request #43811: [SPARK-45931][PYTHON][DOCS] Refine docstring of mapInPandas URL: https://github.com/apache/spark/pull/43811
Re: [PR] [SPARK-45931][PYTHON][DOCS] Refine docstring of mapInPandas [spark]
HyukjinKwon commented on PR #43811: URL: https://github.com/apache/spark/pull/43811#issuecomment-1813617078 Merged to master.
Re: [PR] [SPARK-45936][PS] Optimize `Index.symmetric_difference` [spark]
HyukjinKwon closed pull request #43816: [SPARK-45936][PS] Optimize `Index.symmetric_difference` URL: https://github.com/apache/spark/pull/43816
Re: [PR] [SPARK-45936][PS] Optimize `Index.symmetric_difference` [spark]
HyukjinKwon commented on PR #43816: URL: https://github.com/apache/spark/pull/43816#issuecomment-1813613763 Merged to master.
Re: [PR] [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error [spark]
panbingkun commented on code in PR #43815: URL: https://github.com/apache/spark/pull/43815#discussion_r1395047884 ## python/docs/source/conf.py: ## @@ -102,9 +102,9 @@ .. |examples| replace:: Examples .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python .. |downloading| replace:: Downloading -.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html +.. _downloading: https://spark.apache.org/docs/{1}/#downloading .. |building_spark| replace:: Building Spark -.. _building_spark: https://spark.apache.org/docs/{1}/#downloading +.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html Review Comment: Yes, I have checked the branches branch-3.3, branch-3.4, and branch-3.5, and they are all affected, so I have added the affected versions 3.5.0, 3.4.1, and 3.3.3. If there are conflicts during the merge, please let me know and I will resubmit on each branch. Thank you very much for the reminder. Screenshot: https://github.com/apache/spark/assets/15246973/00795396-3aef-47af-ab39-daff66686228
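For reference, a hedged sketch of how such substitutions are typically wired through Sphinx's `rst_epilog` in a `conf.py`; the variable names below are assumptions, as the real file derives these values from the Spark version:

```python
# Sketch: Sphinx appends rst_epilog to every .rst page, so the substitutions
# below can be used anywhere in the docs as |downloading|_ or |building_spark|_.
source_branch = "master"   # assumed; derived from the Spark version in the real file
docs_version = "latest"    # assumed

rst_epilog = """
.. |downloading| replace:: Downloading
.. _downloading: https://spark.apache.org/docs/{1}/#downloading
.. |building_spark| replace:: Building Spark
.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html
""".format(source_branch, docs_version)
```

The bug fixed here was simply that the two target URLs had been swapped, so |downloading|_ pointed at the build instructions and |building_spark|_ pointed at the download anchor.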
Re: [PR] [SPARK-45827] Fix variant parquet reader. [spark]
chenhao-db commented on code in PR #43825: URL: https://github.com/apache/spark/pull/43825#discussion_r1395045528 ## sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala: ## @@ -73,5 +73,12 @@ class VariantSuite extends QueryTest with SharedSparkSession { values.map(v => if (v == null) "null" else v.debugString()).sorted } assert(prepareAnswer(input) == prepareAnswer(result)) + +withTempDir { dir => Review Comment: Because the variant values it writes are all non-null. The bug only shows up when there is a null variant value.
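A hedged sketch of the kind of round-trip that exposes the bug, assuming the `VariantSuite` test helpers and that `parse_json` yields a null variant for a null input:

```scala
// Sketch only: write a mix of non-null and null variant values to Parquet and
// make sure the null comes back as null rather than a corrupted value.
withTempDir { dir =>
  val path = new java.io.File(dir, "v").getCanonicalPath
  Seq("""{"a": 1}""", null)
    .toDF("json")
    .selectExpr("parse_json(json) as v") // null input row => null variant value
    .write.parquet(path)

  val rows = spark.read.parquet(path).collect()
  assert(rows.exists(_.isNullAt(0)))     // fails without the assembleStruct fix
}
```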
Re: [PR] [SPARK-45827] Fix variant parquet reader. [spark]
cloud-fan commented on code in PR #43825: URL: https://github.com/apache/spark/pull/43825#discussion_r1395044163 ## sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala: ## @@ -73,5 +73,12 @@ class VariantSuite extends QueryTest with SharedSparkSession { values.map(v => if (v == null) "null" else v.debugString()).sorted } assert(prepareAnswer(input) == prepareAnswer(result)) + +withTempDir { dir => Review Comment: The `basic tests` test case also tests Parquet write and read; why didn't it expose the bug?
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
panbingkun commented on PR #37588: URL: https://github.com/apache/spark/pull/37588#issuecomment-1813555489 @cloud-fan If you have time, could you please take a look at this PR?
Re: [PR] [SPARK-45827] Fix variant parquet reader. [spark]
chenhao-db commented on PR #43825: URL: https://github.com/apache/spark/pull/43825#issuecomment-1813534407 @cloud-fan @HyukjinKwon could you help take a look? Thanks!
Re: [PR] [SPARK-44699][CORE] Add log when finished write events to file in EventLogFileWriter.closeWriter [spark]
github-actions[bot] commented on PR #42372: URL: https://github.com/apache/spark/pull/42372#issuecomment-1813504639 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [Spark Ticket][WIP] Added a warning to pop up in the case the user doesn't use GPUs [spark]
github-actions[bot] commented on PR #42308: URL: https://github.com/apache/spark/pull/42308#issuecomment-1813504691 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-44685][SQL] Remove deprecated Catalog#createExternalTable [spark]
github-actions[bot] commented on PR #42356: URL: https://github.com/apache/spark/pull/42356#issuecomment-1813504669 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-45525][SQL][PYTHON] Initial support for Python data source write [spark]
allisonwang-db commented on PR #43791: URL: https://github.com/apache/spark/pull/43791#issuecomment-1813486370 @cloud-fan @HyukjinKwon @ueshin This PR is ready for review. It focuses on the optimizer/execution part of data source write and is independent of the DataFrameWriter.
Re: [PR] [SPARK-45592][SPARK-45282][SQL] Correctness issue in AQE with InMemoryTableScanExec [spark]
dongjoon-hyun commented on PR #43760: URL: https://github.com/apache/spark/pull/43760#issuecomment-1813474266 For the record, I landed this on branch-3.4 after resolving conflicts.
Re: [PR] [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error [spark]
dongjoon-hyun commented on code in PR #43815: URL: https://github.com/apache/spark/pull/43815#discussion_r1394963625 ## python/docs/source/conf.py: ## @@ -102,9 +102,9 @@ .. |examples| replace:: Examples .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python .. |downloading| replace:: Downloading -.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html +.. _downloading: https://spark.apache.org/docs/{1}/#downloading .. |building_spark| replace:: Building Spark -.. _building_spark: https://spark.apache.org/docs/{1}/#downloading +.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html Review Comment: If this happens in Apache Spark 3.5.0, could you add `3.5.0` to the affected version, @panbingkun? Screenshot: https://github.com/apache/spark/assets/9700541/d3966d02-8572-4a96-b56a-a4bf729e65f9
Re: [PR] [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow [spark]
allisonwang-db commented on PR #43810: URL: https://github.com/apache/spark/pull/43810#issuecomment-1813393808 cc @cloud-fan
[PR] [SPARK-45827] Fix variant parquet reader. [spark]
chenhao-db opened a new pull request, #43825: URL: https://github.com/apache/spark/pull/43825

## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/43707. The previous PR missed a piece in the variant Parquet reader: we treat the variant type as a `struct`, so the reader also needs a similar `assembleStruct` pass to correctly set the nullness of variant values from the Parquet definition/repetition levels.

## How was this patch tested?
Extended the existing unit test; it would fail without the change.
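To illustrate the definition/repetition-level point, here is a simplified sketch, not Spark's actual Parquet assembly code: for a group read as a struct, a record whose definition level is below the level at which the group itself is defined must be assembled as NULL.

```scala
// Sketch: derive per-row nullness for a group (here, the variant struct) from
// Parquet definition levels. Rows defined below the group's own level are NULL.
def assembleNullness(defLevels: Array[Int], groupDefLevel: Int): Array[Boolean] =
  defLevels.map(_ < groupDefLevel) // true => this variant value is NULL

// e.g. groupDefLevel = 1 for an optional top-level column:
// assembleNullness(Array(1, 0, 1), 1) == Array(false, true, false)
```

Without such a pass, a null variant row would be materialized as if its binary fields were present, which is the kind of corruption the fix addresses.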
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813372467 I also cherry-picked this to branch-3.5.
Re: [PR] [SPARK-44488][SQL] Support deserializing long types when creating `Metadata` object from JObject [spark]
scottsand-db commented on PR #42083: URL: https://github.com/apache/spark/pull/42083#issuecomment-1813363910 Will this make the Apache Spark 3.6 release, or 4.0?
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813346533 Also, thank you @yaooqinn and @bjornjorgensen, too.
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun closed pull request #43814: [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout URL: https://github.com/apache/spark/pull/43814
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813345234 Thank you so much, @huaxingao. Merged to master.
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
huaxingao commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813344307 LGTM. Thanks @dongjoon-hyun
Re: [PR] [SPARK-45719][K8S][TESTS] Upgrade AWS SDK to v2 for Kubernetes IT [spark]
dongjoon-hyun commented on PR #43510: URL: https://github.com/apache/spark/pull/43510#issuecomment-1813343142 Welcome to the Apache Spark community, @junyuc25! I added you to the Apache Spark contributor group and assigned SPARK-45719 to you.
Re: [PR] [SPARK-45719][K8S][TESTS] Upgrade AWS SDK to v2 for Kubernetes IT [spark]
dongjoon-hyun closed pull request #43510: [SPARK-45719][K8S][TESTS] Upgrade AWS SDK to v2 for Kubernetes IT URL: https://github.com/apache/spark/pull/43510
Re: [PR] [SPARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]
ahshahid commented on PR #43824: URL: https://github.com/apache/spark/pull/43824#issuecomment-1813334777 I will add documentation for the new methods in the next commit.
[PR] [SPARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]
ahshahid opened a new pull request, #43824: URL: https://github.com/apache/spark/pull/43824

### What changes were proposed in this pull request?
The main change in this PR is to augment the SupportsRuntimeV2Filtering trait by adding two new methods, which the underlying V2 Scan should implement:

`default boolean equalToIgnoreRuntimeFilters(Scan other) { return this.equals(other); }`
`default int hashCodeIgnoreRuntimeFilters() { return this.hashCode(); }`

BatchScanExec is modified accordingly to invoke these two methods when checking the equality of the Scan. Please note that this PR includes the code of two other PRs:
1) [SPARK-45658](https://github.com/apache/spark/pull/43737). This PR is not required per se, but is good to have for correctness (and my other PR for broadcast-var pushdown relies on this fix).
2) [SPARK-45926](https://github.com/apache/spark/pull/43808). This PR is necessary to reproduce the issue, so its code is needed for this PR to show the issue.

**Also, for this test to pass, the code of DataSourceV2Relation.computeStats should disable throwing an assertion error in testing, as that is a separate bug which gets hit when the bug test for this PR is run.**

### Why are the changes needed?
This change is needed, IMO, to fix the issue of exchange reuse not happening when DPP filters are pushed down to the scan level. The issue is this: in certain types of queries, e.g. TPCDS query 14b, the reuse of exchange does not happen in AQE, resulting in a perf degradation. The Spark TPCDS tests are unable to catch the problem because the InMemoryScan used for testing does not implement equals & hashCode in a way that takes the pushed-down runtime filters into account. In concrete Scan implementations, e.g. iceberg's SparkBatchQueryScan, the equality check, apart from other things, also involves the pushed runtime filters (which is correct).

Below is a description of how this issue surfaces. For a given stage being materialized, just before materialization starts, the runtime filters are confined to the BatchScanExec level. Only when the actual RDD corresponding to the BatchScanExec is being evaluated do the runtime filters get pushed to the underlying Scan. Now if a new stage is created and it checks the stageCache using its canonicalized plan to see if a stage can be reused, it fails to find the reusable stage even if that stage exists, because the canonicalized spark plan present in the stage cache now has the runtime filters pushed to the Scan, so the incoming canonicalized spark plan does not match the key: the incoming plan's scan has no runtime filters, while the plan stored as a key in the stage cache has the scan with runtime filters pushed.

The fix, as I have implemented it, is to provide the two methods above in the SupportsRuntimeV2Filtering interface. In BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then instead of batch.equals it should call scan.equalToIgnoreRuntimeFilters, and the underlying Scan implementations should provide equality that excludes runtime filters. Similarly, the hashCode of BatchScanExec should use scan.hashCodeIgnoreRuntimeFilters instead of batch.hashCode.

### Does this PR introduce _any_ user-facing change?
No. But the respective DataSourceV2Relations may need to augment their code.

### How was this patch tested?
Added a bug test for the same.

### Was this patch authored or co-authored using generative AI tooling?
No
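A hedged Scala sketch of how `BatchScanExec` could delegate to the proposed methods; this is an illustration of the description above, not the actual PR diff, and the two `IgnoreRuntimeFilters` methods are the PR's proposal rather than an existing API:

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsRuntimeV2Filtering}

// Sketch only: compare scans while ignoring pushed runtime filters when the
// scan opts in via SupportsRuntimeV2Filtering.
def scansEqual(left: Scan, right: Scan): Boolean = (left, right) match {
  case (l: SupportsRuntimeV2Filtering, r: SupportsRuntimeV2Filtering) =>
    l.equalToIgnoreRuntimeFilters(r)
  case _ => left == right
}

def scanHashCode(scan: Scan): Int = scan match {
  case s: SupportsRuntimeV2Filtering => s.hashCodeIgnoreRuntimeFilters()
  case _ => scan.hashCode()
}
```

This keeps stage-cache keys stable across the point where runtime filters get pushed into the scan, which is what allows the exchange to be reused.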
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
abellina commented on code in PR #43627: URL: https://github.com/apache/spark/pull/43627#discussion_r1394870890 ## core/src/main/scala/org/apache/spark/SparkEnv.scala: ## @@ -415,6 +418,11 @@ object SparkEnv extends Logging { advertiseAddress, blockManagerPort, numUsableCores, blockManagerMaster.driverEndpoint) // NB: blockManager is not valid until initialize() is called later. +// SPARK-45762 introduces a change where the ShuffleManager is initialized later +// in the SparkContext and Executor, to allow for custom ShuffleManagers defined +// in user jars. In the executor, the BlockManager uses a lazy val to obtain the +// shuffleManager from the SparkEnv. In the driver, the SparkEnv's shuffleManager Review Comment: Thanks @tgravescs. Handled both comments here: https://github.com/apache/spark/pull/43627/commits/6d002a361ac2c1dfad48ee530766c9b0a605696f
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813319975 Could you review this `Spark Standalone` documentation PR when you have some time, @huaxingao?
Re: [PR] [SPARK-45856] Move ArtifactManager from Spark Connect into SparkSession (sql/core) [spark]
dongjoon-hyun commented on code in PR #43735: URL: https://github.com/apache/spark/pull/43735#discussion_r1394847206 ## sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -243,6 +244,16 @@ class SparkSession private( @Unstable def streams: StreamingQueryManager = sessionState.streamingQueryManager + /** + * Returns an `ArtifactManager` that supports adding, managing and using session-scoped artifacts + * (jars, classfiles, etc). + * + * @since 3.5.1 Review Comment: This should be 4.0.0 because this PR is for Apache Spark 4.0.0, @vicennial.
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
tgravescs commented on code in PR #43627: URL: https://github.com/apache/spark/pull/43627#discussion_r1394837773 ## core/src/main/scala/org/apache/spark/SparkEnv.scala: ## @@ -415,6 +418,11 @@ object SparkEnv extends Logging { advertiseAddress, blockManagerPort, numUsableCores, blockManagerMaster.driverEndpoint) // NB: blockManager is not valid until initialize() is called later. +// SPARK-45762 introduces a change where the ShuffleManager is initialized later +// in the SparkContext and Executor, to allow for custom ShuffleManagers defined +// in user jars. In the executor, the BlockManager uses a lazy val to obtain the +// shuffleManager from the SparkEnv. In the driver, the SparkEnv's shuffleManager Review Comment: I think this comment is no longer true: the driver SparkEnv's shuffleManager is created after the plugin is initialized. ## core/src/main/scala/org/apache/spark/SparkEnv.scala: ## @@ -71,6 +70,12 @@ class SparkEnv ( val outputCommitCoordinator: OutputCommitCoordinator, val conf: SparkConf) extends Logging { + // We initialize the ShuffleManager later in SparkContext, and Executor, to allow Review Comment: ```suggestion // We initialize the ShuffleManager later in SparkContext and Executor to allow ```
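For context, a minimal standalone Scala sketch of the deferred-initialization pattern these review comments discuss; the names and signatures are assumptions, not the PR's actual code:

```scala
// Sketch: SparkEnv is constructed without a ShuffleManager; SparkContext (driver)
// or Executor fills it in later, once user jars are on the classpath, so a custom
// ShuffleManager defined in a user jar can be loaded.
trait ShuffleManager

class SparkEnvSketch {
  @volatile private var _shuffleManager: ShuffleManager = _

  // Consumers (e.g. the BlockManager) read this lazily, after initialization.
  def shuffleManager: ShuffleManager = _shuffleManager

  def initializeShuffleManager(create: () => ShuffleManager): Unit = {
    require(_shuffleManager == null, "ShuffleManager already initialized")
    _shuffleManager = create()
  }
}
```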
Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]
ueshin commented on PR #43682: URL: https://github.com/apache/spark/pull/43682#issuecomment-1813313871 Thanks! merging to master.
Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]
ueshin closed pull request #43682: [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table URL: https://github.com/apache/spark/pull/43682
Re: [PR] [SPARK-45868][CONNECT] Make sure `spark.table` use the same parser with vanilla spark [spark]
dongjoon-hyun commented on PR #43741: URL: https://github.com/apache/spark/pull/43741#issuecomment-1813308124 Merged to master. Thank you, @zhengruifeng and all!
Re: [PR] [SPARK-45868][CONNECT] Make sure `spark.table` use the same parser with vanilla spark [spark]
dongjoon-hyun closed pull request #43741: [SPARK-45868][CONNECT] Make sure `spark.table` use the same parser with vanilla spark URL: https://github.com/apache/spark/pull/43741
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813302230 > Thank you for fixing the documentation for K8s and Standalone :)

Thanks, but I'm going to handle the K8s part in a new JIRA because of the previous comment.
Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]
bjornjorgensen commented on PR #43822: URL: https://github.com/apache/spark/pull/43822#issuecomment-1813301006 Thank you @dongjoon-hyun
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
bjornjorgensen commented on PR #43814: URL: https://github.com/apache/spark/pull/43814#issuecomment-1813297066 Thank you for fixing the documentation for K8s and Standalone :)
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
bjornjorgensen commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394833447 ## docs/running-on-kubernetes.md: ## @@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for information on Spark config 3.0.0 - memoryOverheadFactor + spark.kubernetes.memoryOverheadFactor 0.1 -This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. +This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when spark.kubernetes.local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This will be overridden by the value set by spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor explicitly. Review Comment: yes, I did read the K8s part.
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394832622 ## docs/running-on-kubernetes.md: ## @@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for information on Spark config 3.0.0 - memoryOverheadFactor + spark.kubernetes.memoryOverheadFactor 0.1 -This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. +This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when spark.kubernetes.local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This will be overridden by the value set by spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor explicitly. Review Comment: Here.

```
$ git diff HEAD~2 --stat
 docs/spark-standalone.md | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)
```
Re: [PR] [SPARK-45925][SQL] Making SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec [spark]
ahshahid commented on PR #43807: URL: https://github.com/apache/spark/pull/43807#issuecomment-1813289805 @beliefer I think you may be right. In my other PR for broadcast-var pushdown, I am seeing an unmodified SubqueryAdaptiveBroadcastExec in the stage cache's keys. Maybe it is an issue in my code or something else; I will check my code again. So as of now, I think it makes sense to close this PR and also the other PR on SubqueryBroadcastHashExec.
Re: [PR] [SPARK-45925][SQL] Making SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec [spark]
ahshahid closed pull request #43807: [SPARK-45925][SQL] Making SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec URL: https://github.com/apache/spark/pull/43807
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
ahshahid closed pull request #43806: [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec URL: https://github.com/apache/spark/pull/43806
[PR] [SPARK-45942][Core] Only do the thread interruption check for putIterator on executors [spark]
huanliwang-db opened a new pull request, #43823: URL: https://github.com/apache/spark/pull/43823

### What changes were proposed in this pull request?
Only do the thread-interruption check for putIterator on executors.

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-45025 introduced graceful thread-interruption handling. However, there is an edge case: when a streaming query is stopped on the driver, it interrupts the stream execution thread. If the streaming query is doing memory-store operations on the driver and performs doPutIterator at the same time, the [unroll process will be broken](https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L224) and [return the used memory](https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L245-L247). This can result in a ClosedChannelException, as it falls into this [case clause](https://github.com/apache/spark/blob/aa646d3050028272f7333deaef52f20e6975e0ed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1614-L1622), which opens an I/O channel and persists the data to disk. However, because the thread is interrupted, the channel will be closed in `begin()` (https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L172) and a ClosedChannelException is thrown.

On executors, [the task will be killed if the thread is interrupted](https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L374); however, we don't do that on the driver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran MemoryStoreSuite:
```
[info] MemoryStoreSuite:
[info] - reserve/release unroll memory (36 milliseconds)
[info] - safely unroll blocks (70 milliseconds)
[info] - safely unroll blocks through putIteratorAsValues (10 milliseconds)
[info] - safely unroll blocks through putIteratorAsValues off-heap (21 milliseconds)
[info] - safely unroll blocks through putIteratorAsBytes (138 milliseconds)
[info] - PartiallySerializedBlock.valuesIterator (6 milliseconds)
[info] - PartiallySerializedBlock.finishWritingToStream (5 milliseconds)
[info] - multiple unrolls by the same thread (8 milliseconds)
[info] - lazily create a big ByteBuffer to avoid OOM if it cannot be put into MemoryStore (3 milliseconds)
[info] - put a small ByteBuffer to MemoryStore (3 milliseconds)
[info] - SPARK-22083: Release all locks in evictBlocksToFreeSpace (43 milliseconds)
[info] - put user-defined objects to MemoryStore and remove (5 milliseconds)
[info] - put user-defined objects to MemoryStore and clear (4 milliseconds)
[info] Run completed in 1 second, 587 milliseconds.
[info] Total number of tests run: 13
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No
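A hedged sketch of the guard described above (an illustration, not the patch itself):

```scala
// Sketch: only treat an interrupt as "task killed" inside an executor. On the
// driver, an interrupt (e.g. from stopping a streaming query) must not abort
// the unroll and must not poison the subsequent disk I/O.
def shouldAbortUnroll(isExecutor: Boolean): Boolean =
  isExecutor && Thread.currentThread().isInterrupted
```

Here `isExecutor` could be derived from `SparkEnv.get.executorId != SparkContext.DRIVER_IDENTIFIER`; that wiring is an assumption, not necessarily the patch's exact check.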
Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]
ahshahid commented on PR #43806: URL: https://github.com/apache/spark/pull/43806#issuecomment-1813288994 @beliefer I think you may be right. In my other PR for broadcast-var pushdown, I am seeing an unmodified SubqueryAdaptiveBroadcastExec in the stage cache's keys. Maybe it is an issue in my code or something else; I will check my code again. So as of now, I think it makes sense to close this PR and also the other PR on SubqueryBroadcastHashExec.
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
tgravescs commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1384051957 ## core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala: ## @@ -170,16 +170,16 @@ private[spark] object ResourceUtils extends Logging { // integer amount and the number of slots per address. For instance, if the amount is 0.5, // then we get (1, 2) back out. This indicates that for each 1 address, it has 2 slots per // address, which allows you to put 2 tasks on that address. Note if amount is greater - // than 1, then the number of slots per address has to be 1. This would indicate that a + // than 1, then the number of parts per address has to be 1. This would indicate that a // task would have multiple addresses assigned per task. This can be used for calculating // the number of tasks per executor -> (executorAmount * numParts) / (integer amount). // Returns tuple of (integer amount, numParts) def calculateAmountAndPartsForFraction(doubleAmount: Double): (Int, Int) = { -val parts = if (doubleAmount <= 0.5) { +val parts = if (doubleAmount <= 1.0) { Review Comment: did you move this check somewhere else?
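For readers following along, a simplified sketch of the fraction logic described in that code comment (validation and error handling from the real `ResourceUtils` are elided):

```scala
// Sketch: map a task resource amount to (integer amount, parts-per-address).
// 0.5 -> (1, 2): one address shared by two tasks; 2.0 -> (2, 1): two whole addresses.
def calculateAmountAndPartsForFraction(doubleAmount: Double): (Int, Int) =
  if (doubleAmount <= 1.0) (1, (1.0 / doubleAmount).toInt)
  else (doubleAmount.toInt, 1) // amounts > 1 are expected to be whole numbers

assert(calculateAmountAndPartsForFraction(0.5) == (1, 2))
assert(calculateAmountAndPartsForFraction(2.0) == (2, 1))
```

Tasks per executor then follows as (executorAmount * numParts) / (integer amount), as the quoted comment notes.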
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394826363

## docs/running-on-kubernetes.md:
@@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for information on Spark config
   3.0.0
-  memoryOverheadFactor
+  spark.kubernetes.memoryOverheadFactor
   0.1
-  This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
+  This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when spark.kubernetes.local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
   This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This will be overridden by the value set by spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor explicitly.

Review Comment: It seems that you are looking at the first commit. I removed the K8s part from this PR completely in the latest commit.
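As a side note on the config being documented, a hedged Scala sketch of the precedence the doc text describes; the values here are arbitrary examples, not recommendations:

```scala
import org.apache.spark.SparkConf

// Per the description above, the role-specific factors, when set explicitly,
// override the cluster-wide spark.kubernetes.memoryOverheadFactor.
val conf = new SparkConf()
  .set("spark.kubernetes.memoryOverheadFactor", "0.2")
  .set("spark.executor.memoryOverheadFactor", "0.4") // wins for executors
```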
Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]
dongjoon-hyun commented on code in PR #43814: URL: https://github.com/apache/spark/pull/43814#discussion_r1394824967

## docs/running-on-kubernetes.md:
@@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for information on Spark config
   3.0.0
-  memoryOverheadFactor
+  spark.kubernetes.memoryOverheadFactor
   0.1
-  This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
+  This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, various systems processes, and tmpfs-based local directories when spark.kubernetes.local.dirs.tmpfs is true. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
   This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This will be overridden by the value set by spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor explicitly.

Review Comment: This is only for `Spark Standalone` documentation, @bjornjorgensen 😄