[GitHub] [spark] HyukjinKwon commented on pull request #39400: [SPARK-41891][CONNECT][TESTS] Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_inlin
HyukjinKwon commented on PR #39400: URL: https://github.com/apache/spark/pull/39400#issuecomment-1371890753 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #39394: [SPARK-41575][SQL] Assign name to _LEGACY_ERROR_TEMP_2054
MaxGekk commented on code in PR #39394: URL: https://github.com/apache/spark/pull/39394#discussion_r1062189805 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -784,7 +784,7 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase { def taskFailedWhileWritingRowsError(cause: Throwable): Throwable = { new SparkException( - errorClass = "_LEGACY_ERROR_TEMP_2054", + errorClass = "TASK_WRITE_FAILED", messageParameters = Map("message" -> cause.getMessage), cause = cause) Review Comment: Do we really need to append `cause.getMessage` if we attach the `cause` exception? cc @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #39401: [SPARK-41893][BUILD] Publish SBOM artifacts
dongjoon-hyun commented on PR #39401: URL: https://github.com/apache/spark/pull/39401#issuecomment-1371884646 Ah, it seems that I missed some failures. I convert this as `Draft`. Let me dig this. ``` [WARNING] An unexpected issue occurred attempting to resolve the effective pom for org.xerial.snappy:snappy-java:1.1.8.4 org.apache.maven.project.ProjectBuildingException: Some problems were encountered while processing the POMs: [ERROR] Unknown packaging: bundle @ line 6, column 16 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #39401: [SPARK-41893][BUILD] Publish SBOM artifacts
dongjoon-hyun commented on PR #39401: URL: https://github.com/apache/spark/pull/39401#issuecomment-1371876411 cc @srowen and @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun opened a new pull request, #39402: [SPARK-41889][SQL] Attach root cause to invalidPatternError
panbingkun opened a new pull request, #39402: URL: https://github.com/apache/spark/pull/39402 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39401: [SPARK-41893][BUILD] Publish SBOM artifacts
dongjoon-hyun opened a new pull request, #39401: URL: https://github.com/apache/spark/pull/39401 ### What changes were proposed in this pull request? This PR aims to publish `SBOM` artifacts. ### Why are the changes needed? Here is an article to give some context. - https://www.activestate.com/blog/why-the-us-government-is-mandating-software-bill-of-materials-sbom/ Software Bill of Materials (SBOM) are additional artifacts containing the aggregate of all direct and transitive dependencies of a project. The US Government (based on NIST recommendations) currently accepts only the three most popular SBOM standards as valid, namely: [CycloneDX](https://cyclonedx.org/), [Software Identification (SWID) tag](https://csrc.nist.gov/projects/Software-Identification-SWID), [Software Package Data Exchange® (SPDX)](https://spdx.dev/). This PR uses [CycloneDX maven plugin](https://github.com/CycloneDX/cyclonedx-maven-plugin), a lightweight software bill of materials (SBOM) standard designed for use in application security contexts and supply chain component analysis. ### Does this PR introduce _any_ user-facing change? Yes, but dev-only changes. ### How was this patch tested? Manually test. ``` $ mvn install -DskipTests ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`
zhengruifeng commented on PR #39398: URL: https://github.com/apache/spark/pull/39398#issuecomment-1371869339 merged into master, thank you @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`
zhengruifeng closed pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions` URL: https://github.com/apache/spark/pull/39398 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on a diff in pull request #38356: [SPARK-40885] `Sort` may not take effect when it is the last 'Transform' operator
EnricoMi commented on code in PR #38356: URL: https://github.com/apache/spark/pull/38356#discussion_r1021126305 ## sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala: ## @@ -220,6 +220,23 @@ class PartitionedWriteSuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-40885: V1 write uses the sort with partitionBy operator") { +withTempPath { f => + Seq((20, 30, "partition"), (15, 20, "partition"), +(30, 70, "partition"), (18, 40, "partition")) +.toDF("id", "sort_col", "p") +.repartition(1) +.sortWithinPartitions("p", "sort_col") +.write +.partitionBy("p") Review Comment: > `partitionBy("p")` requires sorting the data by `p` per partition. Why does `write.partitionBy("p")` require sorting the data by `p` per partition? I understand why `df.groupBy($"p")` requires in-partition order by `p` (`GroupedIterator`). But `write.partitionBy("p")`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on pull request #39131: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
EnricoMi commented on PR #39131: URL: https://github.com/apache/spark/pull/39131#issuecomment-1371848352 > @EnricoMi thanks for the fix! which spark version starts to have this bug? This was introduced in Spark 3.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149
MaxGekk commented on code in PR #39258: URL: https://github.com/apache/spark/pull/39258#discussion_r1062158983 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ## @@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite { super .sparkConf .set(SQLConf.USE_V1_SOURCE_LIST, "csv") + + private val carsFile = "test-data/cars.csv" + + test("test for FAILFAST parsing mode on CSV v1") { +Seq(false, true).foreach { multiLine => + val exception = intercept[SparkException] { +spark.read + .format("csv") + .option("multiLine", multiLine) + .options(Map("header" -> "true", "mode" -> "failfast")) + .load(testFile(carsFile)).collect() + } + + checkError( +exception = exception.getCause.asInstanceOf[SparkException], +errorClass = "_LEGACY_ERROR_TEMP_2177", Review Comment: Could you explain why did you add the test for the error class in the PR. ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ## @@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite { super .sparkConf .set(SQLConf.USE_V1_SOURCE_LIST, "csv") + + private val carsFile = "test-data/cars.csv" Review Comment: The same is defined in the parent class (just make it as `protected`). Please, remove it. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -319,15 +319,17 @@ class UnivocityParser( throw BadRecordException( () => getCurrentInput, () => None, -QueryExecutionErrors.malformedCSVRecordError()) +QueryExecutionErrors.malformedCSVRecordError("")) } +val currentInput = getCurrentInput Review Comment: It is not used in regular cases, correct? Don't think we should introduce additional overhead. Please, use `getCurrentInput` directly in errors. ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ## @@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite { super .sparkConf .set(SQLConf.USE_V1_SOURCE_LIST, "csv") + + private val carsFile = "test-data/cars.csv" + + test("test for FAILFAST parsing mode on CSV v1") { +Seq(false, true).foreach { multiLine => + val exception = intercept[SparkException] { +spark.read + .format("csv") + .option("multiLine", multiLine) + .options(Map("header" -> "true", "mode" -> "failfast")) + .load(testFile(carsFile)).collect() + } + + checkError( +exception = exception.getCause.asInstanceOf[SparkException], +errorClass = "_LEGACY_ERROR_TEMP_2177", +parameters = Map("failFastMode" -> "FAILFAST") + ) +} + } } class CSVv2Suite extends CSVSuite { override protected def sparkConf: SparkConf = super .sparkConf .set(SQLConf.USE_V1_SOURCE_LIST, "") + + private val carsFile = "test-data/cars.csv" + + test("test for FAILFAST parsing mode on CSV v2") { +Seq(false, true).foreach { multiLine => + val exception = intercept[SparkException] { +spark.read + .format("csv") + .option("multiLine", multiLine) + .options(Map("header" -> "true", "mode" -> "failfast")) + .load(testFile(carsFile)).collect() + } + + checkError( +exception = exception.getCause.asInstanceOf[SparkException], +errorClass = "_LEGACY_ERROR_TEMP_2064", Review Comment: The same question like above. How this is related to assigning a name to `_LEGACY_ERROR_TEMP_2149`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #38163: [SPARK-40711][SQL] Add spill size metrics for window
ulysses-you commented on PR #38163: URL: https://github.com/apache/spark/pull/38163#issuecomment-1371842097 let me rebase this again -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on a diff in pull request #39131: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
EnricoMi commented on code in PR #39131: URL: https://github.com/apache/spark/pull/39131#discussion_r1062152718 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LeftSemiAntiJoinPushDownSuite.scala: ## @@ -46,7 +46,7 @@ class LeftSemiPushdownSuite extends PlanTest { val testRelation1 = LocalRelation($"d".int) val testRelation2 = LocalRelation($"e".int) - test("Project: LeftSemiAnti join pushdown") { + test("Project: LeftSemi join pushdown") { Review Comment: The term `LeftSemiAnti` is wrong and misleading for individual tests, correcting this while I am touching the file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #39377: [SPARK-41867][SQL] Selective predicate should respect InMemoryRelation
ulysses-you commented on code in PR #39377: URL: https://github.com/apache/spark/pull/39377#discussion_r1062152134 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala: ## @@ -167,44 +167,10 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join } val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat -val overhead = calculatePlanOverhead(otherPlan) +val overhead = collectLeafPlanOverhead(otherPlan).sum.toFloat estimatePruningSideSize > overhead } - /** - * Calculates a heuristic overhead of a logical plan. Normally it returns the total - * size in bytes of all scan relations. We don't count in-memory relation which uses - * only memory. - */ - private def calculatePlanOverhead(plan: LogicalPlan): Float = { -val (cached, notCached) = plan.collectLeaves().partition(p => p match { - case _: InMemoryRelation => true - case _ => false -}) -val scanOverhead = notCached.map(_.stats.sizeInBytes).sum.toFloat -val cachedOverhead = cached.map { - case m: InMemoryRelation if m.cacheBuilder.storageLevel.useDisk && Review Comment: this method moved into the new trait. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #39377: [SPARK-41867][SQL] Selective predicate should respect InMemoryRelation
ulysses-you commented on code in PR #39377: URL: https://github.com/apache/spark/pull/39377#discussion_r1062151959 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala: ## @@ -167,44 +167,10 @@ object PartitionPruning extends Rule[LogicalPlan] with PredicateHelper with Join } val estimatePruningSideSize = filterRatio * partPlan.stats.sizeInBytes.toFloat -val overhead = calculatePlanOverhead(otherPlan) +val overhead = collectLeafPlanOverhead(otherPlan).sum.toFloat estimatePruningSideSize > overhead } - /** - * Calculates a heuristic overhead of a logical plan. Normally it returns the total - * size in bytes of all scan relations. We don't count in-memory relation which uses - * only memory. - */ - private def calculatePlanOverhead(plan: LogicalPlan): Float = { -val (cached, notCached) = plan.collectLeaves().partition(p => p match { - case _: InMemoryRelation => true - case _ => false -}) -val scanOverhead = notCached.map(_.stats.sizeInBytes).sum.toFloat -val cachedOverhead = cached.map { - case m: InMemoryRelation if m.cacheBuilder.storageLevel.useDisk && Review Comment: the cached relation should be materialized right ? otherwise its statistics should be same with normal plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #39377: [SPARK-41867][SQL] Selective predicate should respect InMemoryRelation
ulysses-you commented on PR #39377: URL: https://github.com/apache/spark/pull/39377#issuecomment-1371839438 @cloud-fan good suggestion. Follow the current code path: 1. check if build side has selective predicate. Now it always returns ture if the cached relation is materialized 2. check if the statistics is match the requirements. Now both DPP and Runtime Filter use the same calculate overhead method which considers the cached relation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
LuciferYang commented on PR #39385: URL: https://github.com/apache/spark/pull/39385#issuecomment-1371825146 Rebased(merged https://github.com/apache/spark/pull/39226), please help to review if you have time, thanks @gengliangwang @dongjoon-hyun @techaddict -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
LuciferYang commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062138688 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1002,34 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} + +class SQLAppStatusListenerWithRocksDBBackendSuite extends SQLAppStatusListenerSuite { + private val storePath = Utils.createTempDir() Review Comment: still make `storePath` as class field and override `afterAll` to delete base `storePath` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #39389: [SPARK-41574][SQL] Assign name to _LEGACY_ERROR_TEMP_2009
MaxGekk commented on code in PR #39389: URL: https://github.com/apache/spark/pull/39389#discussion_r1062135726 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -378,10 +378,9 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase { ) } - def dataTypeOperationUnsupportedError(): SparkUnsupportedOperationException = { -new SparkUnsupportedOperationException( - errorClass = "_LEGACY_ERROR_TEMP_2009", - messageParameters = Map.empty) + def dataTypeOperationUnsupportedError(): Throwable = { +SparkException.internalError( + s"""Operation dataType is not supported""") Review Comment: ```suggestion "The operation `dataType` is not supported.") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #39305: [SPARK-41580][SQL] Assign name to _LEGACY_ERROR_TEMP_2137
MaxGekk closed pull request #39305: [SPARK-41580][SQL] Assign name to _LEGACY_ERROR_TEMP_2137 URL: https://github.com/apache/spark/pull/39305 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #39305: [SPARK-41580][SQL] Assign name to _LEGACY_ERROR_TEMP_2137
MaxGekk commented on PR #39305: URL: https://github.com/apache/spark/pull/39305#issuecomment-1371814622 +1, LGTM. Merging to master. Thank you, @itholic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe
beliefer commented on PR #39378: URL: https://github.com/apache/spark/pull/39378#issuecomment-1371811698 @HyukjinKwon @zhengruifeng Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #39281: [SPARK-41576][SQL] Assign name to _LEGACY_ERROR_TEMP_2051
MaxGekk closed pull request #39281: [SPARK-41576][SQL] Assign name to _LEGACY_ERROR_TEMP_2051 URL: https://github.com/apache/spark/pull/39281 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #39281: [SPARK-41576][SQL] Assign name to _LEGACY_ERROR_TEMP_2051
MaxGekk commented on PR #39281: URL: https://github.com/apache/spark/pull/39281#issuecomment-1371809041 +1, LGTM. Merging to master. Thank you, @itholic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe
HyukjinKwon closed pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe URL: https://github.com/apache/spark/pull/39378 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe
HyukjinKwon commented on PR #39378: URL: https://github.com/apache/spark/pull/39378#issuecomment-1371806726 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39400: [SPARK-41891][CONNECT][TESTS] Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls,
HyukjinKwon commented on code in PR #39400: URL: https://github.com/apache/spark/pull/39400#discussion_r1062125012 ## python/pyspark/sql/tests/connect/test_parity_functions.py: ## @@ -68,30 +60,14 @@ def test_date_add_function(self): def test_date_sub_function(self): super().test_date_sub_function() -@unittest.skip("Fails in Spark Connect, should enable.") -def test_dayofweek(self): -super().test_dayofweek() - @unittest.skip("Fails in Spark Connect, should enable.") def test_explode(self): super().test_explode() -@unittest.skip("Fails in Spark Connect, should enable.") -def test_first_last_ignorenulls(self): -super().test_first_last_ignorenulls() - -@unittest.skip("Fails in Spark Connect, should enable.") -def test_function_parity(self): Review Comment: Let's exclude this. It uses `SparkContect` that Spark Connect doesn't support. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39397: [MINOR][CONNECT] Fix typos in connect/plan.py
HyukjinKwon closed pull request #39397: [MINOR][CONNECT] Fix typos in connect/plan.py URL: https://github.com/apache/spark/pull/39397 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39397: [MINOR][CONNECT] Fix typos in connect/plan.py
HyukjinKwon commented on PR #39397: URL: https://github.com/apache/spark/pull/39397#issuecomment-1371800977 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, float or int
HyukjinKwon closed pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, float or int URL: https://github.com/apache/spark/pull/39393 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, float or int
HyukjinKwon commented on PR #39393: URL: https://github.com/apache/spark/pull/39393#issuecomment-1371800022 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`
zhengruifeng commented on PR #39398: URL: https://github.com/apache/spark/pull/39398#issuecomment-1371784933 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dengziming commented on pull request #39388: [SPARK-41354][CONNECT][PYTHON] implement RepartitionByExpression
dengziming commented on PR #39388: URL: https://github.com/apache/spark/pull/39388#issuecomment-1371775675 > just to confirm, the proto `RepartitionByExpression repartition_by_expression = 27` can support both > > `def repartition(self, *cols: "ColumnOrName")` `def repartitionByRange(self, *cols: "ColumnOrName")` Yes, you are right @zhengruifeng . Both `repartition(cols)` and `repartitionByRange(cols)` will be transformed to `RepartitionByExpression`, this can be seen in `Dataset.scala` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #39395: [SQL] Use foldLeft for DeduplicateRelations
cloud-fan commented on PR #39395: URL: https://github.com/apache/spark/pull/39395#issuecomment-1371775038 looks fine if tests pass -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] techaddict opened a new pull request, #39400: [SPARK-41891][CONNECT][TESTS] Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_funct
techaddict opened a new pull request, #39400: URL: https://github.com/apache/spark/pull/39400 ### What changes were proposed in this pull request? Enabling tests in connect/test_parity_functions.py ### Why are the changes needed? Improved coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New Tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files
cloud-fan closed pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files URL: https://github.com/apache/spark/pull/36700 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files
cloud-fan commented on PR #36700: URL: https://github.com/apache/spark/pull/36700#issuecomment-1371774108 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38163: [SPARK-40711][SQL] Add spill size metrics for window
cloud-fan commented on PR #38163: URL: https://github.com/apache/spark/pull/38163#issuecomment-1371773428 LGTM if all tests pass -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`
LuciferYang commented on PR #39226: URL: https://github.com/apache/spark/pull/39226#issuecomment-1371772823 Thanks @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang closed pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`
gengliangwang closed pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()` URL: https://github.com/apache/spark/pull/39226 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`
gengliangwang commented on PR #39226: URL: https://github.com/apache/spark/pull/39226#issuecomment-1371770833 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #39399: [SPARK-41890][CORE][SQL][UI] Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13
LuciferYang opened a new pull request, #39399: URL: https://github.com/apache/spark/pull/39399 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode
shrprasa commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1371749436 @dongjoon-hyun @holdenk Can you please review this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`
zhengruifeng opened a new pull request, #39398: URL: https://github.com/apache/spark/pull/39398 ### What changes were proposed in this pull request? Add the missing ordering parameter in `Sort` and `sortWithinPartitions` ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? enabled doctests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`
zhengruifeng commented on PR #39396: URL: https://github.com/apache/spark/pull/39396#issuecomment-1371745236 thank you @HyukjinKwon for reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`
HyukjinKwon closed pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show` URL: https://github.com/apache/spark/pull/39396 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`
HyukjinKwon commented on PR #39396: URL: https://github.com/apache/spark/pull/39396#issuecomment-1371739702 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] techaddict commented on pull request #39397: [MINOR] fix typos
techaddict commented on PR #39397: URL: https://github.com/apache/spark/pull/39397#issuecomment-1371734926 cc: @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] techaddict opened a new pull request, #39397: [MINOR] fix typos
techaddict opened a new pull request, #39397: URL: https://github.com/apache/spark/pull/39397 ### What changes were proposed in this pull request? Fixing typos in connect/plan.py ### Why are the changes needed? Typos ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? just fixing typos -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #39260: [SPARK-41579][SQL] Assign name to _LEGACY_ERROR_TEMP_1249
itholic commented on code in PR #39260: URL: https://github.com/apache/spark/pull/39260#discussion_r1062073949 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -2405,22 +2405,24 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase { messageParameters = Map.empty) } - def cmdOnlyWorksOnPartitionedTablesError(cmd: String, tableIdentWithDB: String): Throwable = { + def cmdOnlyWorksOnPartitionedTablesError( + operation: String, + tableIdentWithDB: String): Throwable = { new AnalysisException( - errorClass = "_LEGACY_ERROR_TEMP_1249", + errorClass = "NOT_A_PARTITIONED_TABLE", messageParameters = Map( -"cmd" -> cmd, +"operation" -> toSQLStmt(operation), "tableIdentWithDB" -> tableIdentWithDB)) } - def cmdOnlyWorksOnTableWithLocationError(cmd: String, tableIdentWithDB: String): Throwable = { + def cmdOnlyWorksOnTableWithLocationError( Review Comment: Oh... yeah seems like we should introduce new error class. Just turned this error into `_LEGACY_ERROR_TEMP_2446`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #39282: [SPARK-41581][SQL] Assign name to _LEGACY_ERROR_TEMP_1230
itholic commented on code in PR #39282: URL: https://github.com/apache/spark/pull/39282#discussion_r1062069986 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala: ## @@ -680,6 +681,18 @@ class QueryCompilationErrorsSuite context = ExpectedContext("", "", 7, 13, "CAST(1)") ) } + + test("NEGATIVE_SCALE_NOT_ALLOWED: negative scale for Decimal is not allowed") { Review Comment: Thanks! Moved and migrate the existing test into `checkError`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] techaddict commented on a diff in pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int
techaddict commented on code in PR #39393: URL: https://github.com/apache/spark/pull/39393#discussion_r1062062273 ## python/pyspark/sql/connect/dataframe.py: ## @@ -480,9 +480,10 @@ def to_jcols( def hint(self, name: str, *params: Any) -> "DataFrame": for param in params: -if param is not None and not isinstance(param, (int, str)): +if param is not None and not isinstance(param, (int, str, float, list)): Review Comment: Awesome, will create a JIRA to add list later -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #38163: [SPARK-40711][SQL] Add spill size metrics for window
ulysses-you commented on code in PR #38163: URL: https://github.com/apache/spark/pull/38163#discussion_r1062060973 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala: ## @@ -337,6 +338,7 @@ case class WindowInPandasExec( if (!found) { // clear final partition buffer.clear() +spillSize += buffer.spillSize Review Comment: `ExternalAppendOnlyUnsafeRowArray` supported to report spill size since https://github.com/apache/spark/pull/34999 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int
HyukjinKwon commented on PR #39393: URL: https://github.com/apache/spark/pull/39393#issuecomment-1371708907 yeah that's fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int
HyukjinKwon commented on code in PR #39393: URL: https://github.com/apache/spark/pull/39393#discussion_r1062060563 ## python/pyspark/sql/connect/dataframe.py: ## @@ -480,9 +480,10 @@ def to_jcols( def hint(self, name: str, *params: Any) -> "DataFrame": for param in params: -if param is not None and not isinstance(param, (int, str)): +if param is not None and not isinstance(param, (int, str, float, list)): Review Comment: Let's add `float` only for now then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #39277: [SPARK-41708][SQL] Pull v1write information to `WriteFiles`
ulysses-you commented on code in PR #39277: URL: https://github.com/apache/spark/pull/39277#discussion_r1062052748 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/V1WritesHiveUtils.scala: ## @@ -105,4 +112,164 @@ trait V1WritesHiveUtils { .map(_ => Map(BucketingUtils.optionForHiveCompatibleBucketWrite -> "true")) .getOrElse(Map.empty) } + + def setupCompression( + fileSinkConf: FileSinkDesc, + hadoopConf: Configuration, + sparkSession: SparkSession): Unit = { +val isCompressed = + fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT) match { +case formatName if formatName.endsWith("orcoutputformat") => + // For ORC,"mapreduce.output.fileoutputformat.compress", + // "mapreduce.output.fileoutputformat.compress.codec", and + // "mapreduce.output.fileoutputformat.compress.type" + // have no impact because it uses table properties to store compression information. + false +case _ => hadoopConf.get("hive.exec.compress.output", "false").toBoolean + } + +if (isCompressed) { + hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true") + fileSinkConf.setCompressed(true) + fileSinkConf.setCompressCodec(hadoopConf +.get("mapreduce.output.fileoutputformat.compress.codec")) + fileSinkConf.setCompressType(hadoopConf +.get("mapreduce.output.fileoutputformat.compress.type")) +} else { + // Set compression by priority + HiveOptions.getHiveWriteCompression(fileSinkConf.getTableInfo, sparkSession.sessionState.conf) +.foreach { case (compression, codec) => hadoopConf.set(compression, codec) } +} + } + + /** + * Return two paths: + * 1. The first path is `stagingDir` which can be the parent path of `externalTmpPath` + * 2. The second path is `externalTmpPath`, e.g. `$stagingDir/-ext-1` + * The call side should create `stagingDir` before using `externalTmpPath` and + * delete `stagingDir` at the end. Review Comment: wrapped using `HiveTempPath` since it would be used by `InsertIntoHiveDirCommand` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] techaddict commented on pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int
techaddict commented on PR #39393: URL: https://github.com/apache/spark/pull/39393#issuecomment-1371689457 @HyukjinKwon After spending some time with this, looks like the change is much bigger Proto Message Hint expected parameters to be repeated literal https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/relations.proto#L698-L710 adding list to this would require more changes in proto definition, I'm not super familiar with proto3, but it doesn't support extending and we can't do repeated oneof either. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
LuciferYang commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062051944 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} Review Comment: OK, let me fixed it together after https://github.com/apache/spark/pull/39226 merged -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
LuciferYang commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062051333 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala: ## @@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession with BeforeAndAfter { } } +class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite { + override protected def createStatusStore: SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} + +class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite { + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup Review Comment: Thanks for your review and guidance @dongjoon-hyun , let's wait https://github.com/apache/spark/pull/39226, then I think we can clear the TODO in this pr -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
LuciferYang commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062050562 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + protected def createStatusStore(): SQLAppStatusStore = { Review Comment: Thanks @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39357: [SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper
LuciferYang commented on PR #39357: URL: https://github.com/apache/spark/pull/39357#issuecomment-1371679675 Thanks @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39391: [SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15
LuciferYang commented on PR #39391: URL: https://github.com/apache/spark/pull/39391#issuecomment-1371678208 has been re-triggered the failed task -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`
LuciferYang commented on PR #39226: URL: https://github.com/apache/spark/pull/39226#issuecomment-1371675917 @gengliangwang has been re-triggered the failed task, should not related to this pr ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files
AngersZh commented on PR #36700: URL: https://github.com/apache/spark/pull/36700#issuecomment-1371674024 ping @cloud-fan @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on a diff in pull request #39383: [SPARK-41780][SQL] Should throw INVALID_PARAMETER_VALUE when the parameters `regexp` in regexp_replace is invalid
panbingkun commented on code in PR #39383: URL: https://github.com/apache/spark/pull/39383#discussion_r1062048545 ## sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala: ## @@ -663,4 +664,18 @@ class StringFunctionsSuite extends QueryTest with SharedSparkSession { start = 7, stop = 47)) } + + test("SPARK-41780: INVALID_PARAMETER_VALUE - invalid parameters `regexp` in regexp_replace") { Review Comment: Ok, Let me use `checkErrorInExpression` replace `checkExceptionInExpression ` to check exception. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
gengliangwang commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062043632 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala: ## @@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession with BeforeAndAfter { } } +class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite { + override protected def createStatusStore: SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} + +class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite { + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup Review Comment: FYI I am going to merge https://github.com/apache/spark/pull/39226 right after the test passes. @LuciferYang please wait until https://github.com/apache/spark/pull/39226 is merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
gengliangwang commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062043632 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala: ## @@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession with BeforeAndAfter { } } +class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite { + override protected def createStatusStore: SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} + +class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite { + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup Review Comment: FYI I am going to merge https://github.com/apache/spark/pull/39226 first. @LuciferYang please wait until https://github.com/apache/spark/pull/39226 is merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #39268: [SPARK-41752][SQL][UI] Group nested executions under the root execution
gengliangwang commented on code in PR #39268: URL: https://github.com/apache/spark/pull/39268#discussion_r1062033642 ## sql/core/src/test/scala/org/apache/spark/status/api/v1/sql/SqlResourceSuite.scala: ## @@ -82,6 +82,7 @@ object SqlResourceSuite { new SQLExecutionUIData( executionId = 0, + rootExecutionId = 0, Review Comment: For testing purpose, let's use a different value from `executionId` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #39268: [SPARK-41752][SQL][UI] Group nested executions under the root execution
gengliangwang commented on code in PR #39268: URL: https://github.com/apache/spark/pull/39268#discussion_r1062032939 ## sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala: ## @@ -74,6 +74,7 @@ class SQLExecutionUIDataSerializer extends ProtobufSerDe { new SQLExecutionUIData( executionId = ui.getExecutionId, + rootExecutionId = ui.getExecutionId, Review Comment: This should be ui.getRootExecutionId after updating the protobuf definition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #39268: [SPARK-41752][SQL][UI] Group nested executions under the root execution
gengliangwang commented on code in PR #39268: URL: https://github.com/apache/spark/pull/39268#discussion_r1062032402 ## sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala: ## @@ -43,6 +43,8 @@ case class SparkListenerSQLAdaptiveSQLMetricUpdates( @DeveloperApi case class SparkListenerSQLExecutionStart( executionId: Long, +// if the execution is a root, then rootExecutionId == executionId +rootExecutionId: Long, Review Comment: We need to refactor the code change in https://github.com/apache/spark/blob/master/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto#L387 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`
gengliangwang commented on PR #39226: URL: https://github.com/apache/spark/pull/39226#issuecomment-1371644321 @LuciferYang The failed test doesn't seem related. Just to double confirm, could you retrigger it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`
zhengruifeng opened a new pull request, #39396: URL: https://github.com/apache/spark/pull/39396 ### What changes were proposed in this pull request? enable a group of doctests ### Why are the changes needed? for test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? enabled tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #39277: [SPARK-41708][SQL] Pull v1write information to `WriteFiles`
ulysses-you commented on code in PR #39277: URL: https://github.com/apache/spark/pull/39277#discussion_r1062020277 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala: ## @@ -53,13 +59,17 @@ case class WriteFiles(child: LogicalPlan) extends UnaryNode { /** * Responsible for writing files. */ -case class WriteFilesExec(child: SparkPlan) extends UnaryExecNode { +case class WriteFilesExec( +child: SparkPlan, +fileFormat: FileFormat, +partitionColumns: Seq[Attribute], +bucketSpec: Option[BucketSpec], +options: Map[String, String], +staticPartitions: TablePartitionSpec) extends UnaryExecNode { override def output: Seq[Attribute] = Seq.empty - override protected def doExecuteWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage] = { -assert(writeSpec.isInstanceOf[WriteFilesSpec]) -val writeFilesSpec: WriteFilesSpec = writeSpec.asInstanceOf[WriteFilesSpec] - + override protected def doExecuteWrite( + writeFilesSpec: WriteFilesSpec): RDD[WriterCommitMessage] = { Review Comment: Seems it's a bit hard. look at the current information: ```scala case class WriteFilesSpec( description: WriteJobDescription, committer: FileCommitProtocol, concurrentOutputWriterSpecFunc: SparkPlan => Option[ConcurrentOutputWriterSpec]) ``` - `ConcurrentOutputWriterSpec` and `FileCommitProtocol` contain the output spec so we can not replace them - `WriteJobDescription` contains many information which includes what we pull out, but if we want to reduce something inside `WriteJobDescription`, we need to create a new class to hold others. I'm not sure it's worth to do that. ```scala class WriteJobDescription( val uuid: String, val serializableHadoopConf: SerializableConfiguration, val outputWriterFactory: OutputWriterFactory, val allColumns: Seq[Attribute], val dataColumns: Seq[Attribute], val partitionColumns: Seq[Attribute], val bucketSpec: Option[WriterBucketSpec], val path: String, val customPartitionLocations: Map[TablePartitionSpec, String], val maxRecordsPerFile: Long, val timeZoneId: String, val statsTrackers: Seq[WriteJobStatsTracker]) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39388: [SPARK-41354][CONNECT][PYTHON] implement RepartitionByExpression
zhengruifeng commented on PR #39388: URL: https://github.com/apache/spark/pull/39388#issuecomment-1371624171 just to confirm, the proto `RepartitionByExpression repartition_by_expression = 27` can support both `def repartition(self, *cols: "ColumnOrName")` `def repartitionByRange(self, *cols: "ColumnOrName")` right? @dengziming -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe
zhengruifeng commented on PR #39378: URL: https://github.com/apache/spark/pull/39378#issuecomment-1371622584 > Shall we fix `TODO(SPARK-41821): Fix DataFrame.describe` below? You can remove: > > ``` > # TODO(SPARK-41821): Fix DataFrame.describe > del pyspark.sql.connect.dataframe.DataFrame.describe.__doc__ > ``` +1, please also enable the doctest -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
dongjoon-hyun commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062012515 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} +class SQLAppStatusListenerWithRocksDBBackendSuite extends SQLAppStatusListenerSuite { + + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup Review Comment: Use a new real JIRA ID with `IDed TODO` style. ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} +class SQLAppStatusListenerWithRocksDBBackendSuite extends SQLAppStatusListenerSuite { + + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup + private var storePath: File = _ + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +storePath = Utils.createTempDir() +conf.set(LIVE_UI_LOCAL_STORE_DIR, storePath.getCanonicalPath) +val appStatusStore = AppStatusStore.createLiveStore(conf) +kvstore = appStatusStore.store.asInstanceOf[ElementTrackingStore] +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } + + // TODO: SPARK-41882 remove this method after RocksDB can automatically cleanup Review Comment: ditto. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
dongjoon-hyun commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062012227 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} Review Comment: New line is needed after this. ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala: ## @@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends SharedSparkSession with JsonTestUtils } } +class SQLAppStatusListenerWithInMemoryStoreSuite extends SQLAppStatusListenerSuite { + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} +class SQLAppStatusListenerWithRocksDBBackendSuite extends SQLAppStatusListenerSuite { + Review Comment: Remove empty line here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
dongjoon-hyun commented on code in PR #39385: URL: https://github.com/apache/spark/pull/39385#discussion_r1062012012 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala: ## @@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession with BeforeAndAfter { } } +class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite { + override protected def createStatusStore: SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} + +class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite { + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup Review Comment: Instead of this style, please create a new real JIRA ID and point it by using `IDed TODO` style, `TODO(SPARK-12345)`. ## sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala: ## @@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession with BeforeAndAfter { } } +class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite { + override protected def createStatusStore: SQLAppStatusStore = { +val conf = sparkContext.conf +kvstore = new ElementTrackingStore(new InMemoryStore, conf) +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } +} + +class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite { + // TODO: SPARK-41882 remove this field after RocksDB can automatically cleanup + private var storePath: File = _ + + override protected def createStatusStore(): SQLAppStatusStore = { +val conf = sparkContext.conf +storePath = Utils.createTempDir() +conf.set(LIVE_UI_LOCAL_STORE_DIR, storePath.getCanonicalPath) +val appStatusStore = AppStatusStore.createLiveStore(conf) +kvstore = appStatusStore.store.asInstanceOf[ElementTrackingStore] +val listener = new SQLAppStatusListener(conf, kvstore, live = true) +new SQLAppStatusStore(kvstore, Some(listener)) + } + + // TODO: SPARK-41882 remove this method after RocksDB can automatically cleanup Review Comment: ditto. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields
xkrogen commented on PR #38660: URL: https://github.com/apache/spark/pull/38660#issuecomment-1371607231 Merged into latest master to resolve conflicts. @allisonwang-db or @cloud-fan , any thoughts/comments on the latest diff? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe
beliefer commented on PR #39378: URL: https://github.com/apache/spark/pull/39378#issuecomment-1371606490 ping @zhengruifeng cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tedyu commented on pull request #39395: [SQL] Use foldLeft for DeduplicateRelations
tedyu commented on PR #39395: URL: https://github.com/apache/spark/pull/39395#issuecomment-1371583867 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tedyu opened a new pull request, #39395: [SQL] Use foldLeft for DeduplicateRelations
tedyu opened a new pull request, #39395: URL: https://github.com/apache/spark/pull/39395 ### What changes were proposed in this pull request? This PR uses `foldLeft` in `DeduplicateRelations` for better performance. ### Why are the changes needed? `foldRight` is not as performant as `foldLeft`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rithwik-db commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
rithwik-db commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061992443 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061991762 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] rithwik-db commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
rithwik-db commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061991704 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file Review Comment: Yeah, documentation will need to constantly evolve for this project. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rithwik-db commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
rithwik-db commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061991285 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] rithwik-db commented on a diff in pull request #39146: [WIP][SPARK-41589][PYTHON][ML] PyTorch Distributor Baseline API Changes
rithwik-db commented on code in PR #39146: URL: https://github.com/apache/spark/pull/39146#discussion_r1061990731 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,287 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import math +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +class Distributor: +def __init__( +self, +num_processes: int = 1, +local_mode: bool = True, +use_gpu: bool = True, +spark: Optional[SparkSession] = None, +): +self.num_processes = num_processes +self.local_mode = local_mode +self.use_gpu = use_gpu +if spark: +self.spark = spark +else: +self.spark = SparkSession.builder.getOrCreate() +self.sc = self.spark.sparkContext +self.num_tasks = self._get_num_tasks() +self.ssl_conf = None + +def _get_num_tasks(self) -> int: +""" +Returns the number of Spark tasks to use for distributed training + +Returns +--- +The number of Spark tasks to use for distributed training +""" +if self.use_gpu: +key = "spark.task.resource.gpu.amount" +if self.sc.getConf().contains(key): +if gpu_amount_raw := self.sc.getConf().get(key): # mypy error?? +task_gpu_amount = int(gpu_amount_raw) +else: +task_gpu_amount = 1 # for single node clusters +if task_gpu_amount < 1: +raise ValueError( +f"The Spark conf `{key}` has a value " +f"of {task_gpu_amount} but it " +"should not have a value less than 1." +) +return math.ceil(self.num_processes / task_gpu_amount) +return self.num_processes + +def _validate_input_params(self) -> None: +if self.num_processes <= 0: +raise ValueError("num_proccesses has to be a positive integer") + +def _check_encryption(self) -> None: +"""Checks to see if the user requires encrpytion of data. +If required, throw an exception since we don't support that. + +Raises +-- +NotImplementedError +Thrown when the user doesn't use PyTorchDistributor +Exception +Thrown when the user requires ssl encryption +""" +if not "ssl_conf": +raise Exception( +"Distributor doesn't have this functionality. Use PyTorchDistributor instead." +) +is_ssl_enabled = get_conf_boolean(self.sc, "spark.ssl.enabled", "false") +ignore_ssl = get_conf_boolean(self.sc, self.ssl_conf, "false") # type: ignore +
[GitHub] [spark] github-actions[bot] commented on pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files
github-actions[bot] commented on PR #36700: URL: https://github.com/apache/spark/pull/36700#issuecomment-1371573628 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36052: [SPARK-38777][YARN] Add `bin/spark-submit --kill / --status` support for yarn
github-actions[bot] commented on PR #36052: URL: https://github.com/apache/spark/pull/36052#issuecomment-1371573647 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #37899: [SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException
github-actions[bot] commented on PR #37899: URL: https://github.com/apache/spark/pull/37899#issuecomment-1371573567 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061987077 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061987077 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061985290 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061983056 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061979910 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: +"""Returns the expected torchrun command based on the input. + +Parameters +-- +input_params : dict[str, Any] +The dictionary of the input parameters of the distributor. The most relevant params +are local_mode and num_processes. +train_path : str +The path to the (potentially autogenerated) train.py file +args: *args +The input arguments to the train.py file. + +Returns +--- +str +The output torchrun command +""" +local_mode = input_params["local_mode"] +num_processes = input_params["num_processes"] + +if local_mode: +standalone = ["--standalone", "--nnodes=1"] +processes_per_node = num_processes +else: +master_addr, master_port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] +node_rank = os.environ["RANK"] +standalone = [ +f"--nnodes={num_processes}", +f"--node_rank={node_rank}", +f"--rdzv_endpoint={master_addr}:{master_port}", +"--rdzv_id=0", +] # TODO: setup random ID that is gleaned from env variables +processes_per_node = 1 + +args_string = list(map(str, args)) # converting all args to strings + +return ( +["torchrun"] ++ standalone +
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU
lu-wang-dl commented on code in PR #39188: URL: https://github.com/apache/spark/pull/39188#discussion_r1061979807 ## python/pyspark/ml/torch/distributor.py: ## @@ -0,0 +1,491 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import collections +import ctypes +import math +import os +import random +import re +import signal +import sys +import subprocess +import time +from typing import Union, Callable, Optional, Any +import warnings + +from pyspark.sql import SparkSession +from pyspark.context import SparkContext + + +# Moved the util functions to this file for now +# TODO(SPARK-41589): will move the functions and tests to an external file +# once we are in agreement about which functions should be in utils.py +def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool: +"""Get the conf "key" from the given spark context, +or return the default value if the conf is not set. +This expects the conf value to be a boolean or string; +if the value is a string, this checks for all capitalization +patterns of "true" and "false" to match Scala. + +Parameters +-- +sc : SparkContext +The SparkContext for the distributor. +key : str +string for conf name +default_value : str +default value for the conf value for the given key + +Returns +--- +bool +Returns the boolean value that corresponds to the conf + +Raises +-- +Exception +Thrown when the conf value is not a boolean +""" +val = sc.getConf().get(key, default_value) +lowercase_val = val.lower() +if lowercase_val == "true": +return True +if lowercase_val == "false": +return False +raise Exception( +"get_conf_boolean expected a boolean conf " +"value but found value of type {} " +"with value: {}".format(type(val), val) +) + + +def get_gpus_owned(addresses: list[str]) -> list[str]: +""" +Gets the number of GPUs that Spark scheduled to the calling task. +Returns: +The number of GPUs that Spark scheduled to the calling task. +""" +CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES" +pattern = re.compile("^[1-9][0-9]*|0$") +if any(not pattern.match(address) for address in addresses): +raise ValueError( +f"Found GPU addresses {addresses} which " +"are not all in the correct format " +"for CUDA_VISIBLE_DEVICES, which requires " +"integers with no zero padding." +) +if CUDA_VISIBLE_DEVICES in os.environ: +gpu_indices = list(map(int, addresses)) +gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",") +gpu_owned = [gpu_list[i] for i in gpu_indices] +return gpu_owned +return addresses + + +def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]: Review Comment: Mark these as internal function: `_create_torchrun_command`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39382: [SPARK-41878][CONNECT][TESTS] pyspark.sql.tests.test_dataframe - Add JIRAs or messages for skipped messages
HyukjinKwon closed pull request #39382: [SPARK-41878][CONNECT][TESTS] pyspark.sql.tests.test_dataframe - Add JIRAs or messages for skipped messages URL: https://github.com/apache/spark/pull/39382 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39382: [SPARK-41878][CONNECT][TESTS] pyspark.sql.tests.test_dataframe - Add JIRAs or messages for skipped messages
HyukjinKwon commented on PR #39382: URL: https://github.com/apache/spark/pull/39382#issuecomment-1371556222 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39386: [SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make `DataFrame.collect` handle None/NaN/Array/Binary porperly
HyukjinKwon closed pull request #39386: [SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make `DataFrame.collect` handle None/NaN/Array/Binary porperly URL: https://github.com/apache/spark/pull/39386 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39386: [SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make `DataFrame.collect` handle None/NaN/Array/Binary porperly
HyukjinKwon commented on PR #39386: URL: https://github.com/apache/spark/pull/39386#issuecomment-1371555616 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org