[GitHub] [spark] HyukjinKwon commented on pull request #39400: [SPARK-41891][CONNECT][TESTS] Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_inlin

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39400:
URL: https://github.com/apache/spark/pull/39400#issuecomment-1371890753

   Merged to master.





[GitHub] [spark] MaxGekk commented on a diff in pull request #39394: [SPARK-41575][SQL] Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-04 Thread GitBox


MaxGekk commented on code in PR #39394:
URL: https://github.com/apache/spark/pull/39394#discussion_r1062189805


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##
@@ -784,7 +784,7 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase {
 
   def taskFailedWhileWritingRowsError(cause: Throwable): Throwable = {
 new SparkException(
-  errorClass = "_LEGACY_ERROR_TEMP_2054",
+  errorClass = "TASK_WRITE_FAILED",
   messageParameters = Map("message" -> cause.getMessage),
   cause = cause)

Review Comment:
   Do we really need to append `cause.getMessage` if we attach the `cause` 
exception? cc @srielau 
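   A tiny Scala sketch of the point being raised (plain JVM exceptions used as a stand-in for `SparkException`; nothing Spark-specific is assumed):
   ```
   // Once the cause is attached, its message is still reachable through the
   // exception chain, so also embedding cause.getMessage in the message
   // parameters may duplicate information.
   val cause = new java.io.IOException("disk quota exceeded")
   val wrapped = new RuntimeException("TASK_WRITE_FAILED", cause)
   assert(wrapped.getCause.getMessage == "disk quota exceeded")
   ```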






[GitHub] [spark] dongjoon-hyun commented on pull request #39401: [SPARK-41893][BUILD] Publish SBOM artifacts

2023-01-04 Thread GitBox


dongjoon-hyun commented on PR #39401:
URL: https://github.com/apache/spark/pull/39401#issuecomment-1371884646

   Ah, it seems that I missed some failures. I will convert this to `Draft`. Let me dig into this.
   ```
   [WARNING] An unexpected issue occurred attempting to resolve the effective 
pom for  org.xerial.snappy:snappy-java:1.1.8.4
   org.apache.maven.project.ProjectBuildingException: Some problems were 
encountered while processing the POMs:
   [ERROR] Unknown packaging: bundle @ line 6, column 16
   ```





[GitHub] [spark] dongjoon-hyun commented on pull request #39401: [SPARK-41893][BUILD] Publish SBOM artifacts

2023-01-04 Thread GitBox


dongjoon-hyun commented on PR #39401:
URL: https://github.com/apache/spark/pull/39401#issuecomment-1371876411

   cc @srowen and @HyukjinKwon 





[GitHub] [spark] panbingkun opened a new pull request, #39402: [SPARK-41889][SQL] Attach root cause to invalidPatternError

2023-01-04 Thread GitBox


panbingkun opened a new pull request, #39402:
URL: https://github.com/apache/spark/pull/39402

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] dongjoon-hyun opened a new pull request, #39401: [SPARK-41893][BUILD] Publish SBOM artifacts

2023-01-04 Thread GitBox


dongjoon-hyun opened a new pull request, #39401:
URL: https://github.com/apache/spark/pull/39401

   ### What changes were proposed in this pull request?
   
   This PR aims to publish `SBOM` artifacts.
   
   ### Why are the changes needed?
   
   Here is an article to give some context.
   - 
https://www.activestate.com/blog/why-the-us-government-is-mandating-software-bill-of-materials-sbom/
   
   A Software Bill of Materials (SBOM) is an additional artifact containing the aggregate of all direct and transitive dependencies of a project. The US Government (based on NIST recommendations) currently accepts only the three most popular SBOM standards as valid, namely [CycloneDX](https://cyclonedx.org/), [Software Identification (SWID) tag](https://csrc.nist.gov/projects/Software-Identification-SWID), and [Software Package Data Exchange® (SPDX)](https://spdx.dev/).
   
   This PR uses the [CycloneDX Maven plugin](https://github.com/CycloneDX/cyclonedx-maven-plugin), which implements CycloneDX, a lightweight SBOM standard designed for use in application security contexts and supply chain component analysis.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but dev-only changes.
   
   ### How was this patch tested?
   
   Manually test.
   ```
   $ mvn install -DskipTests
   ```





[GitHub] [spark] zhengruifeng commented on pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`

2023-01-04 Thread GitBox


zhengruifeng commented on PR #39398:
URL: https://github.com/apache/spark/pull/39398#issuecomment-1371869339

   merged into master, thank you @HyukjinKwon 





[GitHub] [spark] zhengruifeng closed pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`

2023-01-04 Thread GitBox


zhengruifeng closed pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the 
missing ordering parameter in `Sort` and `sortWithinPartitions`
URL: https://github.com/apache/spark/pull/39398





[GitHub] [spark] EnricoMi commented on a diff in pull request #38356: [SPARK-40885] `Sort` may not take effect when it is the last 'Transform' operator

2023-01-04 Thread GitBox


EnricoMi commented on code in PR #38356:
URL: https://github.com/apache/spark/pull/38356#discussion_r1021126305


##
sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala:
##
@@ -220,6 +220,23 @@ class PartitionedWriteSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  test("SPARK-40885: V1 write uses the sort with partitionBy operator") {
+withTempPath { f =>
+  Seq((20, 30, "partition"), (15, 20, "partition"),
+(30, 70, "partition"), (18, 40, "partition"))
+.toDF("id", "sort_col", "p")
+.repartition(1)
+.sortWithinPartitions("p", "sort_col")
+.write
+.partitionBy("p")

Review Comment:
   > `partitionBy("p")` requires sorting the data by `p` per partition.
   
   Why does `write.partitionBy("p")` require sorting the data by `p` per 
partition? I understand why `df.groupBy($"p")` requires in-partition order by 
`p` (`GroupedIterator`). But `write.partitionBy("p")`?






[GitHub] [spark] EnricoMi commented on pull request #39131: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations

2023-01-04 Thread GitBox


EnricoMi commented on PR #39131:
URL: https://github.com/apache/spark/pull/39131#issuecomment-1371848352

   > @EnricoMi thanks for the fix! which spark version starts to have this bug?
   
   This was introduced in Spark 3.0.0.





[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

2023-01-04 Thread GitBox


MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062158983


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
 super
   .sparkConf
   .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+Seq(false, true).foreach { multiLine =>
+  val exception = intercept[SparkException] {
+spark.read
+  .format("csv")
+  .option("multiLine", multiLine)
+  .options(Map("header" -> "true", "mode" -> "failfast"))
+  .load(testFile(carsFile)).collect()
+  }
+
+  checkError(
+exception = exception.getCause.asInstanceOf[SparkException],
+errorClass = "_LEGACY_ERROR_TEMP_2177",

Review Comment:
   Could you explain why you added the test for this error class in this PR?



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
 super
   .sparkConf
   .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"

Review Comment:
   The same is defined in the parent class (just make it `protected`). Please remove it.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -319,15 +319,17 @@ class UnivocityParser(
   throw BadRecordException(
 () => getCurrentInput,
 () => None,
-QueryExecutionErrors.malformedCSVRecordError())
+QueryExecutionErrors.malformedCSVRecordError(""))
 }
 
+val currentInput = getCurrentInput

Review Comment:
   It is not used in regular cases, correct? I don't think we should introduce additional overhead. Please use `getCurrentInput` directly in the errors.



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
 super
   .sparkConf
   .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+Seq(false, true).foreach { multiLine =>
+  val exception = intercept[SparkException] {
+spark.read
+  .format("csv")
+  .option("multiLine", multiLine)
+  .options(Map("header" -> "true", "mode" -> "failfast"))
+  .load(testFile(carsFile)).collect()
+  }
+
+  checkError(
+exception = exception.getCause.asInstanceOf[SparkException],
+errorClass = "_LEGACY_ERROR_TEMP_2177",
+parameters = Map("failFastMode" -> "FAILFAST")
+  )
+}
+  }
 }
 
 class CSVv2Suite extends CSVSuite {
   override protected def sparkConf: SparkConf =
 super
   .sparkConf
   .set(SQLConf.USE_V1_SOURCE_LIST, "")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v2") {
+Seq(false, true).foreach { multiLine =>
+  val exception = intercept[SparkException] {
+spark.read
+  .format("csv")
+  .option("multiLine", multiLine)
+  .options(Map("header" -> "true", "mode" -> "failfast"))
+  .load(testFile(carsFile)).collect()
+  }
+
+  checkError(
+exception = exception.getCause.asInstanceOf[SparkException],
+errorClass = "_LEGACY_ERROR_TEMP_2064",

Review Comment:
   The same question as above: how is this related to assigning a name to `_LEGACY_ERROR_TEMP_2149`?






[GitHub] [spark] ulysses-you commented on pull request #38163: [SPARK-40711][SQL] Add spill size metrics for window

2023-01-04 Thread GitBox


ulysses-you commented on PR #38163:
URL: https://github.com/apache/spark/pull/38163#issuecomment-1371842097

   let me rebase this again





[GitHub] [spark] EnricoMi commented on a diff in pull request #39131: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations

2023-01-04 Thread GitBox


EnricoMi commented on code in PR #39131:
URL: https://github.com/apache/spark/pull/39131#discussion_r1062152718


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LeftSemiAntiJoinPushDownSuite.scala:
##
@@ -46,7 +46,7 @@ class LeftSemiPushdownSuite extends PlanTest {
   val testRelation1 = LocalRelation($"d".int)
   val testRelation2 = LocalRelation($"e".int)
 
-  test("Project: LeftSemiAnti join pushdown") {
+  test("Project: LeftSemi join pushdown") {

Review Comment:
   The term `LeftSemiAnti` is wrong and misleading for individual tests; I am correcting this while I am touching the file.






[GitHub] [spark] ulysses-you commented on a diff in pull request #39377: [SPARK-41867][SQL] Selective predicate should respect InMemoryRelation

2023-01-04 Thread GitBox


ulysses-you commented on code in PR #39377:
URL: https://github.com/apache/spark/pull/39377#discussion_r1062152134


##
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##
@@ -167,44 +167,10 @@ object PartitionPruning extends Rule[LogicalPlan] with 
PredicateHelper with Join
 }
 
 val estimatePruningSideSize = filterRatio * 
partPlan.stats.sizeInBytes.toFloat
-val overhead = calculatePlanOverhead(otherPlan)
+val overhead = collectLeafPlanOverhead(otherPlan).sum.toFloat
 estimatePruningSideSize > overhead
   }
 
-  /**
-   * Calculates a heuristic overhead of a logical plan. Normally it returns 
the total
-   * size in bytes of all scan relations. We don't count in-memory relation 
which uses
-   * only memory.
-   */
-  private def calculatePlanOverhead(plan: LogicalPlan): Float = {
-val (cached, notCached) = plan.collectLeaves().partition(p => p match {
-  case _: InMemoryRelation => true
-  case _ => false
-})
-val scanOverhead = notCached.map(_.stats.sizeInBytes).sum.toFloat
-val cachedOverhead = cached.map {
-  case m: InMemoryRelation if m.cacheBuilder.storageLevel.useDisk &&

Review Comment:
   This method has been moved into the new trait.






[GitHub] [spark] ulysses-you commented on a diff in pull request #39377: [SPARK-41867][SQL] Selective predicate should respect InMemoryRelation

2023-01-04 Thread GitBox


ulysses-you commented on code in PR #39377:
URL: https://github.com/apache/spark/pull/39377#discussion_r1062151959


##
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##
@@ -167,44 +167,10 @@ object PartitionPruning extends Rule[LogicalPlan] with 
PredicateHelper with Join
 }
 
 val estimatePruningSideSize = filterRatio * 
partPlan.stats.sizeInBytes.toFloat
-val overhead = calculatePlanOverhead(otherPlan)
+val overhead = collectLeafPlanOverhead(otherPlan).sum.toFloat
 estimatePruningSideSize > overhead
   }
 
-  /**
-   * Calculates a heuristic overhead of a logical plan. Normally it returns 
the total
-   * size in bytes of all scan relations. We don't count in-memory relation 
which uses
-   * only memory.
-   */
-  private def calculatePlanOverhead(plan: LogicalPlan): Float = {
-val (cached, notCached) = plan.collectLeaves().partition(p => p match {
-  case _: InMemoryRelation => true
-  case _ => false
-})
-val scanOverhead = notCached.map(_.stats.sizeInBytes).sum.toFloat
-val cachedOverhead = cached.map {
-  case m: InMemoryRelation if m.cacheBuilder.storageLevel.useDisk &&

Review Comment:
   The cached relation should be materialized, right? Otherwise its statistics would be the same as the normal plan's.






[GitHub] [spark] ulysses-you commented on pull request #39377: [SPARK-41867][SQL] Selective predicate should respect InMemoryRelation

2023-01-04 Thread GitBox


ulysses-you commented on PR #39377:
URL: https://github.com/apache/spark/pull/39377#issuecomment-1371839438

   @cloud-fan good suggestion. Following the current code path (a minimal sketch of the check follows below):
   1. Check whether the build side has a selective predicate. Now it always returns true if the cached relation is materialized.
   2. Check whether the statistics match the requirements. Now both DPP and the Runtime Filter use the same overhead-calculation method, which takes the cached relation into account.
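   A minimal, self-contained sketch of that two-step check (illustrative names only; the actual logic lives in `PartitionPruning` and uses plan statistics, as in the diff above):
   ```
   def shouldPrune(
       hasSelectivePredicate: Boolean,       // step 1: build side has a selective filter
       filterRatio: Double,
       pruningSideSizeInBytes: BigInt,
       leafOverheadsInBytes: Seq[BigInt]): Boolean = {
     // step 2: the estimated pruning benefit must exceed the summed leaf overhead,
     // which now also accounts for cached (InMemoryRelation) leaves
     val estimatedPruningSideSize = filterRatio * pruningSideSizeInBytes.toFloat
     val overhead = leafOverheadsInBytes.map(_.toFloat).sum
     hasSelectivePredicate && estimatedPruningSideSize > overhead
   }
   ```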





[GitHub] [spark] LuciferYang commented on pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


LuciferYang commented on PR #39385:
URL: https://github.com/apache/spark/pull/39385#issuecomment-1371825146

   Rebased (merged https://github.com/apache/spark/pull/39226). Please help review if you have time. Thanks @gengliangwang @dongjoon-hyun @techaddict 
   





[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


LuciferYang commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062138688


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1002,34 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+
+class SQLAppStatusListenerWithRocksDBBackendSuite extends 
SQLAppStatusListenerSuite {
+  private val storePath = Utils.createTempDir()

Review Comment:
   Still make `storePath` a class field and override `afterAll` to delete the base `storePath` (a sketch follows below).
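   A sketch of that suggestion (assuming Spark's test helpers `Utils.createTempDir`/`Utils.deleteRecursively` and the suite from the diff above; not the final code):
   ```
   class SQLAppStatusListenerWithRocksDBBackendSuite extends SQLAppStatusListenerSuite {
     // keep the store path as a class field so it can be cleaned up once
     private val storePath = Utils.createTempDir()

     override def afterAll(): Unit = {
       try {
         // delete the base storePath after all tests in the suite have run
         Utils.deleteRecursively(storePath)
       } finally {
         super.afterAll()
       }
     }
   }
   ```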






[GitHub] [spark] MaxGekk commented on a diff in pull request #39389: [SPARK-41574][SQL] Assign name to _LEGACY_ERROR_TEMP_2009

2023-01-04 Thread GitBox


MaxGekk commented on code in PR #39389:
URL: https://github.com/apache/spark/pull/39389#discussion_r1062135726


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##
@@ -378,10 +378,9 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase {
 )
   }
 
-  def dataTypeOperationUnsupportedError(): SparkUnsupportedOperationException 
= {
-new SparkUnsupportedOperationException(
-  errorClass = "_LEGACY_ERROR_TEMP_2009",
-  messageParameters = Map.empty)
+  def dataTypeOperationUnsupportedError(): Throwable = {
+SparkException.internalError(
+  s"""Operation dataType is not supported""")

Review Comment:
   ```suggestion
 "The operation `dataType` is not supported.")
   ```






[GitHub] [spark] MaxGekk closed pull request #39305: [SPARK-41580][SQL] Assign name to _LEGACY_ERROR_TEMP_2137

2023-01-04 Thread GitBox


MaxGekk closed pull request #39305: [SPARK-41580][SQL] Assign name to 
_LEGACY_ERROR_TEMP_2137
URL: https://github.com/apache/spark/pull/39305





[GitHub] [spark] MaxGekk commented on pull request #39305: [SPARK-41580][SQL] Assign name to _LEGACY_ERROR_TEMP_2137

2023-01-04 Thread GitBox


MaxGekk commented on PR #39305:
URL: https://github.com/apache/spark/pull/39305#issuecomment-1371814622

   +1, LGTM. Merging to master.
   Thank you, @itholic.





[GitHub] [spark] beliefer commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe

2023-01-04 Thread GitBox


beliefer commented on PR #39378:
URL: https://github.com/apache/spark/pull/39378#issuecomment-1371811698

   @HyukjinKwon @zhengruifeng Thank you!





[GitHub] [spark] MaxGekk closed pull request #39281: [SPARK-41576][SQL] Assign name to _LEGACY_ERROR_TEMP_2051

2023-01-04 Thread GitBox


MaxGekk closed pull request #39281: [SPARK-41576][SQL] Assign name to 
_LEGACY_ERROR_TEMP_2051
URL: https://github.com/apache/spark/pull/39281





[GitHub] [spark] MaxGekk commented on pull request #39281: [SPARK-41576][SQL] Assign name to _LEGACY_ERROR_TEMP_2051

2023-01-04 Thread GitBox


MaxGekk commented on PR #39281:
URL: https://github.com/apache/spark/pull/39281#issuecomment-1371809041

   +1, LGTM. Merging to master.
   Thank you, @itholic.





[GitHub] [spark] HyukjinKwon closed pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe

2023-01-04 Thread GitBox


HyukjinKwon closed pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc 
test for DataFrame.describe
URL: https://github.com/apache/spark/pull/39378





[GitHub] [spark] HyukjinKwon commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39378:
URL: https://github.com/apache/spark/pull/39378#issuecomment-1371806726

   Merged to master.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39400: [SPARK-41891][CONNECT][TESTS] Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls,

2023-01-04 Thread GitBox


HyukjinKwon commented on code in PR #39400:
URL: https://github.com/apache/spark/pull/39400#discussion_r1062125012


##
python/pyspark/sql/tests/connect/test_parity_functions.py:
##
@@ -68,30 +60,14 @@ def test_date_add_function(self):
 def test_date_sub_function(self):
 super().test_date_sub_function()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_dayofweek(self):
-super().test_dayofweek()
-
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_explode(self):
 super().test_explode()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_first_last_ignorenulls(self):
-super().test_first_last_ignorenulls()
-
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_function_parity(self):

Review Comment:
   Let's exclude this. It uses `SparkContext`, which Spark Connect doesn't support.






[GitHub] [spark] HyukjinKwon closed pull request #39397: [MINOR][CONNECT] Fix typos in connect/plan.py

2023-01-04 Thread GitBox


HyukjinKwon closed pull request #39397: [MINOR][CONNECT] Fix typos in 
connect/plan.py
URL: https://github.com/apache/spark/pull/39397





[GitHub] [spark] HyukjinKwon commented on pull request #39397: [MINOR][CONNECT] Fix typos in connect/plan.py

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39397:
URL: https://github.com/apache/spark/pull/39397#issuecomment-1371800977

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, float or int

2023-01-04 Thread GitBox


HyukjinKwon closed pull request #39393: [SPARK-41871][CONNECT] DataFrame hint 
parameter can be str, float or int
URL: https://github.com/apache/spark/pull/39393





[GitHub] [spark] HyukjinKwon commented on pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, float or int

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39393:
URL: https://github.com/apache/spark/pull/39393#issuecomment-1371800022

   Merged to master.





[GitHub] [spark] zhengruifeng commented on pull request #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`

2023-01-04 Thread GitBox


zhengruifeng commented on PR #39398:
URL: https://github.com/apache/spark/pull/39398#issuecomment-1371784933

   cc @HyukjinKwon 





[GitHub] [spark] dengziming commented on pull request #39388: [SPARK-41354][CONNECT][PYTHON] implement RepartitionByExpression

2023-01-04 Thread GitBox


dengziming commented on PR #39388:
URL: https://github.com/apache/spark/pull/39388#issuecomment-1371775675

   > just to confirm, the proto `RepartitionByExpression 
repartition_by_expression = 27` can support both
   > 
   > `def repartition(self, *cols: "ColumnOrName")` `def 
repartitionByRange(self, *cols: "ColumnOrName")`
   
   Yes, you are right @zhengruifeng. Both `repartition(cols)` and `repartitionByRange(cols)` will be transformed to `RepartitionByExpression`; this can be seen in `Dataset.scala` (see the sketch below).
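   For reference, a minimal sketch (assuming a running `SparkSession` named `spark`) showing that both column-based variants produce a `RepartitionByExpression` logical node:
   ```
   val df = spark.range(100).toDF("id")
   val byHash  = df.repartition(4, df("id"))         // hash partitioning on the expression
   val byRange = df.repartitionByRange(4, df("id"))  // range partitioning on the expression
   // both logical plans contain a RepartitionByExpression node;
   // only the partitioning (hash vs. range) differs
   println(byHash.queryExecution.logical)
   println(byRange.queryExecution.logical)
   ```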





[GitHub] [spark] cloud-fan commented on pull request #39395: [SQL] Use foldLeft for DeduplicateRelations

2023-01-04 Thread GitBox


cloud-fan commented on PR #39395:
URL: https://github.com/apache/spark/pull/39395#issuecomment-1371775038

   looks fine if tests pass





[GitHub] [spark] techaddict opened a new pull request, #39400: [SPARK-41891][CONNECT][TESTS] Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_funct

2023-01-04 Thread GitBox


techaddict opened a new pull request, #39400:
URL: https://github.com/apache/spark/pull/39400

   ### What changes were proposed in this pull request?
   Enabling tests in connect/test_parity_functions.py
   
   ### Why are the changes needed?
   Improved coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   New Tests





[GitHub] [spark] cloud-fan closed pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files

2023-01-04 Thread GitBox


cloud-fan closed pull request #36700: [SPARK-39318][SQL] Remove 
tpch-plan-stability WithStats golden files
URL: https://github.com/apache/spark/pull/36700





[GitHub] [spark] cloud-fan commented on pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files

2023-01-04 Thread GitBox


cloud-fan commented on PR #36700:
URL: https://github.com/apache/spark/pull/36700#issuecomment-1371774108

   thanks, merging to master!





[GitHub] [spark] cloud-fan commented on pull request #38163: [SPARK-40711][SQL] Add spill size metrics for window

2023-01-04 Thread GitBox


cloud-fan commented on PR #38163:
URL: https://github.com/apache/spark/pull/38163#issuecomment-1371773428

   LGTM if all tests pass





[GitHub] [spark] LuciferYang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`

2023-01-04 Thread GitBox


LuciferYang commented on PR #39226:
URL: https://github.com/apache/spark/pull/39226#issuecomment-1371772823

   Thanks @gengliangwang 





[GitHub] [spark] gengliangwang closed pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`

2023-01-04 Thread GitBox


gengliangwang closed pull request #39226: [SPARK-41694][CORE] Isolate RocksDB 
path for Live UI and automatically cleanup when `SparkContext.stop()`
URL: https://github.com/apache/spark/pull/39226





[GitHub] [spark] gengliangwang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`

2023-01-04 Thread GitBox


gengliangwang commented on PR #39226:
URL: https://github.com/apache/spark/pull/39226#issuecomment-1371770833

   Thanks, merging to master





[GitHub] [spark] LuciferYang opened a new pull request, #39399: [SPARK-41890][CORE][SQL][UI] Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread GitBox


LuciferYang opened a new pull request, #39399:
URL: https://github.com/apache/spark/pull/39399

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-01-04 Thread GitBox


shrprasa commented on PR #37880:
URL: https://github.com/apache/spark/pull/37880#issuecomment-1371749436

   @dongjoon-hyun @holdenk Can you please review this PR?





[GitHub] [spark] zhengruifeng opened a new pull request, #39398: [SPARK-41829][CONNECT][PYTHON] Add the missing ordering parameter in `Sort` and `sortWithinPartitions`

2023-01-04 Thread GitBox


zhengruifeng opened a new pull request, #39398:
URL: https://github.com/apache/spark/pull/39398

   ### What changes were proposed in this pull request?
   Add the missing ordering parameter in `Sort` and `sortWithinPartitions`
   
   
   ### Why are the changes needed?
   API coverage
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   
   ### How was this patch tested?
   enabled doctests
   





[GitHub] [spark] zhengruifeng commented on pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`

2023-01-04 Thread GitBox


zhengruifeng commented on PR #39396:
URL: https://github.com/apache/spark/pull/39396#issuecomment-1371745236

   thank you @HyukjinKwon for reviews





[GitHub] [spark] HyukjinKwon closed pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`

2023-01-04 Thread GitBox


HyukjinKwon closed pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable 
doctests related to `DataFrame.show`
URL: https://github.com/apache/spark/pull/39396





[GitHub] [spark] HyukjinKwon commented on pull request #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39396:
URL: https://github.com/apache/spark/pull/39396#issuecomment-1371739702

   Merged to master.





[GitHub] [spark] techaddict commented on pull request #39397: [MINOR] fix typos

2023-01-04 Thread GitBox


techaddict commented on PR #39397:
URL: https://github.com/apache/spark/pull/39397#issuecomment-1371734926

   cc: @HyukjinKwon 





[GitHub] [spark] techaddict opened a new pull request, #39397: [MINOR] fix typos

2023-01-04 Thread GitBox


techaddict opened a new pull request, #39397:
URL: https://github.com/apache/spark/pull/39397

   ### What changes were proposed in this pull request?
   Fixing typos in connect/plan.py
   
   
   ### Why are the changes needed?
   Typos
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   just fixing typos
   





[GitHub] [spark] itholic commented on a diff in pull request #39260: [SPARK-41579][SQL] Assign name to _LEGACY_ERROR_TEMP_1249

2023-01-04 Thread GitBox


itholic commented on code in PR #39260:
URL: https://github.com/apache/spark/pull/39260#discussion_r1062073949


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -2405,22 +2405,24 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
   messageParameters = Map.empty)
   }
 
-  def cmdOnlyWorksOnPartitionedTablesError(cmd: String, tableIdentWithDB: 
String): Throwable = {
+  def cmdOnlyWorksOnPartitionedTablesError(
+  operation: String,
+  tableIdentWithDB: String): Throwable = {
 new AnalysisException(
-  errorClass = "_LEGACY_ERROR_TEMP_1249",
+  errorClass = "NOT_A_PARTITIONED_TABLE",
   messageParameters = Map(
-"cmd" -> cmd,
+"operation" -> toSQLStmt(operation),
 "tableIdentWithDB" -> tableIdentWithDB))
   }
 
-  def cmdOnlyWorksOnTableWithLocationError(cmd: String, tableIdentWithDB: 
String): Throwable = {
+  def cmdOnlyWorksOnTableWithLocationError(

Review Comment:
   Oh... yeah, it seems like we should introduce a new error class.
   
   Just turned this error into `_LEGACY_ERROR_TEMP_2446`.






[GitHub] [spark] itholic commented on a diff in pull request #39282: [SPARK-41581][SQL] Assign name to _LEGACY_ERROR_TEMP_1230

2023-01-04 Thread GitBox


itholic commented on code in PR #39282:
URL: https://github.com/apache/spark/pull/39282#discussion_r1062069986


##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala:
##
@@ -680,6 +681,18 @@ class QueryCompilationErrorsSuite
   context = ExpectedContext("", "", 7, 13, "CAST(1)")
 )
   }
+
+  test("NEGATIVE_SCALE_NOT_ALLOWED: negative scale for Decimal is not 
allowed") {

Review Comment:
   Thanks! Moved it and migrated the existing test to `checkError`.






[GitHub] [spark] techaddict commented on a diff in pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread GitBox


techaddict commented on code in PR #39393:
URL: https://github.com/apache/spark/pull/39393#discussion_r1062062273


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -480,9 +480,10 @@ def to_jcols(
 
 def hint(self, name: str, *params: Any) -> "DataFrame":
 for param in params:
-if param is not None and not isinstance(param, (int, str)):
+if param is not None and not isinstance(param, (int, str, float, 
list)):

Review Comment:
   Awesome, I will create a JIRA to add `list` support later.






[GitHub] [spark] ulysses-you commented on a diff in pull request #38163: [SPARK-40711][SQL] Add spill size metrics for window

2023-01-04 Thread GitBox


ulysses-you commented on code in PR #38163:
URL: https://github.com/apache/spark/pull/38163#discussion_r1062060973


##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala:
##
@@ -337,6 +338,7 @@ case class WindowInPandasExec(
   if (!found) {
 // clear final partition
 buffer.clear()
+spillSize += buffer.spillSize

Review Comment:
   `ExternalAppendOnlyUnsafeRowArray` has supported reporting spill size since https://github.com/apache/spark/pull/34999.






[GitHub] [spark] HyukjinKwon commented on pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39393:
URL: https://github.com/apache/spark/pull/39393#issuecomment-1371708907

   yeah that's fine.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread GitBox


HyukjinKwon commented on code in PR #39393:
URL: https://github.com/apache/spark/pull/39393#discussion_r1062060563


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -480,9 +480,10 @@ def to_jcols(
 
 def hint(self, name: str, *params: Any) -> "DataFrame":
 for param in params:
-if param is not None and not isinstance(param, (int, str)):
+if param is not None and not isinstance(param, (int, str, float, 
list)):

Review Comment:
   Let's add `float` only for now then.






[GitHub] [spark] ulysses-you commented on a diff in pull request #39277: [SPARK-41708][SQL] Pull v1write information to `WriteFiles`

2023-01-04 Thread GitBox


ulysses-you commented on code in PR #39277:
URL: https://github.com/apache/spark/pull/39277#discussion_r1062052748


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/V1WritesHiveUtils.scala:
##
@@ -105,4 +112,164 @@ trait V1WritesHiveUtils {
   .map(_ => Map(BucketingUtils.optionForHiveCompatibleBucketWrite -> 
"true"))
   .getOrElse(Map.empty)
   }
+
+  def setupCompression(
+  fileSinkConf: FileSinkDesc,
+  hadoopConf: Configuration,
+  sparkSession: SparkSession): Unit = {
+val isCompressed =
+  
fileSinkConf.getTableInfo.getOutputFileFormatClassName.toLowerCase(Locale.ROOT) 
match {
+case formatName if formatName.endsWith("orcoutputformat") =>
+  // For ORC,"mapreduce.output.fileoutputformat.compress",
+  // "mapreduce.output.fileoutputformat.compress.codec", and
+  // "mapreduce.output.fileoutputformat.compress.type"
+  // have no impact because it uses table properties to store 
compression information.
+  false
+case _ => hadoopConf.get("hive.exec.compress.output", 
"false").toBoolean
+  }
+
+if (isCompressed) {
+  hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
+  fileSinkConf.setCompressed(true)
+  fileSinkConf.setCompressCodec(hadoopConf
+.get("mapreduce.output.fileoutputformat.compress.codec"))
+  fileSinkConf.setCompressType(hadoopConf
+.get("mapreduce.output.fileoutputformat.compress.type"))
+} else {
+  // Set compression by priority
+  HiveOptions.getHiveWriteCompression(fileSinkConf.getTableInfo, 
sparkSession.sessionState.conf)
+.foreach { case (compression, codec) => hadoopConf.set(compression, 
codec) }
+}
+  }
+
+  /**
+   * Return two paths:
+   * 1. The first path is `stagingDir` which can be the parent path of 
`externalTmpPath`
+   * 2. The second path is `externalTmpPath`, e.g. `$stagingDir/-ext-1`
+   * The call side should create `stagingDir` before using `externalTmpPath` 
and
+   * delete `stagingDir` at the end.

Review Comment:
   Wrapped it using `HiveTempPath`, since it would be used by 
`InsertIntoHiveDirCommand`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] techaddict commented on pull request #39393: [SPARK-41871][CONNECT] DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread GitBox


techaddict commented on PR #39393:
URL: https://github.com/apache/spark/pull/39393#issuecomment-1371689457

   @HyukjinKwon 
   After spending some time with this, it looks like the change is much bigger.
   The proto message `Hint` expects its parameters to be repeated literals:
   
https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/relations.proto#L698-L710
   
   Adding `list` to this would require more changes to the proto definition. I'm not 
super familiar with proto3, but it doesn't support extending a message, and we can't 
use a repeated oneof either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


LuciferYang commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062051944


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}

Review Comment:
   OK, let me fix it together after 
https://github.com/apache/spark/pull/39226 is merged
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


LuciferYang commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062051333


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala:
##
@@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession 
with BeforeAndAfter {
   }
 }
 
+class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite {
+  override protected def createStatusStore: SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+
+class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite {
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup

Review Comment:
   Thanks for your review and guidance @dongjoon-hyun. Let's wait for 
https://github.com/apache/spark/pull/39226; then I think we can clear the TODO 
in this PR
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


LuciferYang commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062050562


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  protected def createStatusStore(): SQLAppStatusStore = {

Review Comment:
   Thanks @gengliangwang 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39357: [SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper

2023-01-04 Thread GitBox


LuciferYang commented on PR #39357:
URL: https://github.com/apache/spark/pull/39357#issuecomment-1371679675

   Thanks @gengliangwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39391: [SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15

2023-01-04 Thread GitBox


LuciferYang commented on PR #39391:
URL: https://github.com/apache/spark/pull/39391#issuecomment-1371678208

   I have re-triggered the failed task


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`

2023-01-04 Thread GitBox


LuciferYang commented on PR #39226:
URL: https://github.com/apache/spark/pull/39226#issuecomment-1371675917

   @gengliangwang I have re-triggered the failed task; it should not be related to 
this PR ~
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files

2023-01-04 Thread GitBox


AngersZhuuuu commented on PR #36700:
URL: https://github.com/apache/spark/pull/36700#issuecomment-1371674024

   ping @cloud-fan @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #39383: [SPARK-41780][SQL] Should throw INVALID_PARAMETER_VALUE when the parameters `regexp` in regexp_replace is invalid

2023-01-04 Thread GitBox


panbingkun commented on code in PR #39383:
URL: https://github.com/apache/spark/pull/39383#discussion_r1062048545


##
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala:
##
@@ -663,4 +664,18 @@ class StringFunctionsSuite extends QueryTest with 
SharedSparkSession {
 start = 7,
 stop = 47))
   }
+
+  test("SPARK-41780: INVALID_PARAMETER_VALUE - invalid parameters `regexp` in 
regexp_replace") {

Review Comment:
   OK, let me use `checkErrorInExpression` instead of `checkExceptionInExpression` 
to check the exception.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


gengliangwang commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062043632


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala:
##
@@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession 
with BeforeAndAfter {
   }
 }
 
+class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite {
+  override protected def createStatusStore: SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+
+class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite {
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup

Review Comment:
   FYI I am going to merge https://github.com/apache/spark/pull/39226 right 
after the test passes. @LuciferYang please wait until 
https://github.com/apache/spark/pull/39226 is merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


gengliangwang commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062043632


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala:
##
@@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession 
with BeforeAndAfter {
   }
 }
 
+class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite {
+  override protected def createStatusStore: SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+
+class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite {
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup

Review Comment:
   FYI I am going to merge https://github.com/apache/spark/pull/39226 first. 
@LuciferYang please wait until https://github.com/apache/spark/pull/39226 is 
merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #39268: [SPARK-41752][SQL][UI] Group nested executions under the root execution

2023-01-04 Thread GitBox


gengliangwang commented on code in PR #39268:
URL: https://github.com/apache/spark/pull/39268#discussion_r1062033642


##
sql/core/src/test/scala/org/apache/spark/status/api/v1/sql/SqlResourceSuite.scala:
##
@@ -82,6 +82,7 @@ object SqlResourceSuite {
 
 new SQLExecutionUIData(
   executionId = 0,
+  rootExecutionId = 0,

Review Comment:
   For testing purposes, let's use a value different from `executionId`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #39268: [SPARK-41752][SQL][UI] Group nested executions under the root execution

2023-01-04 Thread GitBox


gengliangwang commented on code in PR #39268:
URL: https://github.com/apache/spark/pull/39268#discussion_r1062032939


##
sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala:
##
@@ -74,6 +74,7 @@ class SQLExecutionUIDataSerializer extends ProtobufSerDe {
 
 new SQLExecutionUIData(
   executionId = ui.getExecutionId,
+  rootExecutionId = ui.getExecutionId,

Review Comment:
   This should be ui.getRootExecutionId after updating the protobuf definition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #39268: [SPARK-41752][SQL][UI] Group nested executions under the root execution

2023-01-04 Thread GitBox


gengliangwang commented on code in PR #39268:
URL: https://github.com/apache/spark/pull/39268#discussion_r1062032402


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala:
##
@@ -43,6 +43,8 @@ case class SparkListenerSQLAdaptiveSQLMetricUpdates(
 @DeveloperApi
 case class SparkListenerSQLExecutionStart(
 executionId: Long,
+// if the execution is a root, then rootExecutionId == executionId
+rootExecutionId: Long,

Review Comment:
   We need to refactor the code change in 
https://github.com/apache/spark/blob/master/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto#L387



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #39226: [SPARK-41694][CORE] Isolate RocksDB path for Live UI and automatically cleanup when `SparkContext.stop()`

2023-01-04 Thread GitBox


gengliangwang commented on PR #39226:
URL: https://github.com/apache/spark/pull/39226#issuecomment-1371644321

   @LuciferYang The failed test doesn't seem related. Just to double confirm, 
could you retrigger it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng opened a new pull request, #39396: [SPARK-41825][CONNECT][PYTHON] Enable doctests related to `DataFrame.show`

2023-01-04 Thread GitBox


zhengruifeng opened a new pull request, #39396:
URL: https://github.com/apache/spark/pull/39396

   ### What changes were proposed in this pull request?
   enable a group of doctests
   
   
   ### Why are the changes needed?
   for test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   enabled tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #39277: [SPARK-41708][SQL] Pull v1write information to `WriteFiles`

2023-01-04 Thread GitBox


ulysses-you commented on code in PR #39277:
URL: https://github.com/apache/spark/pull/39277#discussion_r1062020277


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala:
##
@@ -53,13 +59,17 @@ case class WriteFiles(child: LogicalPlan) extends UnaryNode 
{
 /**
  * Responsible for writing files.
  */
-case class WriteFilesExec(child: SparkPlan) extends UnaryExecNode {
+case class WriteFilesExec(
+child: SparkPlan,
+fileFormat: FileFormat,
+partitionColumns: Seq[Attribute],
+bucketSpec: Option[BucketSpec],
+options: Map[String, String],
+staticPartitions: TablePartitionSpec) extends UnaryExecNode {
   override def output: Seq[Attribute] = Seq.empty
 
-  override protected def doExecuteWrite(writeSpec: WriteSpec): 
RDD[WriterCommitMessage] = {
-assert(writeSpec.isInstanceOf[WriteFilesSpec])
-val writeFilesSpec: WriteFilesSpec = writeSpec.asInstanceOf[WriteFilesSpec]
-
+  override protected def doExecuteWrite(
+  writeFilesSpec: WriteFilesSpec): RDD[WriterCommitMessage] = {

Review Comment:
   Seems it's a bit hard. Look at the current information:
   ```scala
   case class WriteFilesSpec(
   description: WriteJobDescription,
   committer: FileCommitProtocol,
   concurrentOutputWriterSpecFunc: SparkPlan => 
Option[ConcurrentOutputWriterSpec])
   ```
   - `ConcurrentOutputWriterSpec` and `FileCommitProtocol` contain the output 
spec, so we cannot replace them
   - `WriteJobDescription` contains a lot of information, including what we 
pull out here, but if we want to remove something from `WriteJobDescription`, we 
would need a new class to hold the rest. I'm not sure it's worth doing that.
   
   ```scala
   class WriteJobDescription(
   val uuid: String,
   val serializableHadoopConf: SerializableConfiguration,
   val outputWriterFactory: OutputWriterFactory,
   val allColumns: Seq[Attribute],
   val dataColumns: Seq[Attribute],
   val partitionColumns: Seq[Attribute],
   val bucketSpec: Option[WriterBucketSpec],
   val path: String,
   val customPartitionLocations: Map[TablePartitionSpec, String],
   val maxRecordsPerFile: Long,
   val timeZoneId: String,
   val statsTrackers: Seq[WriteJobStatsTracker])
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #39388: [SPARK-41354][CONNECT][PYTHON] implement RepartitionByExpression

2023-01-04 Thread GitBox


zhengruifeng commented on PR #39388:
URL: https://github.com/apache/spark/pull/39388#issuecomment-1371624171

   Just to confirm: the proto `RepartitionByExpression 
repartition_by_expression = 27` can support both 
   
   `def repartition(self, *cols: "ColumnOrName")`
   `def repartitionByRange(self, *cols: "ColumnOrName")`
   
   right? @dengziming 
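   
   For reference, a minimal sketch (written against the regular PySpark API, purely 
illustrative) of the two client calls that would have to map onto that one plan 
message:
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   df = spark.range(10)
   
   df.repartition("id")          # hash partitioning by the given columns
   df.repartitionByRange("id")   # range partitioning by the given columns
   ```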


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe

2023-01-04 Thread GitBox


zhengruifeng commented on PR #39378:
URL: https://github.com/apache/spark/pull/39378#issuecomment-1371622584

   > Shall we fix `TODO(SPARK-41821): Fix DataFrame.describe` below? You can 
remove:
   > 
   > ```
   > # TODO(SPARK-41821): Fix DataFrame.describe
   > del pyspark.sql.connect.dataframe.DataFrame.describe.__doc__
   > ```
   
   +1, please also enable the doctest


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


dongjoon-hyun commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062012515


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+class SQLAppStatusListenerWithRocksDBBackendSuite extends 
SQLAppStatusListenerSuite {
+
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup

Review Comment:
   Use a new real JIRA ID with `IDed TODO` style.



##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+class SQLAppStatusListenerWithRocksDBBackendSuite extends 
SQLAppStatusListenerSuite {
+
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup
+  private var storePath: File = _
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+storePath = Utils.createTempDir()
+conf.set(LIVE_UI_LOCAL_STORE_DIR, storePath.getCanonicalPath)
+val appStatusStore = AppStatusStore.createLiveStore(conf)
+kvstore = appStatusStore.store.asInstanceOf[ElementTrackingStore]
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+
+  // TODO: SPARK-41882 remove this method after RocksDB can automatically 
cleanup

Review Comment:
   ditto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


dongjoon-hyun commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062012227


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}

Review Comment:
   New line is needed after this.



##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala:
##
@@ -1007,6 +1004,36 @@ class SQLAppStatusListenerSuite extends 
SharedSparkSession with JsonTestUtils
   }
 }
 
+class SQLAppStatusListenerWithInMemoryStoreSuite extends 
SQLAppStatusListenerSuite {
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+class SQLAppStatusListenerWithRocksDBBackendSuite extends 
SQLAppStatusListenerSuite {
+

Review Comment:
   Remove empty line here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39385: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

2023-01-04 Thread GitBox


dongjoon-hyun commented on code in PR #39385:
URL: https://github.com/apache/spark/pull/39385#discussion_r1062012012


##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala:
##
@@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession 
with BeforeAndAfter {
   }
 }
 
+class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite {
+  override protected def createStatusStore: SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+
+class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite {
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup

Review Comment:
   Instead of this style, please create a new real JIRA ID and point it by 
using `IDed TODO` style, `TODO(SPARK-12345)`.



##
sql/core/src/test/scala/org/apache/spark/sql/execution/ui/AllExecutionsPageSuite.scala:
##
@@ -178,3 +177,35 @@ class AllExecutionsPageSuite extends SharedSparkSession 
with BeforeAndAfter {
   }
 }
 
+class AllExecutionsPageWithInMemoryStoreSuite extends AllExecutionsPageSuite {
+  override protected def createStatusStore: SQLAppStatusStore = {
+val conf = sparkContext.conf
+kvstore = new ElementTrackingStore(new InMemoryStore, conf)
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+}
+
+class AllExecutionsPageWithRocksDBBackendSuite extends AllExecutionsPageSuite {
+  // TODO: SPARK-41882 remove this field after RocksDB can automatically 
cleanup
+  private var storePath: File = _
+
+  override protected def createStatusStore(): SQLAppStatusStore = {
+val conf = sparkContext.conf
+storePath = Utils.createTempDir()
+conf.set(LIVE_UI_LOCAL_STORE_DIR, storePath.getCanonicalPath)
+val appStatusStore = AppStatusStore.createLiveStore(conf)
+kvstore = appStatusStore.store.asInstanceOf[ElementTrackingStore]
+val listener = new SQLAppStatusListener(conf, kvstore, live = true)
+new SQLAppStatusStore(kvstore, Some(listener))
+  }
+
+  // TODO: SPARK-41882 remove this method after RocksDB can automatically 
cleanup

Review Comment:
   ditto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

2023-01-04 Thread GitBox


xkrogen commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1371607231

   Merged into latest master to resolve conflicts. @allisonwang-db or 
@cloud-fan , any thoughts/comments on the latest diff? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #39378: [SPARK-41821][CONNECT][PYTHON] Fix doc test for DataFrame.describe

2023-01-04 Thread GitBox


beliefer commented on PR #39378:
URL: https://github.com/apache/spark/pull/39378#issuecomment-1371606490

   ping @zhengruifeng cc @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tedyu commented on pull request #39395: [SQL] Use foldLeft for DeduplicateRelations

2023-01-04 Thread GitBox


tedyu commented on PR #39395:
URL: https://github.com/apache/spark/pull/39395#issuecomment-1371583867

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tedyu opened a new pull request, #39395: [SQL] Use foldLeft for DeduplicateRelations

2023-01-04 Thread GitBox


tedyu opened a new pull request, #39395:
URL: https://github.com/apache/spark/pull/39395

   ### What changes were proposed in this pull request?
   This PR uses `foldLeft` in `DeduplicateRelations` for better performance.
   
   ### Why are the changes needed?
   `foldRight` is generally less performant than `foldLeft` on Scala collections, 
since on a `List` it typically has to reverse the collection first (or recurse, 
risking deep call stacks).
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing test suite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rithwik-db commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


rithwik-db commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061992443


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061991762


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] rithwik-db commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


rithwik-db commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061991704


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file

Review Comment:
   Yeah, documentation will need to constantly evolve for this project.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rithwik-db commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


rithwik-db commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061991285


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] rithwik-db commented on a diff in pull request #39146: [WIP][SPARK-41589][PYTHON][ML] PyTorch Distributor Baseline API Changes

2023-01-04 Thread GitBox


rithwik-db commented on code in PR #39146:
URL: https://github.com/apache/spark/pull/39146#discussion_r1061990731


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,287 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import math
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+class Distributor:
+def __init__(
+self,
+num_processes: int = 1,
+local_mode: bool = True,
+use_gpu: bool = True,
+spark: Optional[SparkSession] = None,
+):
+self.num_processes = num_processes
+self.local_mode = local_mode
+self.use_gpu = use_gpu
+if spark:
+self.spark = spark
+else:
+self.spark = SparkSession.builder.getOrCreate()
+self.sc = self.spark.sparkContext
+self.num_tasks = self._get_num_tasks()
+self.ssl_conf = None
+
+def _get_num_tasks(self) -> int:
+"""
+Returns the number of Spark tasks to use for distributed training
+
+Returns
+---
+The number of Spark tasks to use for distributed training
+"""
+if self.use_gpu:
+key = "spark.task.resource.gpu.amount"
+if self.sc.getConf().contains(key):
+if gpu_amount_raw := self.sc.getConf().get(key):  # mypy error??
+task_gpu_amount = int(gpu_amount_raw)
+else:
+task_gpu_amount = 1  # for single node clusters
+if task_gpu_amount < 1:
+raise ValueError(
+f"The Spark conf `{key}` has a value "
+f"of {task_gpu_amount} but it "
+"should not have a value less than 1."
+)
+return math.ceil(self.num_processes / task_gpu_amount)
+return self.num_processes
+
+def _validate_input_params(self) -> None:
+if self.num_processes <= 0:
+raise ValueError("num_proccesses has to be a positive integer")
+
+def _check_encryption(self) -> None:
+"""Checks to see if the user requires encrpytion of data.
+If required, throw an exception since we don't support that.
+
+Raises
+--
+NotImplementedError
+Thrown when the user doesn't use PyTorchDistributor
+Exception
+Thrown when the user requires ssl encryption
+"""
+if not "ssl_conf":
+raise Exception(
+"Distributor doesn't have this functionality. Use 
PyTorchDistributor instead."
+)
+is_ssl_enabled = get_conf_boolean(self.sc, "spark.ssl.enabled", 
"false")
+ignore_ssl = get_conf_boolean(self.sc, self.ssl_conf, "false")  # 
type: ignore
+  
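   (The quoted method is cut off after the two conf lookups. A hedged sketch of how the check
   would plausibly finish — the condition and the message below are assumptions for illustration,
   not taken from the PR:)

   def check_encryption_sketch(is_ssl_enabled: bool, ignore_ssl: bool) -> None:
       # Raise when Spark-side encryption is requested and not explicitly ignored.
       if is_ssl_enabled and not ignore_ssl:
           raise Exception(
               "Spark data encryption (spark.ssl.enabled=true) is not supported "
               "by this distributor."
           )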

[GitHub] [spark] github-actions[bot] commented on pull request #36700: [SPARK-39318][SQL] Remove tpch-plan-stability WithStats golden files

2023-01-04 Thread GitBox


github-actions[bot] commented on PR #36700:
URL: https://github.com/apache/spark/pull/36700#issuecomment-1371573628

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #36052: [SPARK-38777][YARN] Add `bin/spark-submit --kill / --status` support for yarn

2023-01-04 Thread GitBox


github-actions[bot] commented on PR #36052:
URL: https://github.com/apache/spark/pull/36052#issuecomment-1371573647

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #37899: [SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException

2023-01-04 Thread GitBox


github-actions[bot] commented on PR #37899:
URL: https://github.com/apache/spark/pull/37899#issuecomment-1371573567

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061987077


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061987077


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061985290


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061983056


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061979910


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, 
*args: Any) -> list[str]:
+"""Returns the expected torchrun command based on the input.
+
+Parameters
+--
+input_params : dict[str, Any]
+The dictionary of the input parameters of the distributor. The most 
relevant params
+are local_mode and num_processes.
+train_path : str
+The path to the (potentially autogenerated) train.py file
+args: *args
+The input arguments to the train.py file.
+
+Returns
+---
+str
+The output torchrun command
+"""
+local_mode = input_params["local_mode"]
+num_processes = input_params["num_processes"]
+
+if local_mode:
+standalone = ["--standalone", "--nnodes=1"]
+processes_per_node = num_processes
+else:
+master_addr, master_port = os.environ["MASTER_ADDR"], 
os.environ["MASTER_PORT"]
+node_rank = os.environ["RANK"]
+standalone = [
+f"--nnodes={num_processes}",
+f"--node_rank={node_rank}",
+f"--rdzv_endpoint={master_addr}:{master_port}",
+"--rdzv_id=0",
+]  # TODO: setup random ID that is gleaned from env variables
+processes_per_node = 1
+
+args_string = list(map(str, args))  # converting all args to strings
+
+return (
+["torchrun"]
++ standalone
+

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39188: [WIP][SPARK-41591][PYTHON][ML] Training PyTorch Files on Single Node Multi GPU

2023-01-04 Thread GitBox


lu-wang-dl commented on code in PR #39188:
URL: https://github.com/apache/spark/pull/39188#discussion_r1061979807


##
python/pyspark/ml/torch/distributor.py:
##
@@ -0,0 +1,491 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import collections
+import ctypes
+import math
+import os
+import random
+import re
+import signal
+import sys
+import subprocess
+import time
+from typing import Union, Callable, Optional, Any
+import warnings
+
+from pyspark.sql import SparkSession
+from pyspark.context import SparkContext
+
+
+# Moved the util functions to this file for now
+# TODO(SPARK-41589): will move the functions and tests to an external file
+#   once we are in agreement about which functions should be in utils.py
+def get_conf_boolean(sc: SparkContext, key: str, default_value: str) -> bool:
+"""Get the conf "key" from the given spark context,
+or return the default value if the conf is not set.
+This expects the conf value to be a boolean or string;
+if the value is a string, this checks for all capitalization
+patterns of "true" and "false" to match Scala.
+
+Parameters
+--
+sc : SparkContext
+The SparkContext for the distributor.
+key : str
+string for conf name
+default_value : str
+default value for the conf value for the given key
+
+Returns
+---
+bool
+Returns the boolean value that corresponds to the conf
+
+Raises
+--
+Exception
+Thrown when the conf value is not a boolean
+"""
+val = sc.getConf().get(key, default_value)
+lowercase_val = val.lower()
+if lowercase_val == "true":
+return True
+if lowercase_val == "false":
+return False
+raise Exception(
+"get_conf_boolean expected a boolean conf "
+"value but found value of type {} "
+"with value: {}".format(type(val), val)
+)
+
+
+def get_gpus_owned(addresses: list[str]) -> list[str]:
+"""
+Gets the number of GPUs that Spark scheduled to the calling task.
+Returns:
+The number of GPUs that Spark scheduled to the calling task.
+"""
+CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"
+pattern = re.compile("^[1-9][0-9]*|0$")
+if any(not pattern.match(address) for address in addresses):
+raise ValueError(
+f"Found GPU addresses {addresses} which "
+"are not all in the correct format "
+"for CUDA_VISIBLE_DEVICES, which requires "
+"integers with no zero padding."
+)
+if CUDA_VISIBLE_DEVICES in os.environ:
+gpu_indices = list(map(int, addresses))
+gpu_list = os.environ[CUDA_VISIBLE_DEVICES].split(",")
+gpu_owned = [gpu_list[i] for i in gpu_indices]
+return gpu_owned
+return addresses
+
+
+def create_torchrun_command(input_params: dict[str, Any], train_path: str, *args: Any) -> list[str]:

Review Comment:
   Mark these as internal function: `_create_torchrun_command`?
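
   (For context on the suggestion: a single leading underscore is the usual Python convention for
   marking a module-level helper as internal. A minimal illustrative sketch, with the body elided:)

   def _create_torchrun_command(input_params, train_path, *args):
       # Same logic as the helper above; the leading underscore signals "internal use only"
       # and keeps the name out of `from ... import *`.
       ...

   # Callers inside the module would use it unchanged:
   # cmd = _create_torchrun_command(self.input_params, train_path, *args)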



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #39382: [SPARK-41878][CONNECT][TESTS] pyspark.sql.tests.test_dataframe - Add JIRAs or messages for skipped messages

2023-01-04 Thread GitBox


HyukjinKwon closed pull request #39382: [SPARK-41878][CONNECT][TESTS] 
pyspark.sql.tests.test_dataframe - Add JIRAs or messages for skipped messages
URL: https://github.com/apache/spark/pull/39382


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #39382: [SPARK-41878][CONNECT][TESTS] pyspark.sql.tests.test_dataframe - Add JIRAs or messages for skipped messages

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39382:
URL: https://github.com/apache/spark/pull/39382#issuecomment-1371556222

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #39386: [SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make `DataFrame.collect` handle None/NaN/Array/Binary porperly

2023-01-04 Thread GitBox


HyukjinKwon closed pull request #39386: 
[SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make 
`DataFrame.collect` handle None/NaN/Array/Binary porperly
URL: https://github.com/apache/spark/pull/39386


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #39386: [SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make `DataFrame.collect` handle None/NaN/Array/Binary porperly

2023-01-04 Thread GitBox


HyukjinKwon commented on PR #39386:
URL: https://github.com/apache/spark/pull/39386#issuecomment-1371555616

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


