Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]

2024-04-22 Thread via GitHub


cloud-fan commented on code in PR #45234:
URL: https://github.com/apache/spark/pull/45234#discussion_r1575677485


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -51,13 +51,30 @@ abstract class QueryStageExec extends LeafExecNode {
*/
   val plan: SparkPlan
 
+  /**
+   * Name of this query stage which is unique in the entire query plan.
+   */
+  val name: String = s"${this.getClass.getSimpleName}-$id"
+
+  /**
+   * This flag indicates whether materialization of this stage has started. It
+   * helps to avoid unnecessary stage materialization when the stage is canceled.
+   */
+  private val materializationStarted = new AtomicBoolean()

Review Comment:
   Sorry for the last-minute proposal, but I'm wondering if it would be more efficient to push this cancellation optimization into the shuffle and broadcast nodes.

   It looks a bit fragile to operate on the `shuffleFuture` directly in `ShuffleQueryStageExec.cancel`. I think we should let `ShuffleExchangeLike` provide clear APIs for it. Today it provides `submitShuffleJob`, and it should also provide `cancelShuffleJob`.

   Within `ShuffleExchangeLike`, we can do more optimizations. For example, even if we cancel the shuffle stage after it has been submitted, we can still avoid submitting the shuffle job, because the shuffle node might be doing other preparation work: generating the shuffle dependency, waiting for subqueries to finish, etc. It's most efficient to check the `isCanceled` flag at the last moment, right before submitting the shuffle job.
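   A rough sketch of the API shape being proposed (the `cancelShuffleJob` name comes from the comment above; the field names and method bodies here are hypothetical, not Spark's actual implementation):

   ```scala
   trait ShuffleExchangeLike {
     // Set when cancelShuffleJob() is called; checked right before submission.
     private val isCanceled = new java.util.concurrent.atomic.AtomicBoolean(false)

     @volatile private var shuffleJob: Option[FutureAction[MapOutputStatistics]] = None

     // Subclass-specific job submission.
     protected def submitJob(): FutureAction[MapOutputStatistics]

     // Do the (possibly slow) preparation work eagerly, but check the
     // cancellation flag at the last moment, right before submitting the job.
     final def submitShuffleJob(): Unit = {
       // ... generate shuffle dependency, wait for subqueries, etc. ...
       if (!isCanceled.get()) {
         shuffleJob = Some(submitJob())
       }
     }

     final def cancelShuffleJob(): Unit = {
       isCanceled.set(true)
       shuffleJob.foreach(j => if (!j.isCompleted) j.cancel())
     }
   }
   ```

   With this shape, `ShuffleQueryStageExec.cancel` would just delegate to `cancelShuffleJob` instead of touching the future directly.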



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47943] Add `GitHub Action` CI for Java Build and Test [spark-kubernetes-operator]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #7:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/7#issuecomment-2071455107

   Merged to main.





Re: [PR] [SPARK-47943] Add `GitHub Action` CI for Java Build and Test [spark-kubernetes-operator]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #7: [SPARK-47943] Add `GitHub Action` CI for 
Java Build and Test
URL: https://github.com/apache/spark-kubernetes-operator/pull/7





Re: [PR] [SPARK-47929] Setup Static Analysis for Operator [spark-kubernetes-operator]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #6: [SPARK-47929] Setup Static Analysis for 
Operator
URL: https://github.com/apache/spark-kubernetes-operator/pull/6





Re: [PR] [SPARK-47945][SQL] MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests [spark]

2024-04-22 Thread via GitHub


yaooqinn commented on code in PR #46173:
URL: https://github.com/apache/spark/pull/46173#discussion_r1575669023


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala:
##
@@ -265,7 +273,7 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
 assert(types(7).equals("class java.lang.String"))
 assert(types(8).equals("class [B"))
 assert(row.getString(0).length == 10)
-assert(row.getString(0).trim.equals("the"))
+assert(row.getString(0).equals("the".padTo(10, ' ')))

Review Comment:
   Anyway, I removed this LOC






Re: [PR] [SPARK-47945][SQL] MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests [spark]

2024-04-22 Thread via GitHub


yaooqinn commented on code in PR #46173:
URL: https://github.com/apache/spark/pull/46173#discussion_r1575667729


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala:
##
@@ -265,7 +273,7 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
 assert(types(7).equals("class java.lang.String"))
 assert(types(8).equals("class [B"))
 assert(row.getString(0).length == 10)
-assert(row.getString(0).trim.equals("the"))
+assert(row.getString(0).equals("the".padTo(10, ' ')))

Review Comment:
   CHAR padding is unrelated to ANSI mode, AFAIK
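   For reference, `padTo` here just right-pads the string with spaces up to the requested length, mirroring what the server does for a `CHAR(10)` column:

   ```scala
   // "the" has 3 characters, so padTo appends 7 trailing spaces.
   val padded = "the".padTo(10, ' ')
   assert(padded == "the       ")
   assert(padded.length == 10)
   ```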






Re: [PR] [SPARK-47945][SQL] MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests [spark]

2024-04-22 Thread via GitHub


yaooqinn commented on code in PR #46173:
URL: https://github.com/apache/spark/pull/46173#discussion_r1575666204


##
docs/sql-data-sources-jdbc.md:
##
@@ -1441,3 +1441,192 @@ The Spark Catalyst data types below are not supported 
with suitable Oracle types
 - NullType
 - ObjectType
 - VariantType
+
+### Mapping Spark SQL Data Types from Microsoft SQL Server
+
+The table below describes the data type conversions from Microsoft SQL Server
+data types to Spark SQL data types when reading data from a Microsoft SQL Server
+table using the built-in JDBC data source with mssql-jdbc as the active JDBC driver.
+
+| SQL Server Data Type | Spark SQL Data Type | Remarks |
+|----------------------|---------------------|---------|
+| bit | BooleanType | |
+| tinyint | ShortType | |
+| smallint | ShortType | |
+| int | IntegerType | |
+| bigint | LongType | |
+| float(p), real | FloatType | 1 ≤ p ≤ 24 |
+| float[(p)] | DoubleType | 25 ≤ p ≤ 53 |
+| double precision | DoubleType | |
+| smallmoney | DecimalType(10, 4) | |
+| money | DecimalType(19, 4) | |
+| decimal[(p[, s])], numeric[(p[, s])] | DecimalType(p, s) | |
+| date | DateType | |
+| datetime | TimestampType | (default) preferTimestampNTZ=false or spark.sql.timestampType=TIMESTAMP_LTZ |
+| datetime | TimestampNTZType | preferTimestampNTZ=true or spark.sql.timestampType=TIMESTAMP_NTZ |
+| datetime2 [ (fractional seconds precision) ] | TimestampType | (default) preferTimestampNTZ=false or spark.sql.timestampType=TIMESTAMP_LTZ |
+| datetime2 [ (fractional seconds precision) ] | TimestampNTZType | preferTimestampNTZ=true or spark.sql.timestampType=TIMESTAMP_NTZ |
+| datetimeoffset [ (fractional seconds precision) ] | StringType | |

Review Comment:
   
https://github.com/apache/spark/blob/9d715ba491710969340d9e8a49a21d11f51ef7d3/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala#L112-L114
   
   This comment appears to be inaccurate, since we use mssql-jdbc






Re: [PR] [SPARK-47730][K8S] Support `APP_ID` and `EXECUTOR_ID` placeholders in labels [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46149:
URL: https://github.com/apache/spark/pull/46149#issuecomment-2071446978

   Gentle ping, @jshmchenxi .





Re: [PR] [SPARK-47604][CORE] Resource managers: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on PR #46130:
URL: https://github.com/apache/spark/pull/46130#issuecomment-2071445935

   @panbingkun Thanks. Please resolve the conflicts so that I can merge this 
one later.





Re: [PR] [SPARK-47945][SQL] MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46173:
URL: https://github.com/apache/spark/pull/46173#discussion_r1575663710


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala:
##
@@ -265,7 +273,7 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
 assert(types(7).equals("class java.lang.String"))
 assert(types(8).equals("class [B"))
 assert(row.getString(0).length == 10)
-assert(row.getString(0).trim.equals("the"))
+assert(row.getString(0).equals("the".padTo(10, ' ')))

Review Comment:
   This seems to be the ANSI result. Is this safe in the non-ANSI CI, @yaooqinn?
   - https://github.com/apache/spark/actions/workflows/build_non_ansi.yml






Re: [PR] [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's version and clean up exclusions [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46160:
URL: https://github.com/apache/spark/pull/46160#issuecomment-2071442761

   Merged to master. Thank you, @pan3793 and all!





Re: [PR] [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's version and clean up exclusions [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46160: [SPARK-44170][BUILD][FOLLOWUP] Align 
JUnit5 dependency's version and clean up exclusions
URL: https://github.com/apache/spark/pull/46160





Re: [PR] [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46164:
URL: https://github.com/apache/spark/pull/46164#issuecomment-2071441869

   Merged to master.





Re: [PR] [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error [spark]

2024-04-22 Thread via GitHub


yaooqinn commented on code in PR #46164:
URL: https://github.com/apache/spark/pull/46164#discussion_r1575661507


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala:
##
@@ -437,4 +437,13 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   .load()
 assert(df.collect().toSet === expectedResult)
   }
+
+

Review Comment:
   @dongjoon-hyun Thank you






Re: [PR] [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46164: [SPARK-47938][SQL] MsSQLServer: 
Cannot find data type BYTE error
URL: https://github.com/apache/spark/pull/46164





Re: [PR] [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46164:
URL: https://github.com/apache/spark/pull/46164#discussion_r1575660470


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala:
##
@@ -437,4 +437,13 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   .load()
 assert(df.collect().toSet === expectedResult)
   }
+
+

Review Comment:
   nit. Redundant empty line.






Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang closed pull request #46151: [SPARK-47600][CORE] MLLib: Migrate 
logInfo with variables to structured logging framework
URL: https://github.com/apache/spark/pull/46151





Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on PR #46151:
URL: https://github.com/apache/spark/pull/46151#issuecomment-2071433391

   Thanks, merging to master





Re: [PR] [SPARK-47933][PYTHON] Parent Column class for Spark Connect and Spark Classic [spark]

2024-04-22 Thread via GitHub


HyukjinKwon closed pull request #46155: [SPARK-47933][PYTHON] Parent Column 
class for Spark Connect and Spark Classic
URL: https://github.com/apache/spark/pull/46155





Re: [PR] [SPARK-47933][PYTHON] Parent Column class for Spark Connect and Spark Classic [spark]

2024-04-22 Thread via GitHub


HyukjinKwon commented on PR #46155:
URL: https://github.com/apache/spark/pull/46155#issuecomment-2071421918

   Merged to master.
   
   For naming, I will send an email soon.
   For other points, please leave comments; I will follow up.





[PR] [SPARK-47943] Add Operator CI Task for Java Build and Test [spark-kubernetes-operator]

2024-04-22 Thread via GitHub


jiangzho opened a new pull request, #7:
URL: https://github.com/apache/spark-kubernetes-operator/pull/7

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR adds an additional CI build task for the operator.
   
   ### Why are the changes needed?
   
   
   The additional CI task is needed in order to build and test Java code for upcoming operator pull requests.

   When the Java plugin is enabled and Java sources are checked in, the `./gradlew build` [task](https://docs.gradle.org/3.3/userguide/java_plugin.html#sec:java_tasks) by default includes a set of tasks that compile the code and run the tests. This can serve as the pull request build.
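   A minimal workflow along these lines might look like the following (file name, JDK version, and action versions are illustrative, not necessarily what this PR uses):

   ```yaml
   # .github/workflows/build.yml (illustrative sketch)
   name: Java Build and Test

   on:
     pull_request:
       branches: [main]

   jobs:
     build:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4
         - uses: actions/setup-java@v4
           with:
             distribution: temurin
             java-version: 17
         # `build` runs compile, test, check, and assemble by default.
         - run: ./gradlew build
   ```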
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   no
   
   
   ### How was this patch tested?
   
   
   tested locally.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   no





[PR] [SPARK-47945][SQL] MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests [spark]

2024-04-22 Thread via GitHub


yaooqinn opened a new pull request, #46173:
URL: https://github.com/apache/spark/pull/46173

   
   
   
   
   ### What changes were proposed in this pull request?
   
   
   This PR documents the mapping of Spark SQL data types from Microsoft SQL Server and adds more tests.
   ### Why are the changes needed?
   
   doc improvement
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   new test and doc build
   
![image](https://github.com/apache/spark/assets/8326978/29ad3814-9284-45a3-aefb-6f0fb8f26593)
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no





Re: [PR] [SPARK-47938][SQL] MsSQLServer: Cannot find data type BYTE error [spark]

2024-04-22 Thread via GitHub


yaooqinn commented on PR #46164:
URL: https://github.com/apache/spark/pull/46164#issuecomment-2071365832

   cc @dongjoon-hyun @cloud-fan, thank you





Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]

2024-04-22 Thread via GitHub


cloud-fan commented on code in PR #45234:
URL: https://github.com/apache/spark/pull/45234#discussion_r1575603485


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -198,18 +215,23 @@ case class ShuffleQueryStageExec(
 reuse
   }
 
-  override def cancel(): Unit = shuffleFuture match {
-case action: FutureAction[MapOutputStatistics] if !action.isCompleted =>
-  action.cancel()
-case _ =>
+  override def cancel(): Unit = {

Review Comment:
   nit: it seems better to follow the `def materialize`/`def doMaterialize` pattern and have `def cancel` and `def doCancel`. We can put this materialization check in `def cancel`, and perhaps more common cancellation-related logic in the future.
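   As a sketch, the suggested template-method split could look like this (the `doCancel` name follows the comment above; the bodies are illustrative only):

   ```scala
   abstract class QueryStageExec {
     // True once materialization has started (the flag discussed earlier).
     protected def isMaterializationStarted: Boolean

     // Common cancellation logic lives here; subclasses only implement doCancel.
     final def cancel(): Unit = {
       if (isMaterializationStarted) {
         doCancel()
       }
     }

     protected def doCancel(): Unit
   }
   ```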






Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]

2024-04-22 Thread via GitHub


beliefer commented on PR #45644:
URL: https://github.com/apache/spark/pull/45644#issuecomment-2071347859

   I don't think we should add HiveDialect as a built-in dialect. Users can add a custom dialect with https://github.com/apache/spark/pull/45626.
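   For context, registering a custom dialect goes through Spark's public `JdbcDialect` API; a minimal sketch (the Hive-specific overrides shown are illustrative, not a complete dialect):

   ```scala
   import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

   // A bare-bones dialect that claims Hive JDBC URLs. A real implementation
   // would also override type mappings and other dialect hooks.
   object HiveDialect extends JdbcDialect {
     override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")

     // Hive quotes identifiers with backticks.
     override def quoteIdentifier(colName: String): String = s"`$colName`"
   }

   // Make the dialect visible to the built-in JDBC data source.
   JdbcDialects.registerDialect(HiveDialect)
   ```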





Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-22 Thread via GitHub


HeartSaVioR commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1575597023


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateTTLSuite.scala:
##
@@ -0,0 +1,308 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.time.Duration
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.execution.streaming.{MapStateImplWithTTL, 
MemoryStream}
+import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.streaming.util.StreamManualClock
+
+class MapStateSingleKeyTTLProcessor(ttlConfig: TTLConfig)

Review Comment:
   Ah OK, you are passing this through the base tests.






Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-22 Thread via GitHub


ericm-db commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1575592659


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateTTLSuite.scala:
##
@@ -0,0 +1,308 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.time.Duration
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.execution.streaming.{MapStateImplWithTTL, 
MemoryStream}
+import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.streaming.util.StreamManualClock
+
+class MapStateSingleKeyTTLProcessor(ttlConfig: TTLConfig)
+  extends StatefulProcessor[String, InputEvent, OutputEvent]
+with Logging {
+
+  @transient private var _mapState: MapStateImplWithTTL[String, Int] = _
+
+  override def init(
+  outputMode: OutputMode,
+  timeMode: TimeMode): Unit = {
+_mapState = getHandle
+  .getMapState("mapState", Encoders.STRING, Encoders.scalaInt, ttlConfig)
+  .asInstanceOf[MapStateImplWithTTL[String, Int]]
+  }
+  override def handleInputRows(
+key: String,
+inputRows: Iterator[InputEvent],
+timerValues: TimerValues,
+expiredTimerInfo: ExpiredTimerInfo): Iterator[OutputEvent] = {
+var results = List[OutputEvent]()
+
+for (row <- inputRows) {
+  val resultIter = processRow(row, _mapState)
+  resultIter.foreach { r =>
+results = r :: results
+  }
+}
+
+results.iterator
+  }
+
+  def processRow(
+  row: InputEvent,
+  mapState: MapStateImplWithTTL[String, Int]): Iterator[OutputEvent] = {
+var results = List[OutputEvent]()
+val key = row.key
+val userKey = "key"
+if (row.action == "get") {
+  if (mapState.containsKey(userKey)) {
+results = OutputEvent(key, mapState.getValue(userKey), isTTLValue = 
false, -1) :: results
+  }
+} else if (row.action == "get_without_enforcing_ttl") {
+  val currState = mapState.getWithoutEnforcingTTL(userKey)
+  if (currState.isDefined) {
+results = OutputEvent(key, currState.get, isTTLValue = false, -1) :: 
results
+  }
+} else if (row.action == "get_ttl_value_from_state") {
+  val ttlValue = mapState.getTTLValue(userKey)
+  if (ttlValue.isDefined) {
+val value = ttlValue.get._1
+val ttlExpiration = ttlValue.get._2
+results = OutputEvent(key, value, isTTLValue = true, ttlExpiration) :: 
results
+  }
+} else if (row.action == "put") {
+  mapState.updateValue(userKey, row.value)
+} else if (row.action == "get_values_in_ttl_state") {
+  val ttlValues = mapState.getKeyValuesInTTLState()
+  ttlValues.foreach { v =>
+results = OutputEvent(key, -1, isTTLValue = true, ttlValue = v._2) :: 
results
+  }
+}
+
+results.iterator
+  }
+}
+
+case class MapInputEvent(
+key: String,
+userKey: String,
+action: String,
+value: Int)
+
+case class MapOutputEvent(
+key: String,
+userKey: String,
+value: Int,
+isTTLValue: Boolean,
+ttlValue: Long)
+
+class MapStateTTLProcessor(ttlConfig: TTLConfig)
+  extends StatefulProcessor[String, MapInputEvent, MapOutputEvent]
+with Logging {
+
+  @transient private var _mapState: MapStateImplWithTTL[String, Int] = _
+
+  override def init(
+  outputMode: OutputMode,
+  timeMode: TimeMode): Unit = {
+_mapState = getHandle
+  .getMapState("mapState", Encoders.STRING, Encoders.scalaInt, ttlConfig)
+  .asInstanceOf[MapStateImplWithTTL[String, Int]]
+  }
+
+  override def handleInputRows(
+  key: String,
+  inputRows: Iterator[MapInputEvent],
+  timerValues: TimerValues,
+  expiredTimerInfo: ExpiredTimerInfo): Iterator[MapOutputEvent] = {
+var results = List[MapOutputEvent]()
+
+for (row <- inputRows) {
+  val resultIter = processRow(row, _mapState)
+  resultIter.foreach { r =>
+results = r :: results
+  }
+}
+
+results.iterator
+  }
+
+  def processRow(
+  row: MapInputEvent,
+  mapState: MapStateImplWithTTL[String, Int]): 

Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-22 Thread via GitHub


ericm-db commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1575586347


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateTTLSuite.scala:
##
@@ -0,0 +1,308 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.time.Duration
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.execution.streaming.{MapStateImplWithTTL, MemoryStream}
+import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.streaming.util.StreamManualClock
+
+class MapStateSingleKeyTTLProcessor(ttlConfig: TTLConfig)

Review Comment:
   We need this processor since it has different input/output types than the 
other processor in this file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47944][BUILD] Upgrade icu4j to 75.1 [spark]

2024-04-22 Thread via GitHub


LuciferYang commented on PR #46172:
URL: https://github.com/apache/spark/pull/46172#issuecomment-2071319910

   The ICU-related features appear to still be under development. Although I am 
not aware of the criteria for selecting the initial version (72.1), I believe it 
would be advisable to wait until the related features are stable and have 
undergone benchmark testing (if performance is affected) before attempting to 
upgrade the version.
   
   cc @cloud-fan 





Re: [PR] [SPARK-47936][BUILD] Improve `toUpperCase` & `toLowerCase` with `Locale.ROOT` rules [spark]

2024-04-22 Thread via GitHub


LuciferYang commented on code in PR #46147:
URL: https://github.com/apache/spark/pull/46147#discussion_r1575579520


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -206,7 +210,7 @@ public static StringSearch getStringSearch(
* Returns if the given collationName is valid one.
*/
   public static boolean isValidCollation(String collationName) {
-return collationNameToIdMap.containsKey(collationName.toUpperCase());
+return collationNameToIdMap.containsKey(collationName.toUpperCase(Locale.ROOT));

Review Comment:
   `toUpperCase(Locale.ROOT)` uses the root locale, which is independent of the 
machine's default locale, ensuring consistent results across different 
machines. I think it's OK.
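
   As a standalone illustration (not Spark code), the pitfall of locale-sensitive 
casing can be shown directly with the JDK; the identifier string below is just 
an example:

```java
import java.util.Locale;

public class LocaleCasing {
    public static void main(String[] args) {
        // Locale.ROOT gives the same result on every machine.
        String stable = "utf8_binary".toUpperCase(Locale.ROOT);
        System.out.println(stable); // UTF8_BINARY

        // Under a Turkish locale, 'i' upper-cases to dotted capital I (U+0130),
        // so a map lookup keyed by "UTF8_BINARY" would silently miss.
        String turkish = "utf8_binary".toUpperCase(new Locale("tr"));
        System.out.println(turkish.equals(stable)); // false
    }
}
```

   This is why a case-insensitive identifier lookup should normalize with 
`Locale.ROOT` rather than the JVM's default locale.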






Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-22 Thread via GitHub


HeartSaVioR commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1575569366


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateTTLSuite.scala:
##
@@ -0,0 +1,308 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.time.Duration
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.execution.streaming.{MapStateImplWithTTL, MemoryStream}
+import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.streaming.util.StreamManualClock
+
+class MapStateSingleKeyTTLProcessor(ttlConfig: TTLConfig)

Review Comment:
   Maybe redundant to have this separately?



##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateTTLSuite.scala:
##
@@ -0,0 +1,308 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.time.Duration
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.execution.streaming.{MapStateImplWithTTL, MemoryStream}
+import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.streaming.util.StreamManualClock
+
+class MapStateSingleKeyTTLProcessor(ttlConfig: TTLConfig)
+  extends StatefulProcessor[String, InputEvent, OutputEvent]
+with Logging {

Review Comment:
   nit: we don't use this, right?



##
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithMapStateTTLSuite.scala:
##
@@ -0,0 +1,308 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.time.Duration
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.execution.streaming.{MapStateImplWithTTL, MemoryStream}
+import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.streaming.util.StreamManualClock
+
+class MapStateSingleKeyTTLProcessor(ttlConfig: TTLConfig)
+  extends StatefulProcessor[String, InputEvent, OutputEvent]
+with Logging {
+
+  @transient private var _mapState: MapStateImplWithTTL[String, Int] = _
+
+  override def init(
+  outputMode: OutputMode,
+  timeMode: TimeMode): Unit = {
+_mapState = getHandle
+  .getMapState("mapState", Encoders.STRING, Encoders.scalaInt, ttlConfig)
+  

Re: [PR] [SPARK-44305][SQL] Dynamically choose whether to broadcast hadoop conf [spark]

2024-04-22 Thread via GitHub


7mming7 commented on PR #46162:
URL: https://github.com/apache/spark/pull/46162#issuecomment-2071298400

   cc @HyukjinKwon @mridulm 





Re: [PR] [SPARK-47933][PYTHON] Parent Column class for Spark Connect and Spark Classic [spark]

2024-04-22 Thread via GitHub


HyukjinKwon commented on code in PR #46155:
URL: https://github.com/apache/spark/pull/46155#discussion_r1575562214


##
python/pyspark/ml/connect/functions.py:
##
@@ -15,7 +15,7 @@
 # limitations under the License.
 #
 from pyspark.ml import functions as PyMLFunctions
-from pyspark.sql.connect.column import Column
+from pyspark.sql.column import Column

Review Comment:
   It is actually related to the type hints. We're using `pyspark.sql.Column` 
for type hints, and when we actually need to create an instance, we should use 
`from pyspark.sql.classic.column import Column` or `from 
pyspark.sql.connect.column import Column`. It is also a bit messy; I should 
probably clean those up separately.






Re: [PR] [SPARK-47933][PYTHON] Parent Column class for Spark Connect and Spark Classic [spark]

2024-04-22 Thread via GitHub


HyukjinKwon commented on code in PR #46155:
URL: https://github.com/apache/spark/pull/46155#discussion_r1575561525


##
python/pyspark/ml/stat.py:
##
@@ -22,7 +22,8 @@
 from pyspark.ml.common import _java2py, _py2java
 from pyspark.ml.linalg import Matrix, Vector
 from pyspark.ml.wrapper import JavaWrapper, _jvm
-from pyspark.sql.column import Column, _to_seq
+from pyspark.sql.column import Column

Review Comment:
   This file is only used in Spark Classic, so it should probably be fine. It is 
a bit messy and complicated; I will try to refactor them separately.






Re: [PR] [SPARK-47847][CORE] Deprecate spark.network.remoteReadNioBufferConversion [spark]

2024-04-22 Thread via GitHub


pan3793 commented on code in PR #46047:
URL: https://github.com/apache/spark/pull/46047#discussion_r1575558648


##
core/src/main/scala/org/apache/spark/SparkConf.scala:
##
@@ -640,7 +640,8 @@ private[spark] object SparkConf extends Logging {
   DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0",
 "Please use spark.excludeOnFailure.killExcludedExecutors"),
   DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", "3.1.0",
-"Please use spark.yarn.executor.launch.excludeOnFailure.enabled")
+"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"),
+  DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2", "")

Review Comment:
   I filled the deprecation message with "Not used anymore", to be consistent 
with the existing items
   ```
   DeprecatedConfig("spark.yarn.am.port", "2.0.0", "Not used anymore"),
   DeprecatedConfig("spark.executor.port", "2.0.0", "Not used anymore"),
   ...
   ```






[PR] [SPARK-47944][BUILD] Upgrade icu4j to 75.1 [spark]

2024-04-22 Thread via GitHub


panbingkun opened a new pull request, #46172:
URL: https://github.com/apache/spark/pull/46172

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


zhengruifeng commented on PR #46163:
URL: https://github.com/apache/spark/pull/46163#issuecomment-2071275104

   thank you @dongjoon-hyun and @HyukjinKwon 





Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-22 Thread via GitHub


HeartSaVioR commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1575549218


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala:
##
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.streaming
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.execution.streaming.TransformWithStateKeyValueRowSchema.{COMPOSITE_KEY_ROW_SCHEMA, VALUE_ROW_SCHEMA_WITH_TTL}
+import org.apache.spark.sql.execution.streaming.state.{PrefixKeyScanStateEncoderSpec, StateStore, StateStoreErrors}
+import org.apache.spark.sql.streaming.{MapState, TTLConfig}
+import org.apache.spark.util.NextIterator
+
+/**
+ * Class that provides a concrete implementation for map state associated with 
state
+ * variables (with ttl expiration support) used in the streaming 
transformWithState operator.
+ * @param store - reference to the StateStore instance to be used for storing 
state
+ * @param stateName  - name of the state variable
+ * @param keyExprEnc - Spark SQL encoder for key
+ * @param userKeyEnc  - Spark SQL encoder for the map key
+ * @param valEncoder - SQL encoder for state variable
+ * @param ttlConfig  - the ttl configuration (time to live duration etc.)
+ * @param batchTimestampMs - current batch processing timestamp.
+ * @tparam K - type of key for map state variable
+ * @tparam V - type of value for map state variable
+ * @return - instance of MapState of type [K,V] that can be used to store 
state persistently
+ */
+class MapStateImplWithTTL[K, V](
+store: StateStore,
+stateName: String,
+keyExprEnc: ExpressionEncoder[Any],
+userKeyEnc: Encoder[K],
+valEncoder: Encoder[V],
+ttlConfig: TTLConfig,
+batchTimestampMs: Long) extends CompositeKeyTTLStateImpl(stateName, store, batchTimestampMs)
+  with MapState[K, V] with Logging {
+
+  private val keySerializer = keyExprEnc.createSerializer()
+  private val stateTypesEncoder = new CompositeKeyStateEncoder(
+keySerializer, userKeyEnc, valEncoder, COMPOSITE_KEY_ROW_SCHEMA, stateName, hasTtl = true)
+
+  private val ttlExpirationMs =
+StateTTL.calculateExpirationTimeForDuration(ttlConfig.ttlDuration, batchTimestampMs)
+
+  createColumnFamily()
+
+  private def createColumnFamily(): Unit = {
+store.createColFamilyIfAbsent(stateName, COMPOSITE_KEY_ROW_SCHEMA, VALUE_ROW_SCHEMA_WITH_TTL,
+  PrefixKeyScanStateEncoderSpec(COMPOSITE_KEY_ROW_SCHEMA, 1))
+  }
+
+  /** Whether state exists or not. */
+  override def exists(): Boolean = {
+iterator().nonEmpty
+  }
+
+  /** Get the state value if it exists */
+  override def getValue(key: K): V = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key)
+val retRow = store.get(encodedCompositeKey, stateName)
+
+if (retRow != null) {
+  val resState = stateTypesEncoder.decodeValue(retRow)
+
+  if (!stateTypesEncoder.isExpired(retRow, batchTimestampMs)) {
+resState
+  } else {
+null.asInstanceOf[V]
+  }
+} else {
+  null.asInstanceOf[V]
+}
+  }
+
+  /** Check if the user key is contained in the map */
+  override def containsKey(key: K): Boolean = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+getValue(key) != null
+  }
+
+  /** Update value for given user key */
+  override def updateValue(key: K, value: V): Unit = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+StateStoreErrors.requireNonNullStateValue(value, stateName)
+
+val encodedValue = stateTypesEncoder.encodeValue(value, ttlExpirationMs)
+val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key)
+store.put(encodedCompositeKey, encodedValue, stateName)
+
+val serializedGroupingKey = stateTypesEncoder.serializeGroupingKey()
+val serializedUserKey = stateTypesEncoder.serializeUserKey(key)
+upsertTTLForStateKey(ttlExpirationMs, serializedGroupingKey, serializedUserKey)
+  }
+
+  /** Get the map associated 

Re: [PR] [SPARK-46632][SQL] Fix subexpression elimination when equivalent ternary expressions have different children [spark]

2024-04-22 Thread via GitHub


zml1206 commented on PR #46135:
URL: https://github.com/apache/spark/pull/46135#issuecomment-2071270386

   cc @cloud-fan Do you have time to help take a look? Thank you.





Re: [PR] [SPARK-47928][SQL][TEST] Speed up test "Add jar support Ivy URI in SQL" [spark]

2024-04-22 Thread via GitHub


yaooqinn commented on PR #46150:
URL: https://github.com/apache/spark/pull/46150#issuecomment-2071269848

   Hi @dongjoon-hyun, the tests in HiveVersionsSuite might cover the Hive 
dependency





Re: [PR] [SPARK-47912][SQL] Infer serde class from format classes [spark]

2024-04-22 Thread via GitHub


wForget commented on PR #46132:
URL: https://github.com/apache/spark/pull/46132#issuecomment-2071268489

   > Does Structured Streaming support writing stream processing logic in SQL, 
similar to Flink's SQL syntax, so that SQL alone can express the stream 
processing pipeline?
   
   This doesn't seem relevant to the current PR; you can get more help from 
https://spark.apache.org/community.html.





Re: [PR] [MINOR] Remove unnecessary `imports` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46161:
URL: https://github.com/apache/spark/pull/46161#issuecomment-2071267699

   Thanks.
   Merged to master.





Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46163: [SPARK-47937][PYTHON][DOCS] Fix 
docstring of `hll_sketch_agg`
URL: https://github.com/apache/spark/pull/46163





Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46163:
URL: https://github.com/apache/spark/pull/46163#issuecomment-2071267535

   Merged to master.





Re: [PR] [MINOR] Remove unnecessary `imports` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46161: [MINOR] Remove unnecessary `imports`
URL: https://github.com/apache/spark/pull/46161





Re: [PR] [SPARK-47909][CONNECT][PYTHON][TESTS][FOLLOW-UP] Move `pyspark.classic` references [spark]

2024-04-22 Thread via GitHub


HyukjinKwon closed pull request #46171: 
[SPARK-47909][CONNECT][PYTHON][TESTS][FOLLOW-UP] Move `pyspark.classic` 
references
URL: https://github.com/apache/spark/pull/46171





Re: [PR] [SPARK-47909][CONNECT][PYTHON][TESTS][FOLLOW-UP] Move `pyspark.classic` references [spark]

2024-04-22 Thread via GitHub


HyukjinKwon commented on PR #46171:
URL: https://github.com/apache/spark/pull/46171#issuecomment-2071260842

   Merged to master.





Re: [PR] [SPARK-44050][K8S]add retry config when creating Kubernetes resources. [spark]

2024-04-22 Thread via GitHub


liangyouze commented on PR #45911:
URL: https://github.com/apache/spark/pull/45911#issuecomment-2071259936

   > Thank you for making a PR, but I'm not sure this is the right layer to do 
this. For me, it sounds like you are hitting a K8s cluster issue or a K8s 
client library issue. Could you elaborate on your environment and the error 
message, @liangyouze ?
   > 
   > > When creating Kubernetes resources, we occasionally encounter situations 
where resources such as ConfigMap cannot be successfully created, resulting in 
the driver pod remaining in the 'ContainerCreating' state. Therefore, it is 
necessary to add a verification mechanism after creating other resources to 
ensure that the resources are actually created
   
   It's the same as described in SPARK-44050; I've encountered the same issue. 
When creating resources such as ConfigMaps, occasionally this situation occurs: 
the code does not throw any exceptions, but the ConfigMap resource is not 
actually created, causing the driver pod to remain in a ContainerCreating state, 
unable to proceed to the next step. This may be a Kubernetes issue, or a 
feature (as far as I know, Kubernetes has some rate-limiting policies that may 
cause certain requests to be dropped, but I'm not sure if that's related), but 
in any case, Spark should not get stuck because of this.
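
   The retry-plus-verification idea described above can be sketched generically; 
the helper below is purely illustrative (the names and the verification hook 
are assumptions, not the PR's actual implementation):

```java
import java.util.function.Supplier;

public class CreateWithVerify {
    // Run `create`, then check `verify`; repeat the whole sequence if the
    // resource is not observable yet. Returns whether verification succeeded.
    static boolean createWithRetry(Runnable create, Supplier<Boolean> verify,
                                   int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            create.run();          // e.g. POST the ConfigMap to the API server
            if (verify.get()) {    // e.g. GET it back by name
                return true;
            }
            // a real implementation would back off (sleep) between attempts
        }
        return false;              // caller decides whether to fail submission
    }

    public static void main(String[] args) {
        // Simulate a create call that is silently dropped on the first attempt.
        int[] calls = {0};
        boolean ok = createWithRetry(
            () -> calls[0]++,
            () -> calls[0] >= 2,   // "resource visible" only after second create
            3);
        System.out.println(ok);        // true
        System.out.println(calls[0]);  // 2
    }
}
```

   The key design point is verifying the resource after creation instead of 
trusting a non-throwing create call, since a dropped request surfaces no error.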
   





Re: [PR] [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT [spark]

2024-04-22 Thread via GitHub


pan3793 commented on PR #46167:
URL: https://github.com/apache/spark/pull/46167#issuecomment-2071252279

   thank you, @dongjoon-hyun 





Re: [PR] [SPARK-47018][BUILD] Upgrade built-in Hive to 2.3.10 [spark]

2024-04-22 Thread via GitHub


pan3793 commented on code in PR #45372:
URL: https://github.com/apache/spark/pull/45372#discussion_r1575536345


##
dev/test-dependencies.sh:
##
@@ -49,7 +49,7 @@ OLD_VERSION=$($MVN -q \
 --non-recursive \
 org.codehaus.mojo:exec-maven-plugin:1.6.0:exec | grep -E '[0-9]+\.[0-9]+\.[0-9]+')
 # dependency:get for guava and jetty-io are workaround for SPARK-37302.
-GUAVA_VERSION=$(build/mvn help:evaluate -Dexpression=guava.version -q -DforceStdout | grep -E "^[0-9.]+$")
+GUAVA_VERSION=$(build/mvn help:evaluate -Dexpression=guava.version -q -DforceStdout | grep -E "^[0-9\.]+")

Review Comment:
   Recent Guava versions carry a `-jre` or `-android` suffix, so this change is 
required to match the new version format
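
   A quick sketch of what the relaxed anchor buys (the version strings are 
illustrative; `java.util.regex` stands in for `grep -E` here):

```java
import java.util.regex.Pattern;

public class GuavaVersionRegex {
    public static void main(String[] args) {
        // Old pattern: anchored at both ends, so a "-jre" suffix never matches.
        Pattern anchored = Pattern.compile("^[0-9.]+$");
        System.out.println(anchored.matcher("14.0.1").find());     // true
        System.out.println(anchored.matcher("33.1.0-jre").find()); // false

        // New pattern: anchored only at the start, so the numeric prefix of
        // "33.1.0-jre" matches (and grep -E would print the whole line).
        Pattern prefix = Pattern.compile("^[0-9\\.]+");
        System.out.println(prefix.matcher("33.1.0-jre").find());   // true
    }
}
```

   Dropping the trailing `$` is what lets suffixed versions through, while the 
leading `^` still rejects lines that don't begin with a numeric version.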






Re: [PR] [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's version and clean up exclusions [spark]

2024-04-22 Thread via GitHub


pan3793 commented on code in PR #46160:
URL: https://github.com/apache/spark/pull/46160#discussion_r1575535040


##
pom.xml:
##
@@ -220,6 +220,9 @@
 4.1.109.Final
 2.0.65.Final
 72.1
+5.9.1

Review Comment:
   @dongjoon-hyun sorry, that's a typo, corrected






Re: [PR] [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's version and clean up exclusions [spark]

2024-04-22 Thread via GitHub


pan3793 commented on code in PR #46160:
URL: https://github.com/apache/spark/pull/46160#discussion_r1575534643


##
pom.xml:
##
@@ -220,6 +220,9 @@
 4.1.109.Final
 2.0.65.Final
 72.1
+5.9.1

Review Comment:
   ```suggestion
   5.9.3
   ```






Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


zhengruifeng commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575493828


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -3804,12 +3804,14 @@ def sha2(col: "ColumnOrName", numBits: int) -> Column:
 sha2.__doc__ = pysparkfuncs.sha2.__doc__
 
 
-def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, 
Column]] = None) -> Column:
+def hll_sketch_agg(
+col: "ColumnOrName",
+lgConfigK: Optional[Union[int, Column]] = None,
+) -> Column:

Review Comment:
   this line is kind of long (on my screen), so I broke it into multiple lines:
   
   https://github.com/apache/spark/assets/7322292/86029dd9-9939-4b14-8df2-fc9c89feffdb
   






[PR] [SPARK-47909][CONNECT][PYTHON][TESTS][FOLLOW-UP] Move `pyspark.classic` references [spark]

2024-04-22 Thread via GitHub


HyukjinKwon opened a new pull request, #46171:
URL: https://github.com/apache/spark/pull/46171

   ### What changes were proposed in this pull request?
   
   This PR is a follow-up of https://github.com/apache/spark/pull/46129 that moves `pyspark.classic` references into the actual test methods so they are not referenced during the `pyspark-connect`-only test (which does not have the `pyspark.classic` package).
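   The idea can be sketched like this (an illustrative sketch, not the actual Spark test code): a module-level reference to an optional package fails at import time and breaks test collection, while a reference inside the test method defers the lookup until the test runs:

   ```python
   import importlib.util
   import unittest

   def optional_spec(name):
       """Find a module spec without failing when a parent package is absent."""
       try:
           return importlib.util.find_spec(name)
       except ModuleNotFoundError:
           return None

   class ClassicReferenceTest(unittest.TestCase):
       # A module-level `from pyspark.classic import ...` would fail as soon as
       # this file is loaded in a pyspark-connect-only environment.

       def test_uses_classic(self):
           # Referencing the optional package inside the method defers the
           # lookup until the test runs, so the test can skip cleanly instead.
           if optional_spec("pyspark.classic") is None:
               self.skipTest("pyspark.classic is not available")
   ```
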
   
   ### Why are the changes needed?
   
   To recover the CI: 
https://github.com/apache/spark/actions/runs/8789489804/job/24119356874
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Manually
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-47604][CORE] Resource managers: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


panbingkun commented on code in PR #46130:
URL: https://github.com/apache/spark/pull/46130#discussion_r1575532352


##
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##
@@ -464,7 +465,7 @@ private[spark] class ApplicationMaster(
   val dummyRunner = new ExecutorRunnable(None, yarnConf, _sparkConf, 
driverUrl, "",
 "", executorMemory, executorCores, appId, securityMgr, 
localResources,
 ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
-  dummyRunner.launchContextDebugInfo()
+  log"${MDC(LogKey.LAUNCH_CONTEXT_DEBUG_INFO, 
dummyRunner.launchContextDebugInfo())}"

Review Comment:
   Done.






Re: [PR] [SPARK-44305][SQL] Dynamically choose whether to broadcast hadoop conf [spark]

2024-04-22 Thread via GitHub


7mming7 commented on PR #46162:
URL: https://github.com/apache/spark/pull/46162#issuecomment-2071214561

   > Could you provide some concrete numbers about your claim?
   @dongjoon-hyun Yes, we found that for small queries under high concurrency the overhead is more noticeable. We ran SSB at 50 concurrent queries with Parquet as the source data and observed a 10%-13% performance difference.
   
![image](https://github.com/apache/spark/assets/5662745/fe505244-d654-4f7a-9b66-f9c020d9b6be)
   





Re: [PR] [CONNECT] Use v1 as spark connect go library starting version [spark-connect-go]

2024-04-22 Thread via GitHub


viirya commented on PR #19:
URL: https://github.com/apache/spark-connect-go/pull/19#issuecomment-2071185132

   Does this Spark Connect Go library have any previous release yet?





Re: [PR] [MINOR] Remove unnecessary `imports` [spark]

2024-04-22 Thread via GitHub


panbingkun commented on PR #46161:
URL: https://github.com/apache/spark/pull/46161#issuecomment-2071176182

   > To @panbingkun , please don't use `[WIP]` in the PR title. GitHub's `Draft` feature is better to prevent merging.
   > 
   > It seems that you frequently forget to remove `[WIP]` from the PR title. :)
   
   Yeah, I was in a hurry to have `breakfast` just now, and I forgot to remove 
it. Sorry, haha ❤️





Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


zhengruifeng commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575494685


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -3804,12 +3804,14 @@ def sha2(col: "ColumnOrName", numBits: int) -> Column:
 sha2.__doc__ = pysparkfuncs.sha2.__doc__
 
 
-def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, 
Column]] = None) -> Column:
+def hll_sketch_agg(
+col: "ColumnOrName",
+lgConfigK: Optional[Union[int, Column]] = None,
+) -> Column:
 if lgConfigK is None:
 return _invoke_function_over_columns("hll_sketch_agg", col)
 else:
-_lgConfigK = lit(lgConfigK) if isinstance(lgConfigK, int) else 
lgConfigK

Review Comment:
   I added this change to the PR description









Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


zhengruifeng commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575492977


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -3804,12 +3804,14 @@ def sha2(col: "ColumnOrName", numBits: int) -> Column:
 sha2.__doc__ = pysparkfuncs.sha2.__doc__
 
 
-def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, 
Column]] = None) -> Column:
+def hll_sketch_agg(
+col: "ColumnOrName",
+lgConfigK: Optional[Union[int, Column]] = None,
+) -> Column:
 if lgConfigK is None:
 return _invoke_function_over_columns("hll_sketch_agg", col)
 else:
-_lgConfigK = lit(lgConfigK) if isinstance(lgConfigK, int) else 
lgConfigK

Review Comment:
   oh, it is not a logical change.
   The `lit` function actually accepts a `Column` input and returns the column itself, so for a `Union[int, Column]` input we don't need this `if else` (there should be other similar cases; I just updated this one BTW).
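   The pass-through behavior can be sketched with a small stand-in for PySpark's `lit` (the `Column` class below is illustrative, not the real PySpark type):

   ```python
   class Column:
       """Minimal stand-in for pyspark.sql.Column (illustrative only)."""
       def __init__(self, expr):
           self.expr = expr

   def lit(value):
       # Mimics PySpark's lit(): a Column is returned unchanged,
       # any other value is wrapped as a literal Column.
       if isinstance(value, Column):
           return value
       return Column(("literal", value))

   c = Column("x")
   assert lit(c) is c                     # Column input: passes through
   assert lit(3).expr == ("literal", 3)   # int input: wrapped as a literal

   # So for a Union[int, Column] argument, both forms are equivalent:
   #   _lgConfigK = lit(lgConfigK) if isinstance(lgConfigK, int) else lgConfigK
   #   _lgConfigK = lit(lgConfigK)
   ```
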






Re: [PR] [WIP][MINOR] Remove unnecessary `imports` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46161:
URL: https://github.com/apache/spark/pull/46161#issuecomment-2071159435

   To @panbingkun , please don't use `[WIP]` in the PR title. GitHub's `Draft` feature is better to prevent merging.
   
   It seems that you frequently forget to remove `[WIP]` from the PR title. :)





Re: [PR] [SPARK-47903][PYTHON][FOLLOW-UP] Removed changes relating to try_parse_json [spark]

2024-04-22 Thread via GitHub


harshmotw-db commented on code in PR #46170:
URL: https://github.com/apache/spark/pull/46170#discussion_r1575488191


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpectsInputTypes.scala:
##


Review Comment:
   This change is not accidental. A very old version of the `try_parse_json` PR had removed a line in this file, and that removal was later fixed within that PR.






[PR] [SPARK-47903][PYTHON][FOLLOW-UP] Removed changes relating to try_parse_json [spark]

2024-04-22 Thread via GitHub


harshmotw-db opened a new pull request, #46170:
URL: https://github.com/apache/spark/pull/46170

   
   
   ### What changes were proposed in this pull request?
   
   Removed changes relating to `try_parse_json` that were accidentally pushed 
during the late stages of this PR.
   
   ### Why are the changes needed?
   
   There is already another PR in progress adding support for `try_parse_json` 
and the implementation that was accidentally pushed is outdated.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it removes the `try_parse_json` that was added just now. This feature 
will be added again soon.
   
   ### How was this patch tested?
   
   NA
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-47935][INFRA][PYTHON] Pin `pandas==2.0.3` for `pypy3.8` [spark]

2024-04-22 Thread via GitHub


zhengruifeng commented on PR #46159:
URL: https://github.com/apache/spark/pull/46159#issuecomment-2071153358

   thank you @dongjoon-hyun and @HyukjinKwon for reviews





Re: [PR] [SPARK-47903][PYTHON] Add support for remaining scalar types in the PySpark Variant library [spark]

2024-04-22 Thread via GitHub


HyukjinKwon closed pull request #46122: [SPARK-47903][PYTHON] Add support for 
remaining scalar types in the PySpark Variant library
URL: https://github.com/apache/spark/pull/46122





Re: [PR] [SPARK-47903][PYTHON] Add support for remaining scalar types in the PySpark Variant library [spark]

2024-04-22 Thread via GitHub


HyukjinKwon commented on PR #46122:
URL: https://github.com/apache/spark/pull/46122#issuecomment-2071123524

   Merged to master.





[PR] [CONNECT] Use v1 as spark connect go library starting version [spark-connect-go]

2024-04-22 Thread via GitHub


hiboyang opened a new pull request, #19:
URL: https://github.com/apache/spark-connect-go/pull/19

   ### What changes were proposed in this pull request?
   
   Use v1 for this Spark Connect Go Client library, instead of previously using 
v34.
   
   ### Why are the changes needed?
   
   There was a recent discussion in the Spark community about the version naming convention for the Spark Operator. People prefer a versioning scheme that is independent of Spark versions. That applies to the Spark Connect Go Client as well, so the version here is changed to start from v1.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, Spark Connect Go Client users will use v1 instead of v34.
   
   ### How was this patch tested?
   
   Ran unit tests.





Re: [PR] [WIP][MINOR] Remove unnecessary `imports` [spark]

2024-04-22 Thread via GitHub


panbingkun commented on PR #46161:
URL: https://github.com/apache/spark/pull/46161#issuecomment-2071119544

   > +1, LGTM.
   > 
   > BTW, do you know why this is not detected by `UnusedImport` rule?
   
   After this change (introduced in `scala 2.13`), `scala.collection.immutable.IndexedSeq` has been aliased in `src/library/scala/package.scala`, and this alias means we no longer need an explicit import when using `IndexedSeq`.
   
   https://github.com/scala/scala/commit/457dfd07449a8b7943b2a6c614fb57ea2a7b452e#diff-4b7ff743cb5f9fa1d39314acc17d6d0eebd6b9d74ce333dd6c5549cca13de731
   https://github.com/apache/spark/assets/15246973/8fa2c595-7f80-4081-877f-12a039bf34ab
   https://github.com/apache/spark/assets/15246973/a16136a5-c3e6-4b7c-bb0b-7db141581265
   





Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-22 Thread via GitHub


chaoqin-li1123 commented on code in PR #45977:
URL: https://github.com/apache/spark/pull/45977#discussion_r1575445077


##
python/pyspark/sql/datasource.py:
##
@@ -469,6 +501,200 @@ def stop(self) -> None:
 ...
 
 
+class SimpleInputPartition(InputPartition):

Review Comment:
   I see, moved it to another file.






Re: [PR] [SPARK-47929]Setup Static Analysis for Operator [spark-kubernetes-operator]

2024-04-22 Thread via GitHub


jiangzho commented on code in PR #6:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/6#discussion_r1575438072


##
config/checkstyle/checkstyle.xml:
##
@@ -0,0 +1,195 @@
+
+
https://checkstyle.org/dtds/configuration_1_3.dtd">
+
+
+
+

Review Comment:
   Yes, this is taken from the Spark repo. The major differences are:
   
   * Remove `checkstyle-suppressions.xml`, as there is no SuppressionFilter expected for this repository yet.
   * Style differences for line length, indent, etc.






Re: [PR] [SPARK-47929]Setup Static Analysis for Operator [spark-kubernetes-operator]

2024-04-22 Thread via GitHub


jiangzho commented on PR #6:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/6#issuecomment-2071080506

   Thanks @dongjoon-hyun for the review!
   
   yes, we target the latest versions where possible.
   
   [Checkstyle](https://checkstyle.sourceforge.io/) latest 10.15.0
   [pmd](https://pmd.github.io/) latest 7.0.0 - it's a major version released on Mar 22, 2024, with another minor version 7.1.0 expected on Apr 26. Thus we started with the latest released 6.x version in this PR while evaluating v7.
   [Spotbugs plugin](https://plugins.gradle.org/plugin/com.github.spotbugs) 
latest version 6.0.12
   [spotbugs](https://github.com/spotbugs/spotbugs) latest version 4.8.4
   [Spotless 
plugin](https://plugins.gradle.org/plugin/com.diffplug.gradle.spotless) latest 
version 6.25.0
   





Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


zeotuan commented on code in PR #46151:
URL: https://github.com/apache/spark/pull/46151#discussion_r1575426488


##
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala:
##
@@ -358,15 +359,17 @@ class KMeans private (
 }
 
 val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9
-logInfo(f"Iterations took $iterationTimeInSeconds%.3f seconds.")
+logInfo(log"Iterations took ${MDC(TIME_UNITS, iterationTimeInSeconds%.3f)} 
seconds.")
 
 if (iteration == maxIterations) {
-  logInfo(s"KMeans reached the max number of iterations: $maxIterations.")
+  logInfo(log"KMeans reached the max number of" +
+log" iterations: ${MDC(NUM_ITERATIONS, maxIterations)}.")
 } else {
-  logInfo(s"KMeans converged in $iteration iterations.")
+  logInfo(log"KMeans converged in " +
+log"${MDC(NUM_ITERATIONS, iteration)} iterations.")

Review Comment:
   Merged into one line






Re: [PR] [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46169:
URL: https://github.com/apache/spark/pull/46169#issuecomment-2071065080

   Merged to branch-3.5.





Re: [PR] [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46169: [SPARK-47904][SQL][3.5] Preserve case 
in Avro schema when using enableStableIdentifiersForUnionType
URL: https://github.com/apache/spark/pull/46169





Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


zeotuan commented on code in PR #46151:
URL: https://github.com/apache/spark/pull/46151#discussion_r1575423794


##
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala:
##
@@ -253,6 +254,13 @@ private[spark] class OptionalInstrumentation private(
 }
   }
 
+  override def logInfo(logEntry: LogEntry): Unit = {

Review Comment:
   This is so that classes that use `OptionalInstrumentation` for logging, such as [WeightedLeastSquares](mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala) and [CrossValidator](mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala), can also log `MessageWithContext`. Otherwise I get `Type Mismatched String => MessageWithContext`.






Re: [PR] [SPARK-47910][CORE] close stream when DiskBlockObjectWriter closeResources to avoid memory leak [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46131:
URL: https://github.com/apache/spark/pull/46131#discussion_r1575409704


##
core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:
##
@@ -174,23 +174,46 @@ private[spark] class DiskBlockObjectWriter(
* Should call after committing or reverting partial writes.
*/
   private def closeResources(): Unit = {
-if (initialized) {
-  Utils.tryWithSafeFinally {
-mcs.manualClose()
-  } {
-channel = null
-mcs = null
-bs = null
-fos = null
-ts = null
-objOut = null
-initialized = false
-streamOpen = false
-hasBeenClosed = true
+try {
+  if (streamOpen) {
+Utils.tryWithSafeFinally {
+  objOut = closeIfNonNull(objOut)
+  bs = null
+} {
+  bs = closeIfNonNull(bs)
+}
+  }
+} catch {
+  case e: IOException =>
+logError(log"Exception occurred while closing the output stream" +
+  log"${MDC(ERROR, e.getMessage)}")
+} finally {
+  if (initialized) {
+Utils.tryWithSafeFinally {
+  mcs.manualClose()
+} {
+  channel = null
+  mcs = null
+  bs = null
+  fos = null
+  ts = null
+  objOut = null
+  initialized = false
+  streamOpen = false
+  hasBeenClosed = true
+}
   }
 }
   }
 
+

Review Comment:
   nit. Redundant empty line. Please remove this.






Re: [PR] [SPARK-44170][BUILD][FOLLOWUP] Align JUnit5 dependency's version and clean up exclusions [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46160:
URL: https://github.com/apache/spark/pull/46160#discussion_r1575402128


##
pom.xml:
##
@@ -220,6 +220,9 @@
 4.1.109.Final
 2.0.65.Final
 72.1
+5.9.1

Review Comment:
   May I ask why we downgrade this, @pan3793? Previously, we had `5.9.3`.
   
   
https://github.com/apache/spark/blob/ac9a12ef6e062ae07e878e202521b22de9979a17/pom.xml#L1239






Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575399787


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -3804,12 +3804,14 @@ def sha2(col: "ColumnOrName", numBits: int) -> Column:
 sha2.__doc__ = pysparkfuncs.sha2.__doc__
 
 
-def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, 
Column]] = None) -> Column:
+def hll_sketch_agg(
+col: "ColumnOrName",
+lgConfigK: Optional[Union[int, Column]] = None,
+) -> Column:
 if lgConfigK is None:
 return _invoke_function_over_columns("hll_sketch_agg", col)
 else:
-_lgConfigK = lit(lgConfigK) if isinstance(lgConfigK, int) else 
lgConfigK

Review Comment:
   This is a logical code change, isn't it? If the previous code is a bug, 
please file a JIRA and handle it separately, @zhengruifeng .






Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575400474


##
python/pyspark/sql/functions/builtin.py:
##
@@ -19551,8 +19554,7 @@ def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, Column]]
 if lgConfigK is None:
 return _invoke_function_over_columns("hll_sketch_agg", col)
 else:
-_lgConfigK = lit(lgConfigK) if isinstance(lgConfigK, int) else lgConfigK

Review Comment:
   ditto. This is a logical code change which is irrelevant to `docstring`.






Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575400033


##
python/pyspark/sql/functions/builtin.py:
##
@@ -19502,7 +19502,10 @@ def unwrap_udt(col: "ColumnOrName") -> Column:
 
 
 @_try_remote_functions
-def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, Column]] = None) -> Column:
+def hll_sketch_agg(
+col: "ColumnOrName",
+lgConfigK: Optional[Union[int, Column]] = None,
+) -> Column:

Review Comment:
   ditto.






Re: [PR] [SPARK-47937][PYTHON][DOCS] Fix docstring of `hll_sketch_agg` [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on code in PR #46163:
URL: https://github.com/apache/spark/pull/46163#discussion_r1575399123


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -3804,12 +3804,14 @@ def sha2(col: "ColumnOrName", numBits: int) -> Column:
 sha2.__doc__ = pysparkfuncs.sha2.__doc__
 
 
-def hll_sketch_agg(col: "ColumnOrName", lgConfigK: Optional[Union[int, Column]] = None) -> Column:
+def hll_sketch_agg(
+col: "ColumnOrName",
+lgConfigK: Optional[Union[int, Column]] = None,
+) -> Column:

Review Comment:
   This doesn't look like a doc string. May I ask if we need this style change?






Re: [PR] [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46168:
URL: https://github.com/apache/spark/pull/46168#issuecomment-2071008564

   Merged to master for Apache Spark 4.0.0





Re: [PR] [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46168: [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support
URL: https://github.com/apache/spark/pull/46168





Re: [PR] [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


sadikovi commented on PR #46169:
URL: https://github.com/apache/spark/pull/46169#issuecomment-2071002159

   cc @dongjoon-hyun 





[PR] [SPARK-47904][SQL][3.5] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


sadikovi opened a new pull request, #46169:
URL: https://github.com/apache/spark/pull/46169

   
   
   ### What changes were proposed in this pull request?
   
   
   Backport of https://github.com/apache/spark/pull/46126 to branch-3.5.
   
   When `enableStableIdentifiersForUnionType` is enabled, all of the types are 
lowercased which creates a problem when field types are case-sensitive: 
   
   Union type with fields:
   ```
   Schema.createEnum("myENUM", "", null, List[String]("E1", "e2").asJava),
   Schema.createRecord("myRecord2", "", null, false, List[Schema.Field](new Schema.Field("F", Schema.create(Type.FLOAT))).asJava)
   ```
   
   would become
   
   ```
   struct<member_myenum: string, member_myrecord2: struct<F: float>>
   ```
   
   but instead should be 
   ```
   struct<member_myENUM: string, member_myRecord2: struct<F: float>>
   ```
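
   The before/after naming can be sketched with a tiny model of the `member_<typeName>` convention (a hypothetical Python illustration of the behavior described above, not the Scala code in this PR):

   ```python
   def stable_member_name(type_name: str, preserve_case: bool) -> str:
       # Union members get stable names of the form member_<typeName>;
       # before the fix the type name was lowercased, after the fix the
       # original case is kept.
       name = type_name if preserve_case else type_name.lower()
       return f"member_{name}"


   # Buggy behavior: case is lost.
   assert stable_member_name("myENUM", preserve_case=False) == "member_myenum"
   assert stable_member_name("myRecord2", preserve_case=False) == "member_myrecord2"

   # Fixed behavior: case is preserved.
   assert stable_member_name("myENUM", preserve_case=True) == "member_myENUM"
   assert stable_member_name("myRecord2", preserve_case=True) == "member_myRecord2"
   ```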
   
   ### Why are the changes needed?
   
   
   Fixes a bug where the type portion of the generated union member field name was lowercased.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, if a user enables `enableStableIdentifiersForUnionType` and has Union 
types, all fields will preserve the case. Previously, the field names would be 
all in lowercase.
   
   
   ### How was this patch tested?
   
   
   I added a test case to verify the new field names.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No.





Re: [PR] [SPARK-47904][SQL] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


sadikovi commented on PR #46126:
URL: https://github.com/apache/spark/pull/46126#issuecomment-2070979626

   Yes, I will open a separate PR.





Re: [PR] [SPARK-47904][SQL] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46126:
URL: https://github.com/apache/spark/pull/46126#issuecomment-2070968153

   There is a small conflict. Could you make a backporting PR to `branch-3.5`?





Re: [PR] [SPARK-47904][SQL] Preserve case in Avro schema when using enableStableIdentifiersForUnionType [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46126: [SPARK-47904][SQL] Preserve case in Avro schema when using enableStableIdentifiersForUnionType
URL: https://github.com/apache/spark/pull/46126





Re: [PR] [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46168:
URL: https://github.com/apache/spark/pull/46168#issuecomment-2070956886

   Thank you, @viirya !





Re: [PR] [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46168:
URL: https://github.com/apache/spark/pull/46168#issuecomment-2070953251

   Could you review this K8s version doc PR, @viirya ?





Re: [PR] [SPARK-47907][SQL] Put bang under a config [spark]

2024-04-22 Thread via GitHub


gengliangwang closed pull request #46138: [SPARK-47907][SQL] Put bang under a config
URL: https://github.com/apache/spark/pull/46138





[PR] [SPARK-47942][K8S][DOCS] Drop K8s v1.26 Support [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun opened a new pull request, #46168:
URL: https://github.com/apache/spark/pull/46168

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47907] Put bang under a config [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on PR #46138:
URL: https://github.com/apache/spark/pull/46138#issuecomment-2070929813

   Thanks, merging to master.





Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on PR #46151:
URL: https://github.com/apache/spark/pull/46151#issuecomment-2070917462

   @zeotuan Thanks for the work! LGTM overall except for some minor comments.





Re: [PR] [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun closed pull request #46167: [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT
URL: https://github.com/apache/spark/pull/46167





Re: [PR] [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT [spark]

2024-04-22 Thread via GitHub


dongjoon-hyun commented on PR #46167:
URL: https://github.com/apache/spark/pull/46167#issuecomment-2070905656

   Thank you! Merged to master.





Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on code in PR #46151:
URL: https://github.com/apache/spark/pull/46151#discussion_r1575322770


##
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala:
##
@@ -358,15 +359,17 @@ class KMeans private (
 }
 
 val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9
-logInfo(f"Iterations took $iterationTimeInSeconds%.3f seconds.")
+logInfo(log"Iterations took ${MDC(TIME_UNITS, iterationTimeInSeconds%.3f)} seconds.")
 
 if (iteration == maxIterations) {
-  logInfo(s"KMeans reached the max number of iterations: $maxIterations.")
+  logInfo(log"KMeans reached the max number of" +
+log" iterations: ${MDC(NUM_ITERATIONS, maxIterations)}.")
 } else {
-  logInfo(s"KMeans converged in $iteration iterations.")
+  logInfo(log"KMeans converged in " +
+log"${MDC(NUM_ITERATIONS, iteration)} iterations.")

Review Comment:
   nit: can we merge into one line?
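
   For context, the MDC pattern these diffs migrate to can be modeled in a few lines (a hypothetical Python analogue of Spark's structured logging; the function and key names here are illustrative, not Spark's Scala API):

   ```python
   # Toy model: each interpolated value carries a key, so the rendered
   # message can also be emitted as structured key/value data.

   def mdc(context: dict, key: str, value) -> str:
       # Record the key/value pair and return the rendered value.
       context[key] = value
       return str(value)


   ctx: dict = {}
   iteration = 7
   message = f"KMeans converged in {mdc(ctx, 'NUM_ITERATIONS', iteration)} iterations."

   assert message == "KMeans converged in 7 iterations."
   assert ctx == {"NUM_ITERATIONS": 7}
   ```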






Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on code in PR #46151:
URL: https://github.com/apache/spark/pull/46151#discussion_r1575320474


##
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala:
##
@@ -253,6 +254,13 @@ private[spark] class OptionalInstrumentation private(
 }
   }
 
+  override def logInfo(logEntry: LogEntry): Unit = {

Review Comment:
   why do we need this?






Re: [PR] [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework [spark]

2024-04-22 Thread via GitHub


gengliangwang commented on code in PR #46151:
URL: https://github.com/apache/spark/pull/46151#discussion_r1575319360


##
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala:
##
@@ -53,8 +54,8 @@ private[spark] class Instrumentation private () extends Logging with MLEvents {
 // estimator.getClass.getSimpleName can cause Malformed class name error,
 // call safer `Utils.getSimpleName` instead
 val className = Utils.getSimpleName(stage.getClass)
-logInfo(s"Stage class: $className")
-logInfo(s"Stage uid: ${stage.uid}")
+logInfo(log"Stage class: ${MDC(CLASS_NAME, className)}")
+logInfo(log"Stage uid: ${MDC(STAGE_ID, stage.uid)}")

Review Comment:
   ```suggestion
   logInfo(log"Stage uid: ${MDC(PIPELINE_STAGE_UID, stage.uid)}")
   ```





