[GitHub] [spark] ulysses-you opened a new pull request, #40574: [SPARK-42942][SQL] Support coalesce table cache stage partitions

2023-03-27 Thread via GitHub


ulysses-you opened a new pull request, #40574:
URL: https://github.com/apache/spark/pull/40574

   
   
   ### What changes were proposed in this pull request?
   
   Add a new rule, `CoalesceCachePartitions`, to support coalescing partitions of `TableCacheQueryStageExec`. To reuse the code path of `CoalesceShufflePartitions`, this PR also slightly refactors how we coalesce partitions.
   
   RDD cache uses the RDD id and partition id as the block id, so it does not seem possible to split skewed partitions the way shuffle does. To reduce complexity, this PR does not allow coalescing partitions when both shuffle and cache stages are involved, since the shuffle read may contain skewed partition specs.
   
   For example, the following case cannot be coalesced by either `CoalesceCachePartitions` or `CoalesceShufflePartitions`:
   ```
   SMJ
     ShuffleQueryStage
     TableCacheStage
   ```
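
   A hedged PySpark sketch of a query shape that can mix a cache stage with a shuffle stage under AQE (table names are illustrative):
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # One join child is backed by an RDD cache (a table cache stage under AQE);
   # the other child needs a shuffle (a shuffle stage), so per this PR neither
   # side's partitions would be coalesced.
   cached = spark.table("t1").cache()
   cached.join(spark.table("t2"), "key").collect()
   ```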
   
   ### Why are the changes needed?
   
   Make AQE support coalesce table cache stage partitions.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this adds a new config to control whether partitions of table cache stages are coalesced.
   
   ### How was this patch tested?
   
   Added tests.





[GitHub] [spark] HyukjinKwon closed pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized

2023-03-27 Thread via GitHub


HyukjinKwon closed pull request #40534: [SPARK-42908][PYTHON] Raise 
RuntimeError when SparkContext is required but not initialized
URL: https://github.com/apache/spark/pull/40534





[GitHub] [spark] HyukjinKwon commented on pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40534:
URL: https://github.com/apache/spark/pull/40534#issuecomment-1486246281

   Merged to master and branch-3.4.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40536: [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40536:
URL: https://github.com/apache/spark/pull/40536#discussion_r1150045902


##
python/pyspark/sql/connect/client.py:
##
@@ -997,10 +1000,32 @@ def config(self, operation: pb2.ConfigRequest.Operation) -> ConfigResult:
                 )
                 return ConfigResult.fromProto(resp)
             raise SparkConnectException("Invalid state during retry exception handling.")
-        except grpc.RpcError as rpc_error:
-            self._handle_error(rpc_error)
+        except Exception as error:
+            self._handle_error(error)
+
+    def _handle_error(self, error: Exception) -> NoReturn:
+        """
+        Handle errors that occur during RPC calls.
+
+        Parameters
+        ----------
+        error : Exception
+            An exception thrown during RPC calls.
+
+        Returns
+        -------
+        Throws the appropriate internal Python exception.
+        """
+        if isinstance(error, grpc.RpcError):
+            self._handle_rpc_error(cast(grpc.RpcError, error))
+        elif isinstance(error, ValueError):
+            if "Cannot invoke RPC" in str(error) and "closed" in str(error):
+                raise SparkConnectException(
+                    error_class="NO_ACTIVE_SESSION", message_parameters=dict()
+                )

Review Comment:
   Would be great to double-check how it looks in the console.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40536: [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40536:
URL: https://github.com/apache/spark/pull/40536#discussion_r1150045613


##
python/pyspark/sql/connect/client.py:
##
@@ -997,10 +1000,32 @@ def config(self, operation: pb2.ConfigRequest.Operation) -> ConfigResult:
                 )
                 return ConfigResult.fromProto(resp)
             raise SparkConnectException("Invalid state during retry exception handling.")
-        except grpc.RpcError as rpc_error:
-            self._handle_error(rpc_error)
+        except Exception as error:
+            self._handle_error(error)
+
+    def _handle_error(self, error: Exception) -> NoReturn:
+        """
+        Handle errors that occur during RPC calls.
+
+        Parameters
+        ----------
+        error : Exception
+            An exception thrown during RPC calls.
+
+        Returns
+        -------
+        Throws the appropriate internal Python exception.
+        """
+        if isinstance(error, grpc.RpcError):
+            self._handle_rpc_error(cast(grpc.RpcError, error))
+        elif isinstance(error, ValueError):
+            if "Cannot invoke RPC" in str(error) and "closed" in str(error):
+                raise SparkConnectException(
+                    error_class="NO_ACTIVE_SESSION", message_parameters=dict()
+                )

Review Comment:
   I think Python exception chaining would work here so it'd be fine.






[GitHub] [spark] HyukjinKwon closed pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions

2023-03-27 Thread via GitHub


HyukjinKwon closed pull request #40572: [SPARK-37677][CORE] Unzip could keep 
file permissions
URL: https://github.com/apache/spark/pull/40572





[GitHub] [spark] HyukjinKwon commented on pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40572:
URL: https://github.com/apache/spark/pull/40572#issuecomment-1486238107

   Merged to master.





[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-27 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1486232865

   @cloud-fan Can you please check my last comments?





[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-27 Thread via GitHub


shrprasa commented on PR #40128:
URL: https://github.com/apache/spark/pull/40128#issuecomment-1486231326

   Gentle Ping @dongjoon-hyun @holdenk





[GitHub] [spark] amaliujia commented on a diff in pull request #40536: [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions

2023-03-27 Thread via GitHub


amaliujia commented on code in PR #40536:
URL: https://github.com/apache/spark/pull/40536#discussion_r1150013541


##
python/pyspark/sql/connect/client.py:
##
@@ -997,10 +1000,32 @@ def config(self, operation: pb2.ConfigRequest.Operation) -> ConfigResult:
                 )
                 return ConfigResult.fromProto(resp)
             raise SparkConnectException("Invalid state during retry exception handling.")
-        except grpc.RpcError as rpc_error:
-            self._handle_error(rpc_error)
+        except Exception as error:
+            self._handle_error(error)
+
+    def _handle_error(self, error: Exception) -> NoReturn:
+        """
+        Handle errors that occur during RPC calls.
+
+        Parameters
+        ----------
+        error : Exception
+            An exception thrown during RPC calls.
+
+        Returns
+        -------
+        Throws the appropriate internal Python exception.
+        """
+        if isinstance(error, grpc.RpcError):
+            self._handle_rpc_error(cast(grpc.RpcError, error))
+        elif isinstance(error, ValueError):
+            if "Cannot invoke RPC" in str(error) and "closed" in str(error):
+                raise SparkConnectException(
+                    error_class="NO_ACTIVE_SESSION", message_parameters=dict()
+                )

Review Comment:
   I am thinking this will swallow other ValueErrors? You probably need to re-raise any ValueError that is not the closed-channel one.
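
   A minimal sketch of that re-raise, assuming the method shape from the diff above (the `SparkConnectException` import path is an assumption):
   ```python
   from typing import NoReturn

   import grpc

   from pyspark.errors import SparkConnectException  # import path is an assumption


   def _handle_error(self, error: Exception) -> NoReturn:
       """Translate known errors; re-raise everything else so other
       ValueErrors are not silently swallowed."""
       if isinstance(error, grpc.RpcError):
           self._handle_rpc_error(error)
       elif isinstance(error, ValueError) and "Cannot invoke RPC" in str(error) and "closed" in str(error):
           raise SparkConnectException(
               error_class="NO_ACTIVE_SESSION", message_parameters=dict()
           )
       raise error  # nothing above matched: propagate the original error
   ```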






[GitHub] [spark] yaooqinn commented on pull request #40573: [SPARK-42943][SQL] Use LONGTEXT instead of TEXT for StringType for effective length

2023-03-27 Thread via GitHub


yaooqinn commented on PR #40573:
URL: https://github.com/apache/spark/pull/40573#issuecomment-1486195050

   cc @cloud-fan @dongjoon-hyun @HyukjinKwon, thanks





[GitHub] [spark] yaooqinn commented on a diff in pull request #40573: [SPARK-42943][SQL] Use LONGTEXT instead of TEXT for StringType for effective length

2023-03-27 Thread via GitHub


yaooqinn commented on code in PR #40573:
URL: https://github.com/apache/spark/pull/40573#discussion_r1150009461


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala:
##
@@ -176,6 +176,7 @@ private case object MySQLDialect extends JdbcDialect with SQLConfHelper {
     // See SPARK-35446: MySQL treats REAL as a synonym to DOUBLE by default
     // We override getJDBCType so that FloatType is mapped to FLOAT instead
     case FloatType => Option(JdbcType("FLOAT", java.sql.Types.FLOAT))
+    case StringType => Option(JdbcType("LONGTEXT", java.sql.Types.LONGVARCHAR))

Review Comment:
   Before this, it was `case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))`.






[GitHub] [spark] yaooqinn opened a new pull request, #40573: [SPARK-42943][SQL] Use LONGTEXT instead of TEXT for StringType for effective length

2023-03-27 Thread via GitHub


yaooqinn opened a new pull request, #40573:
URL: https://github.com/apache/spark/pull/40573

   
   
   
   ### What changes were proposed in this pull request?
   
   
   Referring to https://dev.mysql.com/doc/refman/8.0/en/string-type-syntax.html, a [TEXT](https://dev.mysql.com/doc/refman/8.0/en/blob.html) column has a maximum length of 65,535 (2^16 − 1) characters.
   
   We currently map Spark's string type to MySQL's `TEXT` and JDBC's `CLOB`. `TEXT` is insufficient here, and `CLOB` is incorrect; it should be replaced by `LONGVARCHAR`.
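
   A minimal sketch of where the new mapping takes effect, assuming a reachable MySQL instance (the JDBC URL, credentials, and table name are illustrative):
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # A 70,000-character string: longer than TEXT's 65,535-character cap, so it
   # only fits once the column is created as LONGTEXT.
   df = spark.range(1).selectExpr("repeat('x', 70000) AS payload")

   (df.write.format("jdbc")
       .option("url", "jdbc:mysql://localhost:3306/testdb")  # illustrative
       .option("dbtable", "long_strings")
       .option("user", "root")
       .option("password", "secret")
       .save())
   ```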
   
   
   ### Why are the changes needed?
   
   
   Better compatibility with MySQL, plus a bug fix.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes. You will no longer get `MysqlDataTruncation` when storing a string longer than 65,535 characters into a column defined as Spark's string type with the MySQL catalog.
   
   ### How was this patch tested?
   
   
   New tests.
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1149997056


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -297,8 +297,14 @@ abstract class InMemoryBaseTable(
       InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema)
 
     override def pruneColumns(requiredSchema: StructType): Unit = {
-      val schemaNames = metadataColumnNames ++ tableSchema.map(_.name)
-      schema = StructType(requiredSchema.filter(f => schemaNames.contains(f.name)))
+      // The required schema could contain conflict-renamed metadata columns, so we need to match
+      // them by their logical (original) names, not their current names.
+      val schemaNames = tableSchema.map(_.name).toSet
+      val prunedFields = requiredSchema.filter {

Review Comment:
   > ... simply won't gain the ability to rename metadata columns on conflict.
   
   Is it true? The rename happens on the Spark side, when Spark appends metadata columns to the normal output columns.






[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

2023-03-27 Thread via GitHub


cloud-fan commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1486163555

   OK, now I get the use case. At the time we add the rebalance hint, we don't know what the final query will be. Shall we make this optimization a bit more conservative and match both global and local limit? I still think it's risky to remove a rebalance below a local limit.





[GitHub] [spark] bersprockets commented on a diff in pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

2023-03-27 Thread via GitHub


bersprockets commented on code in PR #40569:
URL: https://github.com/apache/spark/pull/40569#discussion_r1149995693


##
sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala:
##
@@ -2695,4 +2695,26 @@ class SubquerySuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-42937: Outer join with subquery in condition") {
+    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false",

Review Comment:
   @allisonwang-db 
   
   When adaptive execution is enabled, `PlanAdaptiveSubqueries` always sets 
`shouldBroadcast=true`, so the subquery's result is available on the executor, 
if needed.
   
   
https://github.com/apache/spark/blob/31965a06c9f85abf2296971237b1f88065eb67c2/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveSubqueries.scala#L46






[GitHub] [spark] srowen commented on pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom

2023-03-27 Thread via GitHub


srowen commented on PR #40568:
URL: https://github.com/apache/spark/pull/40568#issuecomment-1486162373

   Merged to master/3.4/3.3





[GitHub] [spark] srowen closed pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom

2023-03-27 Thread via GitHub


srowen closed pull request #40568: [SPARK-42922][SQL] Move from Random to 
SecureRandom
URL: https://github.com/apache/spark/pull/40568





[GitHub] [spark] HyukjinKwon commented on pull request #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40570:
URL: https://github.com/apache/spark/pull/40570#issuecomment-1486155106

   Merged to master and branch-3.4.





[GitHub] [spark] HyukjinKwon closed pull request #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator

2023-03-27 Thread via GitHub


HyukjinKwon closed pull request #40570: [SPARK-41876][CONNECT][PYTHON] 
Implement DataFrame.toLocalIterator
URL: https://github.com/apache/spark/pull/40570





[GitHub] [spark] HyukjinKwon closed pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring

2023-03-27 Thread via GitHub


HyukjinKwon closed pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] 
Rename isBarrier to barrier, and correct docstring
URL: https://github.com/apache/spark/pull/40571





[GitHub] [spark] HyukjinKwon commented on pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40571:
URL: https://github.com/apache/spark/pull/40571#issuecomment-1486153075

   Merged to master.





[GitHub] [spark] smallzhongfeng commented on pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions

2023-03-27 Thread via GitHub


smallzhongfeng commented on PR #40572:
URL: https://github.com/apache/spark/pull/40572#issuecomment-1486149209

   Discussed in 
https://github.com/apache/spark/pull/35278#issuecomment-1033927506, PTAL. 
@HyukjinKwon
   





[GitHub] [spark] smallzhongfeng opened a new pull request, #40572: [SPARK-37677][CORE] Unzip could keep file permissions

2023-03-27 Thread via GitHub


smallzhongfeng opened a new pull request, #40572:
URL: https://github.com/apache/spark/pull/40572

   
   
   ### What changes were proposed in this pull request?
   
   Just removes a comment.
   
   ### Why are the changes needed?
   
   After https://github.com/apache/hadoop/pull/4036, unzip can keep file permissions.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   No tests needed here; coverage was added on the hadoop-client side.





[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

2023-03-27 Thread via GitHub


wangyum commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1486144770

   If such queries cannot be optimized, their performance will be very poor. We use a single partition to fetch data from MySQL, then increase the parallelism for downstream computation once the data is fetched:
   
   ```sql
   CREATE VIEW full_query_log
   AS
   SELECT h.* FROM query_log_hdfs h
   UNION ALL
   SELECT /*+ REBALANCE */ q.*, DATE(start) FROM query_log_mysql q;
   
   SELECT * FROM full_query_log limit 5;
   ```
   
   





[GitHub] [spark] cloud-fan closed pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate

2023-03-27 Thread via GitHub


cloud-fan closed pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the 
having clause can be resolved directly by its child Aggregate
URL: https://github.com/apache/spark/pull/40558





[GitHub] [spark] cloud-fan commented on pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate

2023-03-27 Thread via GitHub


cloud-fan commented on PR #40558:
URL: https://github.com/apache/spark/pull/40558#issuecomment-1486139796

   thanks, merging to master/3.4!





[GitHub] [spark] yaooqinn commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side

2023-03-27 Thread via GitHub


yaooqinn commented on code in PR #40543:
URL: https://github.com/apache/spark/pull/40543#discussion_r1149965908


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -128,7 +128,7 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
       "('abcd', 'efgh', 'ijkl', 'mnop', 'q')").executeUpdate()
 
     conn.prepareStatement("CREATE TABLE char_array_types (" +
-      "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 varchar(4)[], c4 bpchar[])"
+      "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 varchar(4)[], c4 bpchar(1)[])"

Review Comment:
   updated 






[GitHub] [spark] yaooqinn commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side

2023-03-27 Thread via GitHub


yaooqinn commented on code in PR #40543:
URL: https://github.com/apache/spark/pull/40543#discussion_r1149965776


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -86,6 +86,8 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTes
     s"jdbc:oracle:thin:system/$oracle_password@//$ip:$port/xe"
   }
 
+  override val typeMapping: Map[DataType, DataType] = Map(StringType -> VarcharType(255))

Review Comment:
   According to the comment next to it, the 255 comes from https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/index.html, but I think we misunderstood the doc, and Int.MaxValue is much more reasonable.
   






[GitHub] [spark] allisonwang-db commented on a diff in pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

2023-03-27 Thread via GitHub


allisonwang-db commented on code in PR #40569:
URL: https://github.com/apache/spark/pull/40569#discussion_r1149956290


##
sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala:
##
@@ -2695,4 +2695,26 @@ class SubquerySuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-42937: Outer join with subquery in condition") {
+    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false",

Review Comment:
   Why does this fail only when AQE is disabled?






[GitHub] [spark] LuciferYang commented on pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`

2023-03-27 Thread via GitHub


LuciferYang commented on PR #40560:
URL: https://github.com/apache/spark/pull/40560#issuecomment-1486109613

   Thanks @gengliangwang ~





[GitHub] [spark] xinrong-meng commented on a diff in pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized

2023-03-27 Thread via GitHub


xinrong-meng commented on code in PR #40534:
URL: https://github.com/apache/spark/pull/40534#discussion_r1149944567


##
python/pyspark/sql/utils.py:
##
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return cast(FuncT, wrapped)
 
 
+def require_spark_context_initialized() -> SparkContext:

Review Comment:
   Good point, renamed.



##
python/pyspark/sql/utils.py:
##
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return cast(FuncT, wrapped)
 
 
+def require_spark_context_initialized() -> SparkContext:
+    """Raise RuntimeError if SparkContext is not initialized,
+    otherwise, returns the active SparkContext."""
+    sc = SparkContext._active_spark_context
+    if sc is None or sc._jvm is None:
+        raise RuntimeError("SparkContext must be initialized for JVM access.")

Review Comment:
   Updated.






[GitHub] [spark] wankunde commented on pull request #40523: [SPARK-42897][SQL] Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition

2023-03-27 Thread via GitHub


wankunde commented on PR #40523:
URL: https://github.com/apache/spark/pull/40523#issuecomment-1486090996

   Hi, @c21 @cloud-fan  this seems to be SMJ full outer join codegen bug, could 
you have a look at this issue ? Thanks





[GitHub] [spark] HyukjinKwon commented on pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40569:
URL: https://github.com/apache/spark/pull/40569#issuecomment-1486087962

   cc @allisonwang-db 





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40520:
URL: https://github.com/apache/spark/pull/40520#discussion_r1149940286


##
python/pyspark/sql/pandas/map_ops.py:
##
@@ -60,6 +60,7 @@ def mapInPandas(
         schema : :class:`pyspark.sql.types.DataType` or str
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+        isBarrier : Use barrier mode execution if True.

Review Comment:
   https://github.com/apache/spark/pull/40571






[GitHub] [spark] HyukjinKwon opened a new pull request, #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring

2023-03-27 Thread via GitHub


HyukjinKwon opened a new pull request, #40571:
URL: https://github.com/apache/spark/pull/40571

   ### What changes were proposed in this pull request?
   
   This PR is a follow-up of the SPARK-42896 PR (#40520) and proposes to:
   - Add `versionchanged` in its docstring.
   - Rename `isBarrier` to `barrier` to make it look Python friendly.
   - Fix some wording and examples.
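
   A hedged usage sketch after the rename (the function body and data are illustrative):
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   df = spark.range(8)

   def times_two(batches):
       # Each element is a pandas DataFrame holding one batch of rows.
       for pdf in batches:
           yield pdf.assign(id=pdf.id * 2)

   # barrier=True launches all tasks of the stage together in barrier mode.
   df.mapInPandas(times_two, schema="id long", barrier=True).show()
   ```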
   
   ### Why are the changes needed?
   
   For better documentation, and make it more Python friendly.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it renames the parameter, and fixes the documentation.
   
   ### How was this patch tested?
   
   Linters in this PR should test them out.
   





[GitHub] [spark] HyukjinKwon commented on pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40571:
URL: https://github.com/apache/spark/pull/40571#issuecomment-1486086978

   cc @WeichenXu123 and @ueshin 





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40520:
URL: https://github.com/apache/spark/pull/40520#discussion_r1149936423


##
python/pyspark/sql/pandas/map_ops.py:
##
@@ -60,6 +60,7 @@ def mapInPandas(
         schema : :class:`pyspark.sql.types.DataType` or str
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+        isBarrier : Use barrier mode execution if True.

Review Comment:
   let me just make a quick followup.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40520:
URL: https://github.com/apache/spark/pull/40520#discussion_r1149935975


##
python/pyspark/sql/pandas/map_ops.py:
##
@@ -60,6 +60,7 @@ def mapInPandas(
         schema : :class:`pyspark.sql.types.DataType` or str
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+        isBarrier : Use barrier mode execution if True.

Review Comment:
   Yes we should. with the directive `.. versionchanged`.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40534:
URL: https://github.com/apache/spark/pull/40534#discussion_r1149935560


##
python/pyspark/sql/utils.py:
##
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return cast(FuncT, wrapped)
 
 
+def require_spark_context_initialized() -> SparkContext:
+    """Raise RuntimeError if SparkContext is not initialized,
+    otherwise, returns the active SparkContext."""
+    sc = SparkContext._active_spark_context
+    if sc is None or sc._jvm is None:
+        raise RuntimeError("SparkContext must be initialized for JVM access.")
Review Comment:
   I would just say something like: `SparkContext or SparkSession should be 
created first` instead






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized

2023-03-27 Thread via GitHub


HyukjinKwon commented on code in PR #40534:
URL: https://github.com/apache/spark/pull/40534#discussion_r1149934958


##
python/pyspark/sql/utils.py:
##
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return cast(FuncT, wrapped)
 
 
+def require_spark_context_initialized() -> SparkContext:

Review Comment:
   I would name it like: `get_jvm_spark_context`



##
python/pyspark/sql/utils.py:
##
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return cast(FuncT, wrapped)
 
 
+def require_spark_context_initialized() -> SparkContext:

Review Comment:
   I would name it like: `get_active_spark_context`
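
   A sketch of the helper under this suggested name, combining it with the error message suggested in the earlier review comment (both are assumptions about the final patch):
   ```python
   from pyspark import SparkContext


   def get_active_spark_context() -> SparkContext:
       """Raise RuntimeError if no SparkContext is initialized; otherwise
       return the active SparkContext."""
       sc = SparkContext._active_spark_context
       if sc is None or sc._jvm is None:
           raise RuntimeError("SparkContext or SparkSession should be created first.")
       return sc
   ```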






[GitHub] [spark] dongjoon-hyun commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #40521:
URL: https://github.com/apache/spark/pull/40521#issuecomment-1486035003

   Got it. Thanks. +1, LGTM too.





[GitHub] [spark] ueshin commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-27 Thread via GitHub


ueshin commented on code in PR #40520:
URL: https://github.com/apache/spark/pull/40520#discussion_r1149903109


##
python/pyspark/sql/pandas/map_ops.py:
##
@@ -60,6 +60,7 @@ def mapInPandas(
         schema : :class:`pyspark.sql.types.DataType` or str
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+        isBarrier : Use barrier mode execution if True.

Review Comment:
   We should mark these new arguments by `.. versionadded`?






[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


HeartSaVioR commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1486008365

   Sigh, I didn't realize we had already taken the step of adding a Scala API for Spark Connect. I thought it existed only in PySpark. Thanks for correcting me.





[GitHub] [spark] ueshin opened a new pull request, #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator

2023-03-27 Thread via GitHub


ueshin opened a new pull request, #40570:
URL: https://github.com/apache/spark/pull/40570

   ### What changes were proposed in this pull request?
   
   Implements `DataFrame.toLocalIterator`.
   
   The argument `prefetchPartitions` won't take effect for Spark Connect.
   
   ### Why are the changes needed?
   
   Missing API.
   
   ### Does this PR introduce _any_ user-facing change?
   
   `DataFrame.toLocalIterator` will be available.
   
   ### How was this patch tested?
   
   Enabled the related tests.
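
   A short usage sketch under Spark Connect (the connect URL is illustrative):
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
   df = spark.range(10)

   # Rows are fetched from the server incrementally rather than collected at
   # once; prefetchPartitions is accepted but has no effect under Spark Connect.
   for row in df.toLocalIterator():
       print(row.id)
   ```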





[GitHub] [spark] amaliujia commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


amaliujia commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1486006675

   Hmm, I am not sure what you already did, but I am thinking that if you don't add anything to https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, then you need to update `CheckConnectJvmClientCompatibility`.
   
   





[GitHub] [spark] HyukjinKwon closed pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas

2023-03-27 Thread via GitHub


HyukjinKwon closed pull request #40521: [MINOR][DOCS][PYTHON] Update some urls 
about deprecated repository pyspark.pandas
URL: https://github.com/apache/spark/pull/40521





[GitHub] [spark] HyukjinKwon commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas

2023-03-27 Thread via GitHub


HyukjinKwon commented on PR #40521:
URL: https://github.com/apache/spark/pull/40521#issuecomment-1486003187

   Merged to master.





[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


HeartSaVioR commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1485999022

   I was wondering what is different between dropDuplicates and this one; I don't see dropDuplicates being handled separately. Is it because the PySpark implementation of dropDuplicates is available?
   
   If this method has to be excluded, could you please guide me on how to do that? Thanks in advance.





[GitHub] [spark] amaliujia commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


amaliujia commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1485996207

   I think you need update at 
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala





[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


HeartSaVioR commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1485964292

   @HyukjinKwon @amaliujia 
   Would you mind if I ask what happened with the MiMa check for this PR?
   https://github.com/HeartSaVioR/spark/actions/runs/4536405777/jobs/7993077860
   
   Is it required to add the PySpark API in this PR to pass the Spark Connect check? At least, the MiMa check failed in the Scala codebase.





[GitHub] [spark] dongjoon-hyun commented on pull request #40533: [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #40533:
URL: https://github.com/apache/spark/pull/40533#issuecomment-1485948727

   Merged to master/3.4/3.3/3.2.





[GitHub] [spark] dongjoon-hyun closed pull request #40533: [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix

2023-03-27 Thread via GitHub


dongjoon-hyun closed pull request #40533: [SPARK-42906][K8S] Replace a starting 
digit with `x` in resource name prefix
URL: https://github.com/apache/spark/pull/40533





[GitHub] [spark] bersprockets opened a new pull request, #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

2023-03-27 Thread via GitHub


bersprockets opened a new pull request, #40569:
URL: https://github.com/apache/spark/pull/40569

   ### What changes were proposed in this pull request?
   
   Change `PlanSubqueries` to set `shouldBroadcast` to true when instantiating 
an `InSubqueryExec` instance.
   
   ### Why are the changes needed?
   
   The below left outer join gets an error:
   ```
   create or replace temp view v1 as
   select * from values
   (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
   (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
   (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
   as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
value9, value10);
   
   create or replace temp view v2 as
   select * from values
   (1, 2),
   (3, 8),
   (7, 9)
   as v2(a, b);
   
   create or replace temp view v3 as
   select * from values
   (3),
   (8)
   as v3(col1);
   
   set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
   set spark.sql.adaptive.enabled=false;
   
   select *
   from v1
   left outer join v2
   on key = a
   and key in (select col1 from v3);
   ```
   The join fails during predicate codegen:
   ```
23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
    at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
    at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
    at scala.collection.immutable.List.map(List.scala:293)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
    at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
    at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
    at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
    at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
    at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
    at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
    at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
    at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
   ```
   It fails again after fallback to interpreter mode:
   ```
   23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
   java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
IN subquery#34 has not finished
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
at 
org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
   ```
   Both the predicate codegen and the evaluation fail for the same reason: 
`PlanSubqueries` creates `InSubqueryExec` with `shouldBroadcast=false`. The 
driver waits for the subquery to finish, but it's the executor that uses the 
results of the subquery (for predicate codegen or evaluation). Because 
`shouldBroadcast` is set to false, the result is stored in a transient field 
(`InSubqueryExec#result`), so the result of the subquery is not serialized when 
the `InSubqueryExec` instance is sent to the executor.
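
   To make the failure mode concrete, here is a minimal, self-contained sketch 
(hypothetical classes, not Spark's actual ones) of how a `@transient` field is 
dropped during serialization, which is what happens to `InSubqueryExec#result` 
on its way to the executor:

   ```scala
   import java.io._

   // The driver computes `result`, but the field is transient, so it is not
   // written out when the object is serialized to an executor.
   case class ResultHolder(@transient result: Array[Int])

   object TransientDemo {
     def main(args: Array[String]): Unit = {
       val holder = ResultHolder(Array(3, 8)) // "subquery result" on the driver

       // Round-trip through Java serialization, like shipping a plan to an executor.
       val bos = new ByteArrayOutputStream()
       val oos = new ObjectOutputStream(bos)
       oos.writeObject(holder)
       oos.close()
       val copy = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
         .readObject().asInstanceOf[ResultHolder]

       assert(copy.result == null) // the transient field did not survive the trip
     }
   }
   ```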
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   



[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


HeartSaVioR commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1485834545

   Just added a dummy implementation.





[GitHub] [spark] dongjoon-hyun commented on pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #40568:
URL: https://github.com/apache/spark/pull/40568#issuecomment-1485827152

   According to the `Affected Version` in JIRA, I also agree with backporting 
to the applicable release branches.





[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-27 Thread via GitHub


HeartSaVioR commented on PR #40561:
URL: https://github.com/apache/spark/pull/40561#issuecomment-1485819017

   The error only occurred in the linter - it no longer allows a new PR to 
introduce a new public API "without adding it to spark-connect". This PR 
intentionally postpones addressing PySpark to a separate JIRA ticket, so 
addressing spark-connect should go there as well.





[GitHub] [spark] srowen commented on pull request #40568: [SPARK-42922][SQL]: Move from Random to SecureRandom

2023-03-27 Thread via GitHub


srowen commented on PR #40568:
URL: https://github.com/apache/spark/pull/40568#issuecomment-1485705422

   I think it's fine. These do look like better usages of RNGs. Let's see what 
tests say.





[GitHub] [spark] mridulm commented on pull request #40568: SPARK-42922: Move from Random to SecureRandom

2023-03-27 Thread via GitHub


mridulm commented on PR #40568:
URL: https://github.com/apache/spark/pull/40568#issuecomment-1485666995

   +CC @srowen





[GitHub] [spark] mridulm opened a new pull request, #40568: SPARK-42922: Move from Random to SecureRandom

2023-03-27 Thread via GitHub


mridulm opened a new pull request, #40568:
URL: https://github.com/apache/spark/pull/40568

   
   
   ### What changes were proposed in this pull request?
   
   Most uses of `Random` in Spark are either in test cases or in places where we 
need a repeatable pseudo-random number.
   Use `SecureRandom` instead of `Random` for the subset of cases where it helps; 
a sketch of the typical change is shown below.
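
   A minimal illustration of the pattern (hypothetical helper, not the PR's 
actual diff):

   ```scala
   import java.security.SecureRandom

   // Security-sensitive value, e.g. a generated secret: prefer a
   // cryptographically strong RNG over java.util.Random.
   def generateSecret(numBytes: Int = 16): String = {
     val rnd = new SecureRandom() // was: new java.util.Random()
     val bytes = new Array[Byte](numBytes)
     rnd.nextBytes(bytes)
     bytes.map("%02x".format(_)).mkString
   }
   ```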
   
   ### Why are the changes needed?
   
   Use of `SecureRandom` in more security sensitive contexts.
   This was flagged in our internal scans as well.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Directly no.
   Would improve security posture of Apache Spark.
   
   ### How was this patch tested?
   
   Existing unit tests





[GitHub] [spark] gengliangwang commented on pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`

2023-03-27 Thread via GitHub


gengliangwang commented on PR #40560:
URL: https://github.com/apache/spark/pull/40560#issuecomment-1485658926

   +1, @LuciferYang Thanks for the work!





[GitHub] [spark] anchovYu commented on pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate

2023-03-27 Thread via GitHub


anchovYu commented on PR #40558:
URL: https://github.com/apache/spark/pull/40558#issuecomment-1485580609

   @cloud-fan 





[GitHub] [spark] johanl-db commented on a diff in pull request #40545: [SPARK-42918] Generalize handling of metadata attributes in FileSourceStrategy

2023-03-27 Thread via GitHub


johanl-db commented on code in PR #40545:
URL: https://github.com/apache/spark/pull/40545#discussion_r1149566989


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##
@@ -519,6 +519,13 @@ object FileSourceMetadataAttribute {
   def cleanupFileSourceMetadataInformation(attr: Attribute): Attribute =
 attr.withMetadata(removeInternalMetadata(attr.metadata))
 
+  /**
+   * Cleanup the internal metadata information of a struct field, if it is

Review Comment:
   Done



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##
@@ -221,14 +225,23 @@ object FileFormat {
 FileSourceConstantMetadataStructField(FILE_MODIFICATION_TIME, 
TimestampType, nullable = false))
 
   /**
-   * Create a file metadata struct column containing fields supported by the 
given file format.
+   * All fields the file format's _metadata struct defines.

Review Comment:
   Done



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -322,18 +329,13 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // extra Project node: wrap flat metadata columns to a metadata struct
   val withMetadataProjections = metadataStructOpt.map { metadataStruct =>
-val structColumns = metadataColumns.map { col => col.name match {
-case FileFormat.FILE_PATH | FileFormat.FILE_NAME | 
FileFormat.FILE_SIZE |
- FileFormat.FILE_BLOCK_START | FileFormat.FILE_BLOCK_LENGTH |
- FileFormat.FILE_MODIFICATION_TIME =>
-  col
-case FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME =>
-  generatedMetadataColumns
-.find(_.name == FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME)
-// Change the `_tmp_metadata_row_index` to `row_index`,
-// and also change the nullability to not nullable,
-// which is consistent with the nullability of `row_index` 
field
-.get.withName(FileFormat.ROW_INDEX).withNullability(false)
+val structColumns = 
metadataStruct.dataType.asInstanceOf[StructType].fields.map { field =>
+  // Construct the metadata struct the query expects to see, using the 
columns we previously
+  // created. Be sure to restore the proper name and nullability for 
each metadata field.
+  metadataColumnsByName(field.name) match {

Review Comment:
   I simplified this code to always restore the name and nullability; I don't 
think it warrants a method anymore.
   I actually think it was not correct before, since you could have a 
generated column created with `nullable = true` on 
[L272](https://github.com/apache/spark/pull/40545/commits/2f5f9a26ec5cdf0c04893a18e2cfa386228e7aa2#diff-7e4f78e90e8699733afbe43e2b265b95e514896ac68f1fb9e60705d59a0b7ed9R272), 
and the nullability wouldn't be reset if its `name` and `internalName` were the same.



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -236,37 +247,41 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
   // For generated metadata columns, they are set as nullable when passed 
to readers,
   //  as the values will be null when trying to read the missing column 
from the file.
   //  They are then replaced by the actual values later in the process.
-  // All metadata columns will be non-null in the returned output.
-  // We then change the nullability to non-nullable in the metadata 
projection node below.
-  val constantMetadataColumns: mutable.Buffer[Attribute] = 
mutable.Buffer.empty
-  val generatedMetadataColumns: mutable.Buffer[Attribute] = 
mutable.Buffer.empty
+  // We then restore the specified nullability in the metadata projection 
node below.
+  // Also remember the attribute for each column name, so we can easily 
map back to it.
+  val constantMetadataColumns = mutable.Buffer.empty[Attribute]
+  val generatedMetadataColumns = mutable.Buffer.empty[Attribute]
+  val metadataColumnsByName = mutable.Map.empty[String, Attribute]
 
   metadataStructOpt.foreach { metadataStruct =>
-metadataStruct.dataType.asInstanceOf[StructType].fields.foreach { 
field =>
-  field.name match {
-case FileFormat.ROW_INDEX =>
-  if ((readDataColumns ++ 
partitionColumns).map(_.name.toLowerCase(Locale.ROOT))
-  .contains(FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME)) {
-throw new 
AnalysisException(FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME +
-  " is a reserved column name that cannot be read in 
combination with " +
-  s"${FileFormat.METADATA_NAME}.${FileFormat.ROW_INDEX} 
column.")
-  }
-  generatedMetadataColumns +=
-

[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #40462:
URL: https://github.com/apache/spark/pull/40462#issuecomment-1485539305

   A `HINT` is part of the `SELECT` clause.
   
https://github.com/apache/spark/blob/a3d9e0ae0f95a55766078da5d0bf0f74f3c3cfc3/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L530-L532
   
   That's the reason why `LIMIT` is the root of the query.
   ```
   scala> sql("SELECT /*+ REBALANCE */ * FROM t WHERE id > 1 LIMIT 
5").explain(true)
   == Parsed Logical Plan ==
   'GlobalLimit 5
   +- 'LocalLimit 5
  +- 'UnresolvedHint REBALANCE
 +- 'Project [*]
+- 'Filter ('id > 1)
   +- 'UnresolvedRelation [t], [], false
   ```





[GitHub] [spark] LuciferYang commented on pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`

2023-03-27 Thread via GitHub


LuciferYang commented on PR #40566:
URL: https://github.com/apache/spark/pull/40566#issuecomment-1485485240

   Thanks very much @dongjoon-hyun 





[GitHub] [spark] LuciferYang commented on pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`

2023-03-27 Thread via GitHub


LuciferYang commented on PR #40560:
URL: https://github.com/apache/spark/pull/40560#issuecomment-1485482919

   Thanks @dongjoon-hyun @HyukjinKwon 





[GitHub] [spark] dongjoon-hyun closed pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`

2023-03-27 Thread via GitHub


dongjoon-hyun closed pull request #40566: [SPARK-42934][BUILD] Add 
`spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`
URL: https://github.com/apache/spark/pull/40566





[GitHub] [spark] dongjoon-hyun commented on pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #40566:
URL: https://github.com/apache/spark/pull/40566#issuecomment-1485469907

   I verified this manually via Maven. Merged to master/3.4/3.3/3.2.





[GitHub] [spark] dongjoon-hyun closed pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`

2023-03-27 Thread via GitHub


dongjoon-hyun closed pull request #40560: [SPARK-42930][CORE][SQL] Change the 
access scope of `ProtobufSerDe` related implementations to `private[protobuf]`
URL: https://github.com/apache/spark/pull/40560





[GitHub] [spark] LuciferYang commented on a diff in pull request #40566: [SPARK-42934][BUILD] Move test property `spark.hadoop.hadoop.security.key.provider.path` from `maven-surefire-plugin` to `scala

2023-03-27 Thread via GitHub


LuciferYang commented on code in PR #40566:
URL: https://github.com/apache/spark/pull/40566#discussion_r1149480752


##
pom.xml:
##
@@ -2970,7 +2970,6 @@
-  <spark.hadoop.hadoop.security.key.provider.path>test:///</spark.hadoop.hadoop.security.key.provider.path>

Review Comment:
   OK ~






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40566: [SPARK-42934][BUILD] Move test property `spark.hadoop.hadoop.security.key.provider.path` from `maven-surefire-plugin` to `sca

2023-03-27 Thread via GitHub


dongjoon-hyun commented on code in PR #40566:
URL: https://github.com/apache/spark/pull/40566#discussion_r1149478852


##
pom.xml:
##
@@ -2970,7 +2970,6 @@
-  <spark.hadoop.hadoop.security.key.provider.path>test:///</spark.hadoop.hadoop.security.key.provider.path>

Review Comment:
   To be safe, shall we keep this, @LuciferYang ? We have `spark.ui.enabled` in 
both places.






[GitHub] [spark] dongjoon-hyun commented on pull request #40566: [SPARK-42934][BUILD] Move test property `spark.hadoop.hadoop.security.key.provider.path` from `maven-surefire-plugin` to `scalatest-mav

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #40566:
URL: https://github.com/apache/spark/pull/40566#issuecomment-1485412215

   Thank you for pinging me, @LuciferYang .





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side

2023-03-27 Thread via GitHub


dongjoon-hyun commented on code in PR #40543:
URL: https://github.com/apache/spark/pull/40543#discussion_r1149467878


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -86,6 +86,8 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
   s"jdbc:oracle:thin:system/$oracle_password@//$ip:$port/xe"
   }
 
+  override val typeMapping: Map[DataType, DataType] = Map(StringType -> 
VarcharType(255))

Review Comment:
   Is `255` a DB-specific value? Where did this `255` come from? In 
MySQLIntegrationSuite, it seems to be `Int.MaxValue`.






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side

2023-03-27 Thread via GitHub


dongjoon-hyun commented on code in PR #40543:
URL: https://github.com/apache/spark/pull/40543#discussion_r1149465747


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -128,7 +128,7 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "('abcd', 'efgh', 'ijkl', 'mnop', 'q')").executeUpdate()
 
 conn.prepareStatement("CREATE TABLE char_array_types (" +
-  "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 
varchar(4)[], c4 bpchar[])"
+  "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 
varchar(4)[], c4 bpchar(1)[])"

Review Comment:
   Since these are non-trivial findings, could you add some comments please, 
@yaooqinn?






[GitHub] [spark] dongjoon-hyun commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5

2023-03-27 Thread via GitHub


dongjoon-hyun commented on PR #39124:
URL: https://github.com/apache/spark/pull/39124#issuecomment-1485388919

   Ya, right. I forgot to say that. Thank you so much, @steveloughran and 
@sunchao too.  





[GitHub] [spark] LuciferYang commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5

2023-03-27 Thread via GitHub


LuciferYang commented on PR #39124:
URL: https://github.com/apache/spark/pull/39124#issuecomment-1485382860

   Thanks @dongjoon-hyun @sunchao @steveloughran 





[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1149407329


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -297,8 +297,14 @@ abstract class InMemoryBaseTable(
   InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, 
tableSchema)
 
 override def pruneColumns(requiredSchema: StructType): Unit = {
-  val schemaNames = metadataColumnNames ++ tableSchema.map(_.name)
-  schema = StructType(requiredSchema.filter(f => 
schemaNames.contains(f.name)))
+  // The required schema could contain conflict-renamed metadata columns, 
so we need to match
+  // them by their logical (original) names, not their current names.
+  val schemaNames = tableSchema.map(_.name).toSet
+  val prunedFields = requiredSchema.filter {

Review Comment:
   As far as I can tell, v2 intentionally puts the burden of pruning (and of 
metadata handling) on the datasource. So by design, every datasource _does_ 
have to implement the policy it wants to use. 
   
   But that's not necessarily a bad thing -- a datasource that chooses not to 
implement this change simply won't gain the ability to rename metadata columns 
on conflict.






[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1149407329


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -297,8 +297,14 @@ abstract class InMemoryBaseTable(
   InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, 
tableSchema)
 
 override def pruneColumns(requiredSchema: StructType): Unit = {
-  val schemaNames = metadataColumnNames ++ tableSchema.map(_.name)
-  schema = StructType(requiredSchema.filter(f => 
schemaNames.contains(f.name)))
+  // The required schema could contain conflict-renamed metadata columns, 
so we need to match
+  // them by their logical (original) names, not their current names.
+  val schemaNames = tableSchema.map(_.name).toSet
+  val prunedFields = requiredSchema.filter {

Review Comment:
   As far as I can tell, v2 intentionally puts the burden of pruning (and of 
metadata handling) on the datasource. So every datasource _does_ have to 
implement the policy it wants to use. 
   
   But that's not necessarily a bad thing -- a datasource that chooses not to 
implement this change simply won't gain the ability to rename metadata columns 
on conflict.






[GitHub] [spark] dongjoon-hyun closed pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5

2023-03-27 Thread via GitHub


dongjoon-hyun closed pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop 
to 3.3.5
URL: https://github.com/apache/spark/pull/39124





[GitHub] [spark] clownxc commented on a diff in pull request #40400: [SPARK-41359][SQL] Use `PhysicalDataType` instead of DataType in UnsafeRow

2023-03-27 Thread via GitHub


clownxc commented on code in PR #40400:
URL: https://github.com/apache/spark/pull/40400#discussion_r1149359986


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java:
##
@@ -70,51 +66,25 @@ public static int calculateBitSetWidthInBytes(int 
numFields) {
 return ((numFields + 63)/ 64) * 8;
   }
 
-  /**
-   * Field types that can be updated in place in UnsafeRows (e.g. we support 
set() for these types)
-   */
-  public static final Set<DataType> mutableFieldTypes;
-
-  // DecimalType, DayTimeIntervalType and YearMonthIntervalType are also 
mutable
-  static {
-mutableFieldTypes = Collections.unmodifiableSet(
-  new HashSet<>(
-Arrays.asList(
-  NullType,
-  BooleanType,
-  ByteType,
-  ShortType,
-  IntegerType,
-  LongType,
-  FloatType,
-  DoubleType,
-  DateType,
-  TimestampType,
-  TimestampNTZType
-)));
-  }
-
   public static boolean isFixedLength(DataType dt) {

Review Comment:
   OK, let me do it.






[GitHub] [spark] LuciferYang commented on a diff in pull request #40563: [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor `array_append` and `array_prepend` with `RuntimeReplaceable`

2023-03-27 Thread via GitHub


LuciferYang commented on code in PR #40563:
URL: https://github.com/apache/spark/pull/40563#discussion_r1149348790


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala:
##
@@ -1400,120 +1400,27 @@ case class ArrayContains(left: Expression, right: 
Expression)
 copy(left = newLeft, right = newRight)
 }
 
-// scalastyle:off line.size.limit
-@ExpressionDescription(
-  usage = """
-  _FUNC_(array, element) - Add the element at the beginning of the array 
passed as first
-  argument. Type of element should be the same as the type of the elements 
of the array.
-  Null element is also prepended to the array. But if the array passed is 
NULL
-  output is NULL
-""",
-  examples = """
-Examples:
-  > SELECT _FUNC_(array('b', 'd', 'c', 'a'), 'd');
-   ["d","b","d","c","a"]
-  > SELECT _FUNC_(array(1, 2, 3, null), null);
-   [null,1,2,3,null]
-  > SELECT _FUNC_(CAST(null as Array<Int>), 2);
-   NULL
-  """,
-  group = "array_funcs",
-  since = "3.5.0")
-case class ArrayPrepend(left: Expression, right: Expression)
-  extends BinaryExpression
-with ImplicitCastInputTypes
-with ComplexTypeMergingExpression
-with QueryErrorsBase {
+trait ArrayInsertEnd extends RuntimeReplaceable

Review Comment:
   Should we rename this trait? `ArrayInsertEnd` does not describe `ArrayPrepend` 
well.
   
   






[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1149347950


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -297,8 +297,14 @@ abstract class InMemoryBaseTable(
   InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, 
tableSchema)
 
 override def pruneColumns(requiredSchema: StructType): Unit = {
-  val schemaNames = metadataColumnNames ++ tableSchema.map(_.name)
-  schema = StructType(requiredSchema.filter(f => 
schemaNames.contains(f.name)))
+  // The required schema could contain conflict-renamed metadata columns, 
so we need to match
+  // them by their logical (original) names, not their current names.
+  val schemaNames = tableSchema.map(_.name).toSet
+  val prunedFields = requiredSchema.filter {

Review Comment:
   Shall we move this handling to `V2ScanRelationPushDown`? The in-memory table 
is just a testing v2 source, and we can't expect all existing v2 sources to make 
the same change as done here.






[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1149336751


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -297,8 +297,14 @@ abstract class InMemoryBaseTable(
   InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, 
tableSchema)
 
 override def pruneColumns(requiredSchema: StructType): Unit = {
-  val schemaNames = metadataColumnNames ++ tableSchema.map(_.name)
-  schema = StructType(requiredSchema.filter(f => 
schemaNames.contains(f.name)))
+  // The required schema could contain conflict-renamed metadata columns, 
so we need to match
+  // them by their logical (original) names, not their current names.
+  val schemaNames = tableSchema.map(_.name).toSet
+  val prunedFields = requiredSchema.filter {

Review Comment:
   Indeed the latter -- that's why this code needs to search for metadata columns 
by their logical names, and also why `createReaderFactory` needs to map them 
back to their logical names (because the internal code only knows about the 
logical names).






[GitHub] [spark] cloud-fan commented on pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


cloud-fan commented on PR #40300:
URL: https://github.com/apache/spark/pull/40300#issuecomment-1485216976

   Regarding https://github.com/apache/spark/pull/40300/files#r1129818813: I think 
if `SubqueryAlias` can't propagate metadata columns, then `df.metadataColumn` 
should not be able to get the column. What do you think, 
@ryan-johnson-databricks?





[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-27 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1149329725


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -297,8 +297,14 @@ abstract class InMemoryBaseTable(
   InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, 
tableSchema)
 
 override def pruneColumns(requiredSchema: StructType): Unit = {
-  val schemaNames = metadataColumnNames ++ tableSchema.map(_.name)
-  schema = StructType(requiredSchema.filter(f => 
schemaNames.contains(f.name)))
+  // The required schema could contain conflict-renamed metadata columns, 
so we need to match
+  // them by their logical (original) names, not their current names.
+  val schemaNames = tableSchema.map(_.name).toSet
+  val prunedFields = requiredSchema.filter {

Review Comment:
   Does `requiredSchema` contain the logical names or the physical names? I think 
it's the latter, as `V2ScanRelationPushDown` does not handle metadata column 
name mapping.






[GitHub] [spark] clownxc commented on a diff in pull request #40400: [SPARK-41359][SQL] Use `PhysicalDataType` instead of DataType in UnsafeRow

2023-03-27 Thread via GitHub


clownxc commented on code in PR #40400:
URL: https://github.com/apache/spark/pull/40400#discussion_r1149320479


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala:
##
@@ -19,48 +19,83 @@ package org.apache.spark.sql.catalyst.types
 
 import org.apache.spark.sql.types._
 
-sealed abstract class PhysicalDataType
-
-case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) 
extends PhysicalDataType
-
-class PhysicalBinaryType() extends PhysicalDataType
+sealed abstract class PhysicalDataType{
+  private[sql] def isPrimitive: Boolean
+}
+
+case class PhysicalArrayType(elementType: DataType, containsNull: Boolean)
+  extends PhysicalDataType {
+  override def isPrimitive: Boolean = false
+}
+
+class PhysicalBinaryType() extends PhysicalDataType {
+  override def isPrimitive: Boolean = false
+}
 case object PhysicalBinaryType extends PhysicalBinaryType
 
-class PhysicalBooleanType() extends PhysicalDataType
+class PhysicalBooleanType() extends PhysicalDataType {
+  override def isPrimitive: Boolean = true

Review Comment:
   Thank you for the code review comments. I will try to modify the code as 
you suggest.






[GitHub] [spark] zhmin opened a new pull request, #40567: [SPARK-42935] [SQL] Add union required distribution push down

2023-03-27 Thread via GitHub


zhmin opened a new pull request, #40567:
URL: https://github.com/apache/spark/pull/40567

   
   
   ### What changes were proposed in this pull request?
   We introduce a new idea to optimize the exchange plan when a union plan's 
output partitioning can't match the parent plan's required distribution; a 
sketch of the idea follows the list below.
   
   1. First, introduce a new RDD consisting of parent RDDs that all have the same 
number of partitions. The ith partition of the new RDD corresponds to the ith 
partition of each parent RDD.
   2. Then push the required distribution down to the union plan's children. If 
any child's output partitioning matches the required distribution, we can remove 
that child's shuffle operation.
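
   A minimal sketch of the union-zip idea (the names `UnionZipRDD` and 
`UnionZipPartition` here are illustrative assumptions; the PR's actual 
implementation may differ):

   ```scala
   import scala.reflect.ClassTag

   import org.apache.spark.{OneToOneDependency, Partition, SparkContext, TaskContext}
   import org.apache.spark.rdd.RDD

   // Groups the i-th partition of every parent RDD into one output partition.
   case class UnionZipPartition(index: Int, parents: Seq[Partition]) extends Partition

   // All parents must have the same number of partitions; partition i of this RDD
   // concatenates partition i of each parent, so no extra shuffle is introduced.
   class UnionZipRDD[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]])
     extends RDD[T](sc, rdds.map(new OneToOneDependency(_))) {

     require(rdds.map(_.getNumPartitions).distinct.size == 1,
       "all parent RDDs must have the same number of partitions")

     override def getPartitions: Array[Partition] =
       Array.tabulate[Partition](rdds.head.getNumPartitions) { i =>
         UnionZipPartition(i, rdds.map(_.partitions(i)))
       }

     override def compute(split: Partition, context: TaskContext): Iterator[T] = {
       val p = split.asInstanceOf[UnionZipPartition]
       rdds.iterator.zip(p.parents.iterator).flatMap {
         case (rdd, part) => rdd.iterator(part, context)
       }
     }
   }
   ```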
   
   
   ### Why are the changes needed?
   The union plan does not take full advantage of its children's output 
partitionings when the output partitioning can't match the parent plan's 
required distribution.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   org.apache.spark.sql.execution.PlannerSuite
   org.apache.spark.sql.execution.exchange.UnionZipRDDSuite
   





[GitHub] [spark] LuciferYang commented on pull request #40566: [SPARK-42934][SQL][TESTS] Move `spark.hadoop.hadoop.security.key.provider.path` from `systemPropertyVariables` of `maven-surefire-plugin`

2023-03-27 Thread via GitHub


LuciferYang commented on PR #40566:
URL: https://github.com/apache/spark/pull/40566#issuecomment-1485155792

   cc @dongjoon-hyun FYI





[GitHub] [spark] LuciferYang opened a new pull request, #40566: [SPARK-42934][SQL][TESTS] Move `spark.hadoop.hadoop.security.key.provider.path` from `systemPropertyVariables` of `maven-surefire-plugin

2023-03-27 Thread via GitHub


LuciferYang opened a new pull request, #40566:
URL: https://github.com/apache/spark/pull/40566

   ### What changes were proposed in this pull request?
   When testing `OrcEncryptionSuite` using Maven, all test suites are always 
skipped. So this PR moves `spark.hadoop.hadoop.security.key.provider.path` from 
`systemPropertyVariables` of `maven-surefire-plugin` to `systemProperties` of 
`scalatest-maven-plugin` so that `OrcEncryptionSuite` can be tested with Maven.
   
   ### Why are the changes needed?
   Make `OrcEncryptionSuite` testable with Maven.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, just for Maven tests.
   
   
   ### How was this patch tested?
   
   - Pass GitHub Actions
   - Manual testing:
   
   run

   ```
   build/mvn clean install -pl sql/core -DskipTests -am
   build/mvn test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite
   ```
   
   **Before**
   
   ```
   Discovery starting.
   Discovery completed in 3 seconds, 218 milliseconds.
   Run starting. Expected test count is: 4
   OrcEncryptionSuite:
   21:57:58.344 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
   - Write and read an encrypted file !!! CANCELED !!!
 [] was empty org.apache.orc.impl.NullKeyProvider@5af5d76f doesn't has the 
test keys. ORC shim is created with old Hadoop libraries 
(OrcEncryptionSuite.scala:37)
   - Write and read an encrypted table !!! CANCELED !!!
 [] was empty org.apache.orc.impl.NullKeyProvider@5ad6cc21 doesn't has the 
test keys. ORC shim is created with old Hadoop libraries 
(OrcEncryptionSuite.scala:65)
   - SPARK-35325: Write and read encrypted nested columns !!! CANCELED !!!
 [] was empty org.apache.orc.impl.NullKeyProvider@691124ee doesn't has the 
test keys. ORC shim is created with old Hadoop libraries 
(OrcEncryptionSuite.scala:116)
   - SPARK-35992: Write and read fully-encrypted columns with default masking 
!!! CANCELED !!!
 [] was empty org.apache.orc.impl.NullKeyProvider@5403799b doesn't has the 
test keys. ORC shim is created with old Hadoop libraries 
(OrcEncryptionSuite.scala:166)
   21:58:00.035 WARN 
org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: 
   
   = POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 
(daemon=true), shuffle-boss-6-1 (daemon=true) =
   
   Run completed in 5 seconds, 41 milliseconds.
   Total number of tests run: 0
   Suites: completed 2, aborted 0
   Tests: succeeded 0, failed 0, canceled 4, ignored 0, pending 0
   No tests were executed.
   ```
   
   **After**
   
   ```
   Discovery starting.
   Discovery completed in 3 seconds, 185 milliseconds.
   Run starting. Expected test count is: 4
   OrcEncryptionSuite:
   21:58:46.540 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
   - Write and read an encrypted file
   - Write and read an encrypted table
   - SPARK-35325: Write and read encrypted nested columns
   - SPARK-35992: Write and read fully-encrypted columns with default masking
   21:58:51.933 WARN 
org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: 
   
   = POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 
(daemon=true), shuffle-boss-6-1 (daemon=true) =
   
   Run completed in 8 seconds, 708 milliseconds.
   Total number of tests run: 4
   Suites: completed 2, aborted 0
   Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ```





[GitHub] [spark] MaxGekk opened a new pull request, #40565: [WIP][SPARK-42873][SQL] Define Spark SQL types as keywords

2023-03-27 Thread via GitHub


MaxGekk opened a new pull request, #40565:
URL: https://github.com/apache/spark/pull/40565

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] wangyum commented on a diff in pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.12.4

2023-03-27 Thread via GitHub


wangyum commented on code in PR #40555:
URL: https://github.com/apache/spark/pull/40555#discussion_r1149215476


##
pom.xml:
##
@@ -325,6 +325,17 @@
+    <repository>
+      <id>staging-releases-mirror</id>
+      <url>https://repository.apache.org/content/repositories/staging/</url>
+    </repository>

Review Comment:
   Maven needs this to download Parquet 1.12.4.






[GitHub] [spark] wangyum commented on a diff in pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.12.4

2023-03-27 Thread via GitHub


wangyum commented on code in PR #40555:
URL: https://github.com/apache/spark/pull/40555#discussion_r1149214805


##
project/SparkBuild.scala:
##
@@ -307,7 +307,9 @@ object SparkBuild extends PomBuild {
   DefaultMavenRepository,
   Resolver.mavenLocal,
   Resolver.file("ivyLocal", file(Path.userHome.absolutePath + 
"/.ivy2/local"))(Resolver.ivyStylePatterns)
-),
+) ++ Seq(
+  "staging-releases-mirror" at 
"https://repository.apache.org/content/repositories/staging/;,

Review Comment:
   sbt needs this to download Parquet 1.12.4.






[GitHub] [spark] LuciferYang commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5

2023-03-27 Thread via GitHub


LuciferYang commented on PR #39124:
URL: https://github.com/apache/spark/pull/39124#issuecomment-1485027282

   > what version of jettison has come in from hadoop-common?
   > 
   > HADOOP-18676 has gone in this weekend to exclude transitive jettison 
dependencies which don't get into a hadoop tarball, but will come in from pom 
imports.
   
   1.5.3





[GitHub] [spark] beliefer commented on pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-27 Thread via GitHub


beliefer commented on PR #40355:
URL: https://github.com/apache/spark/pull/40355#issuecomment-1484986064

   ping @hvanhovell 





[GitHub] [spark] beliefer commented on pull request #40528: [SPARK-42584][CONNECT] Improve output of Column.explain

2023-03-27 Thread via GitHub


beliefer commented on PR #40528:
URL: https://github.com/apache/spark/pull/40528#issuecomment-1484985770

   ping @hvanhovell 





[GitHub] [spark] steveloughran commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5

2023-03-27 Thread via GitHub


steveloughran commented on PR #39124:
URL: https://github.com/apache/spark/pull/39124#issuecomment-1484958857

   what version of jettison has come in from hadoop-common?
   
   HADOOP-18676 has gone in this weekend to exclude transitive jettison 
dependencies which don't get into a hadoop tarball, but will come in from pom 
imports.
   





[GitHub] [spark] jaceklaskowski commented on a diff in pull request #39907: [SPARK-42359][SQL] Support row skipping when reading CSV files

2023-03-27 Thread via GitHub


jaceklaskowski commented on code in PR #39907:
URL: https://github.com/apache/spark/pull/39907#discussion_r1149152549


##
docs/sql-data-sources-csv.md:
##
@@ -102,6 +102,12 @@ Data source options of CSV can be set via:
     <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
+  <tr>
+    <td><code>skipLines</code></td>
+    <td>0</td>
+    <td>Sets a number of non-empty, uncommented lines to skip before parsing each of the CSV files. If the <code>header</code> option is set to <code>true</code>, the first line after the number of <code>skipLines</code> will be taken as the header.</td>
+  </tr>

Review Comment:
   nit: s/a number/the number + "before parsing CSV files"



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala:
##
@@ -25,21 +25,33 @@ import org.apache.spark.sql.functions._
 
 object CSVUtils {
   /**
-   * Filter ignorable rows for CSV dataset (lines empty and starting with 
`comment`).
-   * This is currently being used in CSV schema inference.
+   * Filter blank lines, remove comments, and skip specified rows from a CSV 
iterator. Then blank
+   * entries (and comments if set) are removed. This is currently being used 
in CSV schema
+   * inference.
*/
-  def filterCommentAndEmpty(lines: Dataset[String], options: CSVOptions): 
Dataset[String] = {
+  def filterUnwantedLines(lines: Dataset[String], options: CSVOptions): 
Dataset[String] = {
 // Note that this was separately made by SPARK-18362. Logically, this 
should be the same
-// with the one below, `filterCommentAndEmpty` but execution path is 
different. One of them
+// with the one below, `filterUnwantedLines` but execution path is 
different. One of them
 // might have to be removed in the near future if possible.
 import lines.sqlContext.implicits._
 val aliased = lines.toDF("value")
 val nonEmptyLines = aliased.filter(length(trim($"value")) > 0)
-if (options.isCommentSet) {
-  
nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String]
-} else {
-  nonEmptyLines.as[String]
+val commentFilteredLines = {
+  if (options.isCommentSet) {
+
nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String]
+  } else {
+nonEmptyLines.as[String]
+  }
 }
+// Note that unlike actual CSV reading path, it simply filters the given 
skipped lines.
+// Therefore, this skips the line same with the skipped lines if exists.
+val linesToSkip = commentFilteredLines.head(options.skipLines)
+commentFilteredLines.rdd

Review Comment:
   Is `.rdd` required here? `Dataset.mapPartitions` should give us what we 
want, shouldn't it?
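
   A minimal sketch of what that could look like on the Dataset API (assuming 
`commentFilteredLines: Dataset[String]` and a `skipLines: Int`, as in the 
diff):
   
   ```
   import org.apache.spark.sql.{Dataset, Encoders}
   
   def dropSkippedLines(commentFilteredLines: Dataset[String], skipLines: Int): Dataset[String] = {
     // the first `skipLines` surviving lines, collected to the driver as in the PR
     val linesToSkip = commentFilteredLines.head(skipLines)
     // Dataset.mapPartitions avoids the Dataset -> RDD -> Dataset round-trip
     commentFilteredLines.mapPartitions(_.filterNot(linesToSkip.contains))(Encoders.STRING)
   }
   ```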



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala:
##
@@ -25,21 +25,33 @@ import org.apache.spark.sql.functions._
 
 object CSVUtils {
   /**
-   * Filter ignorable rows for CSV dataset (lines empty and starting with 
`comment`).
-   * This is currently being used in CSV schema inference.
+   * Filter blank lines, remove comments, and skip specified rows from a CSV 
iterator. Then blank
+   * entries (and comments if set) are removed. This is currently being used 
in CSV schema
+   * inference.
*/
-  def filterCommentAndEmpty(lines: Dataset[String], options: CSVOptions): 
Dataset[String] = {
+  def filterUnwantedLines(lines: Dataset[String], options: CSVOptions): 
Dataset[String] = {
 // Note that this was separately made by SPARK-18362. Logically, this 
should be the same
-// with the one below, `filterCommentAndEmpty` but execution path is 
different. One of them
+// with the one below, `filterUnwantedLines` but execution path is 
different. One of them
 // might have to be removed in the near future if possible.
 import lines.sqlContext.implicits._
 val aliased = lines.toDF("value")
 val nonEmptyLines = aliased.filter(length(trim($"value")) > 0)
-if (options.isCommentSet) {
-  
nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String]
-} else {
-  nonEmptyLines.as[String]
+val commentFilteredLines = {
+  if (options.isCommentSet) {
+
nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String]
+  } else {
+nonEmptyLines.as[String]
+  }
 }
+// Note that unlike actual CSV reading path, it simply filters the given 
skipped lines.
+// Therefore, this skips the line same with the skipped lines if exists.
+val linesToSkip = commentFilteredLines.head(options.skipLines)
+commentFilteredLines.rdd
+  .mapPartitions { iter =>
+iter.filterNot(linesToSkip.contains(_))

Review Comment:
   `contains` should be enough. Moreover, the following should work too:
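
   A plausible shape for that suggestion (an assumption, not necessarily the 
reviewer's exact snippet) is the bare method reference, which eta-expands to a 
`String => Boolean`:
   
   ```
   iter.filterNot(linesToSkip.contains)
   ```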
   

[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40474: [SPARK-42849] [WIP] [SQL] Session Variables

2023-03-27 Thread via GitHub


jaceklaskowski commented on code in PR #40474:
URL: https://github.com/apache/spark/pull/40474#discussion_r1149130887


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala:
##
@@ -582,9 +583,29 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   override def visitSetVariable(ctx: SetVariableContext): LogicalPlan = 
withOrigin(ctx) {
-val variableName = visitMultipartIdentifier(ctx.multipartIdentifier)
-val expression = ctx.expression().toString
-SetVariableCommand(VariableIdentifier(variableName.last), expression)
+if (ctx.assignmentList().isEmpty) {
+  SetVariableCommand(Seq(VariableIdentifier("var")), "7")

Review Comment:
   7?! What does this "magic number" represent?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala:
##
@@ -582,9 +583,29 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   override def visitSetVariable(ctx: SetVariableContext): LogicalPlan = 
withOrigin(ctx) {
-val variableName = visitMultipartIdentifier(ctx.multipartIdentifier)
-val expression = ctx.expression().toString
-SetVariableCommand(VariableIdentifier(variableName.last), expression)
+if (ctx.assignmentList().isEmpty) {
+  SetVariableCommand(Seq(VariableIdentifier("var")), "7")
+} else {
+  val assignCtx = ctx.assignmentList()
+  val varList = assignCtx.assignment().asScala.map {
+assign => {

Review Comment:
   nit: no need for `{`. I'd also move `assign` to the line above (as more 
Scala-idiomatic IMHO)
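
   I.e., a sketch of the suggested style, using the names from the diff:
   
   ```
   val varList = assignCtx.assignment().asScala.map { assign =>
     val varname = visitMultipartIdentifier(assign.key)
     VariableIdentifier(varname.head)
   }
   ```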



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/variables.scala:
##
@@ -119,13 +116,45 @@ case class DropVariableCommand(
 /**
  * This command is for setting a SQL variable.
  */
-case class SetVariableCommand(identifier: VariableIdentifier,
-  newExpr: String) extends LeafRunnableCommand {
+case class SetVariableCommand(identifierList: Seq[VariableIdentifier],
+  newQueryStr: String) extends LeafRunnableCommand 
{
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
+val parser = sparkSession.sessionState.sqlParser
+val analyzer = sparkSession.sessionState.analyzer
 val catalog = sparkSession.sessionState.catalog
-assert(identifier.database.isEmpty)
-val varInfo = catalog.getTempVariable(identifier.variableName)
+val varInfoList = identifierList.collect { case identifier: 
VariableIdentifier =>

Review Comment:
   nit: `varInfos`



##
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala:
##
@@ -582,9 +583,29 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   override def visitSetVariable(ctx: SetVariableContext): LogicalPlan = 
withOrigin(ctx) {
-val variableName = visitMultipartIdentifier(ctx.multipartIdentifier)
-val expression = ctx.expression().toString
-SetVariableCommand(VariableIdentifier(variableName.last), expression)
+if (ctx.assignmentList().isEmpty) {
+  SetVariableCommand(Seq(VariableIdentifier("var")), "7")
+} else {
+  val assignCtx = ctx.assignmentList()
+  val varList = assignCtx.assignment().asScala.map {
+assign => {
+  val varname = visitMultipartIdentifier(assign.key)
+  VariableIdentifier(varname.head)
+}
+  }
+
+  val assignments = assignCtx.assignment()
+  val exprStrList = assignments.asScala.map {
+assign => {

Review Comment:
   nit: no need for `{`. I'd also move `assign` to the line above (as more 
Scala-idiomatic IMHO)



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/variables.scala:
##
@@ -119,13 +116,45 @@ case class DropVariableCommand(
 /**
  * This command is for setting a SQL variable.
  */
-case class SetVariableCommand(identifier: VariableIdentifier,
-  newExpr: String) extends LeafRunnableCommand {
+case class SetVariableCommand(identifierList: Seq[VariableIdentifier],
+  newQueryStr: String) extends LeafRunnableCommand 
{
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
+val parser = sparkSession.sessionState.sqlParser
+val analyzer = sparkSession.sessionState.analyzer
 val catalog = sparkSession.sessionState.catalog
-assert(identifier.database.isEmpty)
-val varInfo = catalog.getTempVariable(identifier.variableName)
+val varInfoList = identifierList.collect { case identifier: 
VariableIdentifier =>
+  assert(identifier.database.isEmpty)
+  val varInfo = catalog.getTempVariable(identifier.variableName)
+  if (varInfo.isEmpty) {
+throw new NoSuchTempVariableException(identifier.variableName)
+  }
+  (identifier.variableName, varInfo.get._2)
+}
+
+val varNames = varInfoList.collect { case variable => variable._1 }
+val casts = varInfoList.collect {

Review Comment:
   nit: `map` (not `collect`)
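
   E.g., with the names from the diff (a total pattern never filters anything 
out, so `map` states the intent directly):
   
   ```
   val varNames = varInfoList.map(_._1)
   ```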




[GitHub] [spark] Hisoka-X opened a new pull request, #40564: [SPARK-42519] [Test] [Connect] Add more WriteTo tests after Scala Client session config is supported

2023-03-27 Thread via GitHub


Hisoka-X opened a new pull request, #40564:
URL: https://github.com/apache/spark/pull/40564

   
   
   ### What changes were proposed in this pull request?
   Add more WriteTo tests for Spark Connect Client
   
   
   
   ### Why are the changes needed?
   Improve test coverage and remove some TODOs
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   
   ### How was this patch tested?
   New tests added
   
   





[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.12.4

2023-03-27 Thread via GitHub


jaceklaskowski commented on code in PR #40555:
URL: https://github.com/apache/spark/pull/40555#discussion_r1149120335


##
project/SparkBuild.scala:
##
@@ -307,7 +307,9 @@ object SparkBuild extends PomBuild {
   DefaultMavenRepository,
   Resolver.mavenLocal,
   Resolver.file("ivyLocal", file(Path.userHome.absolutePath + 
"/.ivy2/local"))(Resolver.ivyStylePatterns)
-),
+) ++ Seq(
+  "staging-releases-mirror" at 
"https://repository.apache.org/content/repositories/staging/;,

Review Comment:
   Why is this required? Looks like a temporary change to me (to test out the 
change before parquet is widely available in the public repo).



##
pom.xml:
##
@@ -325,6 +325,17 @@
 
   
   
+

Review Comment:
   Why is this required?



##
pom.xml:
##
@@ -2361,7 +2372,7 @@
 ${hive.group}
 hive-service-rpc
   
-  
+  

Review Comment:
   Rather than "upgrading" the version in the comment (that does not really say 
much) I'd reword it to the following (and get rid of the version):
   
   ```
   parquet-hadoop-bundle:1.8.1 conflicts with [the name of the artifact it 
conflicts with]
   ```





