[GitHub] [spark] ulysses-you opened a new pull request, #40574: [SPARK-42942][SQL] Support coalesce table cache stage partitions
ulysses-you opened a new pull request, #40574: URL: https://github.com/apache/spark/pull/40574

### What changes were proposed in this pull request?
Add a new rule `CoalesceCachePartitions` to support coalescing partitions with `TableCacheQueryStageExec`. To reuse the code path of `CoalesceShufflePartitions`, this PR also does a small refactor of how we coalesce partitions. The RDD cache uses the RDD id and partition id as the block id, so it does not seem possible to split skewed partitions the way shuffle does. To reduce complexity, this PR does not coalesce partitions across both shuffle and cache stages, since the shuffle read may contain skewed partition specs. For example, the following case can be coalesced by neither `CoalesceCachePartitions` nor `CoalesceShufflePartitions`:
```
SMJ
  ShuffleQueryStage
  TableCacheStage
```

### Why are the changes needed?
Make AQE support coalescing table cache stage partitions.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new config to control whether partitions are coalesced for table cache stages.

### How was this patch tested?
Added tests.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
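The coalescing strategy AQE applies here can be illustrated with a plain-Python sketch (this is not Spark's actual implementation; the function name and target-size parameter are hypothetical). Adjacent partitions are greedily grouped until an advisory target size is reached; note that, per the PR description, cached partitions can only be merged, never split for skew:

```python
def coalesce_partitions(sizes, target_size):
    """Greedily group adjacent partition sizes so each group's total stays at
    or below target_size (an oversized partition forms its own group)."""
    groups = []                    # list of lists of original partition indices
    current, current_size = [], 0
    for idx, size in enumerate(sizes):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Adjacent small partitions merge; the oversized partition 3 stays alone.
print(coalesce_partitions([10, 10, 10, 50, 5, 5], 30))  # [[0, 1, 2], [3], [4, 5]]
```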
[GitHub] [spark] HyukjinKwon closed pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized
HyukjinKwon closed pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized URL: https://github.com/apache/spark/pull/40534
[GitHub] [spark] HyukjinKwon commented on pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized
HyukjinKwon commented on PR #40534: URL: https://github.com/apache/spark/pull/40534#issuecomment-1486246281 Merged to master and branch-3.4.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40536: [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions
HyukjinKwon commented on code in PR #40536: URL: https://github.com/apache/spark/pull/40536#discussion_r1150045902

## python/pyspark/sql/connect/client.py:
```
@@ -997,10 +1000,32 @@ def config(self, operation: pb2.ConfigRequest.Operation) -> ConfigResult:
                 )
                 return ConfigResult.fromProto(resp)
             raise SparkConnectException("Invalid state during retry exception handling.")
-        except grpc.RpcError as rpc_error:
-            self._handle_error(rpc_error)
+        except Exception as error:
+            self._handle_error(error)
+
+    def _handle_error(self, error: Exception) -> NoReturn:
+        """
+        Handle errors that occur during RPC calls.
+
+        Parameters
+        ----------
+        error : Exception
+            An exception thrown during RPC calls.
+
+        Returns
+        -------
+        Throws the appropriate internal Python exception.
+        """
+        if isinstance(error, grpc.RpcError):
+            self._handle_rpc_error(cast(grpc.RpcError, error))
+        elif isinstance(error, ValueError):
+            if "Cannot invoke RPC" in str(error) and "closed" in str(error):
+                raise SparkConnectException(
+                    error_class="NO_ACTIVE_SESSION", message_parameters=dict()
+                )
```

Review Comment: Would be great to double-check how it looks in the console.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40536: [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions
HyukjinKwon commented on code in PR #40536: URL: https://github.com/apache/spark/pull/40536#discussion_r1150045902 ## python/pyspark/sql/connect/client.py: +raise SparkConnectException( +error_class="NO_ACTIVE_SESSION", message_parameters=dict() +) Review Comment: I think Python exception chaining would work here so it'd be fine.
[GitHub] [spark] HyukjinKwon closed pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions
HyukjinKwon closed pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions URL: https://github.com/apache/spark/pull/40572
[GitHub] [spark] HyukjinKwon commented on pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions
HyukjinKwon commented on PR #40572: URL: https://github.com/apache/spark/pull/40572#issuecomment-1486238107 Merged to master.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1486232865 @cloud-fan Can you please check my last comments.
[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates
shrprasa commented on PR #40128: URL: https://github.com/apache/spark/pull/40128#issuecomment-1486231326 Gentle ping @dongjoon-hyun @holdenk
[GitHub] [spark] amaliujia commented on a diff in pull request #40536: [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions
amaliujia commented on code in PR #40536: URL: https://github.com/apache/spark/pull/40536#discussion_r1150013541 ## python/pyspark/sql/connect/client.py: +raise SparkConnectException( +error_class="NO_ACTIVE_SESSION", message_parameters=dict() +) Review Comment: I am thinking this will swallow other ValueErrors? You probably need to re-throw the non-closed-channel ValueError?
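The concern in the comment above, that the `elif` branch could swallow `ValueError`s whose message does not match, can be demonstrated with a stand-alone sketch. The exception classes here are simplified stand-ins for the grpc and Spark Connect types, and the final re-raise is the behavior the reviewer is asking for, not necessarily what the PR implements:

```python
class RpcError(Exception):               # stand-in for grpc.RpcError
    pass

class SparkConnectException(Exception):  # stand-in for the real class
    pass

def handle_error(error):
    """Map known transport errors to SparkConnectException; re-raise the rest."""
    if isinstance(error, RpcError):
        raise SparkConnectException("rpc failure") from error
    if (isinstance(error, ValueError)
            and "Cannot invoke RPC" in str(error) and "closed" in str(error)):
        raise SparkConnectException("NO_ACTIVE_SESSION") from error
    raise error  # unmatched errors propagate instead of being swallowed

try:
    handle_error(ValueError("Cannot invoke RPC: channel closed"))
except SparkConnectException as e:
    print(e)  # NO_ACTIVE_SESSION
```

Without the trailing `raise error`, a `ValueError` with any other message would fall through and the function would return normally, losing the error.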
[GitHub] [spark] yaooqinn commented on pull request #40573: [SPARK-42943][SQL] Use LONGTEXT instead of TEXT for StringType for effective length
yaooqinn commented on PR #40573: URL: https://github.com/apache/spark/pull/40573#issuecomment-1486195050 cc @cloud-fan @dongjoon-hyun @HyukjinKwon, thanks
[GitHub] [spark] yaooqinn commented on a diff in pull request #40573: [SPARK-42943][SQL] Use LONGTEXT instead of TEXT for StringType for effective length
yaooqinn commented on code in PR #40573: URL: https://github.com/apache/spark/pull/40573#discussion_r1150009461 ## sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala: ## @@ -176,6 +176,7 @@ private case object MySQLDialect extends JdbcDialect with SQLConfHelper { // See SPARK-35446: MySQL treats REAL as a synonym to DOUBLE by default // We override getJDBCType so that FloatType is mapped to FLOAT instead case FloatType => Option(JdbcType("FLOAT", java.sql.Types.FLOAT)) +case StringType => Option(JdbcType("LONGTEXT", java.sql.Types.LONGVARCHAR)) Review Comment: Before this, it's ```case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))```
[GitHub] [spark] yaooqinn opened a new pull request, #40573: [SPARK-42943][SQL] Use LONGTEXT instead of TEXT for StringType for effective length
yaooqinn opened a new pull request, #40573: URL: https://github.com/apache/spark/pull/40573

### What changes were proposed in this pull request?
Referring to https://dev.mysql.com/doc/refman/8.0/en/string-type-syntax.html, a [TEXT](https://dev.mysql.com/doc/refman/8.0/en/blob.html) column holds a maximum of 65,535 (2^16 − 1) characters. We currently map Spark's string type to MySQL's `TEXT` and JDBC's `CLOB`. `TEXT` here is insufficient, and `CLOB` is incorrect; it should be `LONGVARCHAR` instead.

### Why are the changes needed?
Better compatibility with MySQL, and a bug fix.

### Does this PR introduce _any_ user-facing change?
Yes, you will no longer see `MysqlDataTruncation` when storing a string longer than 65,535 characters into a column defined via Spark's string type with the MySQL catalog.

### How was this patch tested?
New tests.
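For reference, MySQL's text types scale in powers of 2^8. Since Spark's `StringType` carries no declared length bound, only the largest type is a safe mapping target. A small sketch (the capacities are byte limits from the MySQL reference manual, so multi-byte charsets hold fewer characters; the helper function is hypothetical, not Spark or MySQL API):

```python
# Maximum lengths of MySQL text types, per the MySQL reference manual.
MYSQL_TEXT_LIMITS = {
    "TINYTEXT": 2**8 - 1,     # 255
    "TEXT": 2**16 - 1,        # 65,535 -- the previous Spark mapping
    "MEDIUMTEXT": 2**24 - 1,  # 16,777,215
    "LONGTEXT": 2**32 - 1,    # 4,294,967,295 -- the new mapping
}

def smallest_text_type(length):
    """Smallest MySQL text type that can hold a string of the given length."""
    for name, cap in MYSQL_TEXT_LIMITS.items():
        if length <= cap:
            return name
    raise ValueError("too long even for LONGTEXT")

print(smallest_text_type(65_535))  # TEXT
print(smallest_text_type(65_536))  # MEDIUMTEXT
```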
[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns
cloud-fan commented on code in PR #40300: URL: https://github.com/apache/spark/pull/40300#discussion_r1149997056 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala: ## @@ -297,8 +297,14 @@ abstract class InMemoryBaseTable( InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema) override def pruneColumns(requiredSchema: StructType): Unit = { - val schemaNames = metadataColumnNames ++ tableSchema.map(_.name) - schema = StructType(requiredSchema.filter(f => schemaNames.contains(f.name))) + // The required schema could contain conflict-renamed metadata columns, so we need to match + // them by their logical (original) names, not their current names. + val schemaNames = tableSchema.map(_.name).toSet + val prunedFields = requiredSchema.filter { Review Comment: > ... simply won't gain the ability to rename metadata columns on conflict. is it true? The rename happens at the Spark side, when Spark append metadata columns to the normal output columns.
[GitHub] [spark] cloud-fan commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit
cloud-fan commented on PR #40462: URL: https://github.com/apache/spark/pull/40462#issuecomment-1486163555 OK now I got the use case. At the time when we add the rebalance hint, we don't know what the final query is. Shall we make this optimization a bit more conservative to match both global and local limit? I still think it's risky to remove rebalance below local limit.
[GitHub] [spark] bersprockets commented on a diff in pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true
bersprockets commented on code in PR #40569: URL: https://github.com/apache/spark/pull/40569#discussion_r1149995693 ## sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala: ## @@ -2695,4 +2695,26 @@ class SubquerySuite extends QueryTest } } } + + test("SPARK-42937: Outer join with subquery in condition") { +withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false", Review Comment: @allisonwang-db When adaptive execution is enabled, `PlanAdaptiveSubqueries` always sets `shouldBroadcast=true`, so the subquery's result is available on the executor, if needed. https://github.com/apache/spark/blob/31965a06c9f85abf2296971237b1f88065eb67c2/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveSubqueries.scala#L46
[GitHub] [spark] srowen commented on pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom
srowen commented on PR #40568: URL: https://github.com/apache/spark/pull/40568#issuecomment-1486162373 Merged to master/3.4/3.3
[GitHub] [spark] srowen closed pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom
srowen closed pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom URL: https://github.com/apache/spark/pull/40568
[GitHub] [spark] HyukjinKwon commented on pull request #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator
HyukjinKwon commented on PR #40570: URL: https://github.com/apache/spark/pull/40570#issuecomment-1486155106 Merged to master and branch-3.4.
[GitHub] [spark] HyukjinKwon closed pull request #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator
HyukjinKwon closed pull request #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator URL: https://github.com/apache/spark/pull/40570
[GitHub] [spark] HyukjinKwon closed pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring
HyukjinKwon closed pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring URL: https://github.com/apache/spark/pull/40571
[GitHub] [spark] HyukjinKwon commented on pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring
HyukjinKwon commented on PR #40571: URL: https://github.com/apache/spark/pull/40571#issuecomment-1486153075 Merged to master.
[GitHub] [spark] smallzhongfeng commented on pull request #40572: [SPARK-37677][CORE] Unzip could keep file permissions
smallzhongfeng commented on PR #40572: URL: https://github.com/apache/spark/pull/40572#issuecomment-1486149209 Discussed in https://github.com/apache/spark/pull/35278#issuecomment-1033927506, PTAL. @HyukjinKwon
[GitHub] [spark] smallzhongfeng opened a new pull request, #40572: [SPARK-37677][CORE] Unzip could keep file permissions
smallzhongfeng opened a new pull request, #40572: URL: https://github.com/apache/spark/pull/40572

### What changes were proposed in this pull request?
Just remove the comment.

### Why are the changes needed?
After https://github.com/apache/hadoop/pull/4036, unzip can keep file permissions.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No tests needed; coverage has been added on the hadoop-client side.
[GitHub] [spark] wangyum commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit
wangyum commented on PR #40462: URL: https://github.com/apache/spark/pull/40462#issuecomment-1486144770 If such queries cannot be optimized, their performance will be very poor. We use a single partition to fetch data from MySQL, and increase parallelism for downstream computation after fetching the data:
```sql
CREATE VIEW full_query_log AS
SELECT h.* FROM query_log_hdfs h
UNION ALL
SELECT /*+ REBALANCE */ q.*, DATE(start) FROM query_log_mysql q;

SELECT * FROM full_query_log LIMIT 5;
```
[GitHub] [spark] cloud-fan closed pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate
cloud-fan closed pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate URL: https://github.com/apache/spark/pull/40558
[GitHub] [spark] cloud-fan commented on pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate
cloud-fan commented on PR #40558: URL: https://github.com/apache/spark/pull/40558#issuecomment-1486139796 thanks, merging to master/3.4!
[GitHub] [spark] yaooqinn commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side
yaooqinn commented on code in PR #40543: URL: https://github.com/apache/spark/pull/40543#discussion_r1149965908 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala: ## @@ -128,7 +128,7 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { "('abcd', 'efgh', 'ijkl', 'mnop', 'q')").executeUpdate() conn.prepareStatement("CREATE TABLE char_array_types (" + - "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 varchar(4)[], c4 bpchar[])" + "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 varchar(4)[], c4 bpchar(1)[])" Review Comment: updated
[GitHub] [spark] yaooqinn commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side
yaooqinn commented on code in PR #40543: URL: https://github.com/apache/spark/pull/40543#discussion_r1149965776 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala: ## @@ -86,6 +86,8 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTes s"jdbc:oracle:thin:system/$oracle_password@//$ip:$port/xe" } + override val typeMapping: Map[DataType, DataType] = Map(StringType -> VarcharType(255)) Review Comment: According to the accompanying comment, 255 comes from https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/index.html, but I think we misunderstood the doc; Int.MaxValue is much more reasonable.
[GitHub] [spark] allisonwang-db commented on a diff in pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true
allisonwang-db commented on code in PR #40569: URL: https://github.com/apache/spark/pull/40569#discussion_r1149956290 ## sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala: +withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false", Review Comment: Why does this fail only when AQE is disabled?
[GitHub] [spark] LuciferYang commented on pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`
LuciferYang commented on PR #40560: URL: https://github.com/apache/spark/pull/40560#issuecomment-1486109613 Thanks @gengliangwang ~
[GitHub] [spark] xinrong-meng commented on a diff in pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized
xinrong-meng commented on code in PR #40534: URL: https://github.com/apache/spark/pull/40534#discussion_r1149944567

## python/pyspark/sql/utils.py:
```
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return cast(FuncT, wrapped)


+def require_spark_context_initialized() -> SparkContext:
```
Review Comment: Good point, renamed.

## python/pyspark/sql/utils.py:
```
+def require_spark_context_initialized() -> SparkContext:
+    """Raise RuntimeError if SparkContext is not initialized,
+    otherwise, returns the active SparkContext."""
+    sc = SparkContext._active_spark_context
+    if sc is None or sc._jvm is None:
+        raise RuntimeError("SparkContext must be initialized for JVM access.")
```
Review Comment: Updated.
[GitHub] [spark] wankunde commented on pull request #40523: [SPARK-42897][SQL] Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition
wankunde commented on PR #40523: URL: https://github.com/apache/spark/pull/40523#issuecomment-1486090996 Hi @c21 @cloud-fan, this seems to be an SMJ full outer join codegen bug; could you have a look at this issue? Thanks
[GitHub] [spark] HyukjinKwon commented on pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true
HyukjinKwon commented on PR #40569: URL: https://github.com/apache/spark/pull/40569#issuecomment-1486087962 cc @allisonwang-db
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution
HyukjinKwon commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1149940286

## python/pyspark/sql/pandas/map_ops.py:
@@ -60,6 +60,7 @@ def mapInPandas( schema : :class:`pyspark.sql.types.DataType` or str the return type of the `func` in PySpark. The value can be either a :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+isBarrier : Use barrier mode execution if True.

Review Comment: https://github.com/apache/spark/pull/40571
[GitHub] [spark] HyukjinKwon opened a new pull request, #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring
HyukjinKwon opened a new pull request, #40571: URL: https://github.com/apache/spark/pull/40571

### What changes were proposed in this pull request?
This PR is a followup that proposes to:
- Add `versionchanged` in its docstring.
- Rename `isBarrier` to `barrier` to make it look Python friendly.
- Fix some wording and examples.

### Why are the changes needed?
For better documentation, and to make it more Python friendly.

### Does this PR introduce _any_ user-facing change?
Yes, it renames the parameter and fixes the documentation.

### How was this patch tested?
Linters in this PR should test them out.
[GitHub] [spark] HyukjinKwon commented on pull request #40571: [SPARK-42896][SQL][PYTHON][FOLLOW-UP] Rename isBarrier to barrier, and correct docstring
HyukjinKwon commented on PR #40571: URL: https://github.com/apache/spark/pull/40571#issuecomment-1486086978 cc @WeichenXu123 and @ueshin
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution
HyukjinKwon commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1149936423

## python/pyspark/sql/pandas/map_ops.py:
@@ -60,6 +60,7 @@ def mapInPandas( schema : :class:`pyspark.sql.types.DataType` or str the return type of the `func` in PySpark. The value can be either a :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+isBarrier : Use barrier mode execution if True.

Review Comment: let me just make a quick followup.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution
HyukjinKwon commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1149935975

## python/pyspark/sql/pandas/map_ops.py:
@@ -60,6 +60,7 @@ def mapInPandas( schema : :class:`pyspark.sql.types.DataType` or str the return type of the `func` in PySpark. The value can be either a :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+isBarrier : Use barrier mode execution if True.

Review Comment: Yes we should, with the directive `.. versionchanged`.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized
HyukjinKwon commented on code in PR #40534: URL: https://github.com/apache/spark/pull/40534#discussion_r1149935560

## python/pyspark/sql/utils.py:
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any: return cast(FuncT, wrapped)
+def require_spark_context_initialized() -> SparkContext:
+    """Raise RuntimeError if SparkContext is not initialized,
+    otherwise, returns the active SparkContext."""
+    sc = SparkContext._active_spark_context
+    if sc is None or sc._jvm is None:
+        raise RuntimeError("SparkContext must be initialized for JVM access.")

Review Comment: I would just say something like: `SparkContext or SparkSession should be created first` instead
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40534: [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized
HyukjinKwon commented on code in PR #40534: URL: https://github.com/apache/spark/pull/40534#discussion_r1149934958

## python/pyspark/sql/utils.py:
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any: return cast(FuncT, wrapped)
+def require_spark_context_initialized() -> SparkContext:

Review Comment: I would name it like: `get_jvm_spark_context`

## python/pyspark/sql/utils.py:
@@ -193,6 +193,15 @@ def wrapped(*args: Any, **kwargs: Any) -> Any: return cast(FuncT, wrapped)
+def require_spark_context_initialized() -> SparkContext:

Review Comment: I would name it like: `get_active_spark_context`
[GitHub] [spark] dongjoon-hyun commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas
dongjoon-hyun commented on PR #40521: URL: https://github.com/apache/spark/pull/40521#issuecomment-1486035003 Got it. Thanks. +1, LGTM too.
[GitHub] [spark] ueshin commented on a diff in pull request #40520: [SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution
ueshin commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1149903109

## python/pyspark/sql/pandas/map_ops.py:
@@ -60,6 +60,7 @@ def mapInPandas( schema : :class:`pyspark.sql.types.DataType` or str the return type of the `func` in PySpark. The value can be either a :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+isBarrier : Use barrier mode execution if True.

Review Comment: We should mark these new arguments with `.. versionadded`?
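For reference, the Sphinx version directives being discussed are plain text inside a numpydoc-style docstring. A minimal sketch of how such an annotation might look (the function body, parameter name, and version number are illustrative, not the actual `mapInPandas` signature):

```python
def map_in_pandas(func, schema, barrier=False):
    """Apply a function to each batch of data.

    .. versionchanged:: 3.5.0
        Added the ``barrier`` argument.

    Parameters
    ----------
    func : callable
        The transformation to apply to each batch.
    schema : str
        The return type as a DDL-formatted type string.
    barrier : bool, optional
        Use barrier mode execution if True.
    """
    # Toy body so the sketch is runnable: apply func to two fake "batches".
    return [func(batch) for batch in ([1], [2])]
```

Sphinx renders the `.. versionchanged::` block as a highlighted note in the generated API docs, which is why the followup PR adds it rather than only editing the parameter list.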
[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
HeartSaVioR commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1486008365 Sigh, I didn't realize we had already taken the step of adding a Scala API for Spark Connect. I thought it was only in PySpark. Thanks for correcting me.
[GitHub] [spark] ueshin opened a new pull request, #40570: [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator
ueshin opened a new pull request, #40570: URL: https://github.com/apache/spark/pull/40570

### What changes were proposed in this pull request?
Implements `DataFrame.toLocalIterator`. The argument `prefetchPartitions` won't take effect for Spark Connect.

### Why are the changes needed?
Missing API.

### Does this PR introduce _any_ user-facing change?
`DataFrame.toLocalIterator` will be available.

### How was this patch tested?
Enabled the related tests.
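`toLocalIterator` streams a DataFrame's rows to the driver one partition at a time instead of collecting everything at once. A rough sketch of that contract in plain Python (an illustration of the lazy, partition-at-a-time semantics, not the actual Spark Connect implementation):

```python
from typing import Iterator, List


def to_local_iterator(partitions: List[List[int]]) -> Iterator[int]:
    """Yield rows one partition at a time; later partitions are only
    'fetched' (iterated) when the consumer actually reaches them."""
    for partition in partitions:  # each partition is pulled lazily, on demand
        for row in partition:
            yield row
```

Consuming only the first few rows never touches the remaining partitions, which is the memory advantage over `collect()`; the `prefetchPartitions` hint, which eagerly requests the next partition, is the part the PR notes has no effect under Spark Connect.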
[GitHub] [spark] amaliujia commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
amaliujia commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1486006675 hmm, I am not sure what you already did, but I am thinking: if you don't add anything into https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, then you need to update `CheckConnectJvmClientCompatibility`.
[GitHub] [spark] HyukjinKwon closed pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas
HyukjinKwon closed pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas URL: https://github.com/apache/spark/pull/40521
[GitHub] [spark] HyukjinKwon commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas
HyukjinKwon commented on PR #40521: URL: https://github.com/apache/spark/pull/40521#issuecomment-1486003187 Merged to master.
[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
HeartSaVioR commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1485999022 I was wondering what is different between dropDuplicates and this one. I don't see dropDuplicates being handled separately; is it because the PySpark implementation of dropDuplicates is available? If this method has to be excluded, could you please guide me on how to do that? Thanks in advance.
[GitHub] [spark] amaliujia commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
amaliujia commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1485996207 I think you need to update connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala
[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
HeartSaVioR commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1485964292 @HyukjinKwon @amaliujia Would you mind if I ask what happens with the MiMa check for this PR? https://github.com/HeartSaVioR/spark/actions/runs/4536405777/jobs/7993077860 Is it required to add the PySpark API in this PR to pass the Spark Connect check? At least the MiMa check failed in the Scala codebase.
[GitHub] [spark] dongjoon-hyun commented on pull request #40533: [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix
dongjoon-hyun commented on PR #40533: URL: https://github.com/apache/spark/pull/40533#issuecomment-1485948727 Merged to master/3.4/3.3/3.2.
[GitHub] [spark] dongjoon-hyun closed pull request #40533: [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix
dongjoon-hyun closed pull request #40533: [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix URL: https://github.com/apache/spark/pull/40533
[GitHub] [spark] bersprockets opened a new pull request, #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true
bersprockets opened a new pull request, #40569: URL: https://github.com/apache/spark/pull/40569

### What changes were proposed in this pull request?
Change `PlanSubqueries` to set `shouldBroadcast` to true when instantiating an `InSubqueryExec` instance.

### Why are the changes needed?
The below left outer join gets an error:
```
create or replace temp view v1 as
select * from values
  (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
  as v1(key, value1, value2, value3, value4, value5,
        value6, value7, value8, value9, value10);

create or replace temp view v2 as
select * from values
  (1, 2), (3, 8), (7, 9)
  as v2(a, b);

create or replace temp view v3 as
select * from values (3), (8) as v3(col1);

set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
set spark.sql.adaptive.enabled=false;

select *
from v1
left outer join v2
on key = a
and key in (select col1 from v3);
```
The join fails during predicate codegen:
```
23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
  at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
  at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
  at scala.collection.immutable.List.map(List.scala:293)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
  at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
  at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
  at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
  at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
  at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
  at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
  at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
  at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
```
It fails again after falling back to interpreter mode:
```
23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
  at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
  at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
  at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
  at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
  at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
```
Both the predicate codegen and the evaluation fail for the same reason: `PlanSubqueries` creates `InSubqueryExec` with `shouldBroadcast=false`. The driver waits for the subquery to finish, but it is the executor that uses the results of the subquery (for predicate codegen or evaluation). Because `shouldBroadcast` is set to false, the result is stored in a transient field (`InSubqueryExec#result`), so the result of the subquery is not serialized when the `InSubqueryExec` instance is sent to the executor.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit test.
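The root cause described in that PR is the classic transient-field serialization pitfall: the driver computes a result, but a `@transient` field is dropped when the object is shipped to executors, which then see `null`. A rough Python analogue of the mechanism using pickle (the class and field names are made up for illustration):

```python
import pickle


class SubqueryResultHolder:
    """Illustration only: a cached result deliberately excluded from
    serialization, analogous to a @transient field in a Scala/Java class."""

    def __init__(self) -> None:
        self.result = None  # filled in on the "driver"

    def __getstate__(self):
        # Mimic Java's transient semantics: drop the cached result
        # from the serialized form sent to the "executor".
        state = self.__dict__.copy()
        state["result"] = None
        return state


holder = SubqueryResultHolder()
holder.result = [3, 8]  # driver finishes the subquery and caches its result
shipped = pickle.loads(pickle.dumps(holder))  # what the executor receives
```

After the round trip, `shipped.result` is empty even though the driver's copy is populated, which mirrors why `InSubqueryExec` must either serialize the result (the broadcast path) or fail on the executor.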
[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
HeartSaVioR commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1485834545 Just added a dummy implementation.
[GitHub] [spark] dongjoon-hyun commented on pull request #40568: [SPARK-42922][SQL] Move from Random to SecureRandom
dongjoon-hyun commented on PR #40568: URL: https://github.com/apache/spark/pull/40568#issuecomment-1485827152 According to the `Affected Version` in JIRA, I also agree with backporting to the applicable release branches.
[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
HeartSaVioR commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1485819017 The error only occurred in the linter: it now does not allow a new PR to introduce a new public API without adding it to Spark Connect. This PR intentionally postpones addressing PySpark to a separate JIRA ticket, hence addressing Spark Connect should go there as well.
[GitHub] [spark] srowen commented on pull request #40568: [SPARK-42922][SQL]: Move from Random to SecureRandom
srowen commented on PR #40568: URL: https://github.com/apache/spark/pull/40568#issuecomment-1485705422 I think it's fine. These do look like better usages of RNGs. Let's see what tests say.
[GitHub] [spark] mridulm commented on pull request #40568: SPARK-42922: Move from Random to SecureRandom
mridulm commented on PR #40568: URL: https://github.com/apache/spark/pull/40568#issuecomment-1485666995 +CC @srowen
[GitHub] [spark] mridulm opened a new pull request, #40568: SPARK-42922: Move from Random to SecureRandom
mridulm opened a new pull request, #40568: URL: https://github.com/apache/spark/pull/40568

### What changes were proposed in this pull request?
Most uses of `Random` in Spark are either in test cases or where we need a pseudo-random number which is repeatable. Use `SecureRandom` instead of `Random` for the subset of cases where it helps.

### Why are the changes needed?
Use of `SecureRandom` in more security-sensitive contexts. This was flagged in our internal scans as well.

### Does this PR introduce _any_ user-facing change?
Directly, no. It would improve the security posture of Apache Spark.

### How was this patch tested?
Existing unit tests
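The PR itself is Scala/Java, but the distinction it draws is general: a seedable pseudo-random generator is fine for repeatable test data and sampling, while security-sensitive values should come from an OS-backed CSPRNG. A hedged Python analogue of the same idea (`java.security.SecureRandom` corresponds roughly to Python's `secrets` / `random.SystemRandom`):

```python
import random
import secrets

# Repeatable pseudo-randomness: fine for tests and sampling, because
# a fixed seed reproduces the exact same sequence every run.
rng = random.Random(42)
repeatable = [rng.randint(0, 9) for _ in range(5)]

# Security-sensitive randomness: unpredictable and OS-entropy backed,
# suitable for secrets and tokens (the analogue of SecureRandom).
auth_token = secrets.token_hex(16)  # 32 hex characters, not reproducible
```

The trade-off is the same as in the JVM: the secure source cannot be seeded for reproducibility, which is exactly why the PR converts only the security-sensitive call sites and leaves repeatable test usages on the plain generator.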
[GitHub] [spark] gengliangwang commented on pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`
gengliangwang commented on PR #40560: URL: https://github.com/apache/spark/pull/40560#issuecomment-1485658926 +1, @LuciferYang Thanks for the work!
[GitHub] [spark] anchovYu commented on pull request #40558: [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate
anchovYu commented on PR #40558: URL: https://github.com/apache/spark/pull/40558#issuecomment-1485580609 @cloud-fan
[GitHub] [spark] johanl-db commented on a diff in pull request #40545: [SPARK-42918] Generalize handling of metadata attributes in FileSourceStrategy
johanl-db commented on code in PR #40545: URL: https://github.com/apache/spark/pull/40545#discussion_r1149566989

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
@@ -519,6 +519,13 @@ object FileSourceMetadataAttribute {
  def cleanupFileSourceMetadataInformation(attr: Attribute): Attribute =
    attr.withMetadata(removeInternalMetadata(attr.metadata))
+ /**
+  * Cleanup the internal metadata information of a struct field, if it is

Review Comment: Done

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
@@ -221,14 +225,23 @@ FileSourceConstantMetadataStructField(FILE_MODIFICATION_TIME, TimestampType, nullable = false))
  /**
-  * Create a file metadata struct column containing fields supported by the given file format.
+  * All fields the file format's _metadata struct defines.

Review Comment: Done

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
@@ -322,18 +329,13 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
    // extra Project node: wrap flat metadata columns to a metadata struct
    val withMetadataProjections = metadataStructOpt.map { metadataStruct =>
-     val structColumns = metadataColumns.map { col => col.name match {
-       case FileFormat.FILE_PATH | FileFormat.FILE_NAME | FileFormat.FILE_SIZE |
-         FileFormat.FILE_BLOCK_START | FileFormat.FILE_BLOCK_LENGTH |
-         FileFormat.FILE_MODIFICATION_TIME =>
-         col
-       case FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME =>
-         generatedMetadataColumns
-           .find(_.name == FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME)
-           // Change the `_tmp_metadata_row_index` to `row_index`,
-           // and also change the nullability to not nullable,
-           // which is consistent with the nullability of `row_index` field
-           .get.withName(FileFormat.ROW_INDEX).withNullability(false)
+     val structColumns = metadataStruct.dataType.asInstanceOf[StructType].fields.map { field =>
+       // Construct the metadata struct the query expects to see, using the columns we previously
+       // created. Be sure to restore the proper name and nullability for each metadata field.
+       metadataColumnsByName(field.name) match {

Review Comment: I simplified this code to always restore the name and nullability, I don't think it warrants a method anymore. I actually think it was not correct before that since you could have a generated column created with `nullable = true` on [L272](https://github.com/apache/spark/pull/40545/commits/2f5f9a26ec5cdf0c04893a18e2cfa386228e7aa2#diff-7e4f78e90e8699733afbe43e2b265b95e514896ac68f1fb9e60705d59a0b7ed9R272) and nullability wouldn't be reset if its `name` and `internalName` are the same

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
@@ -236,37 +247,41 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
      // For generated metadata columns, they are set as nullable when passed to readers,
      // as the values will be null when trying to read the missing column from the file.
      // They are then replaced by the actual values later in the process.
-     // All metadata columns will be non-null in the returned output.
-     // We then change the nullability to non-nullable in the metadata projection node below.
-     val constantMetadataColumns: mutable.Buffer[Attribute] = mutable.Buffer.empty
-     val generatedMetadataColumns: mutable.Buffer[Attribute] = mutable.Buffer.empty
+     // We then restore the specified nullability in the metadata projection node below.
+     // Also remember the attribute for each column name, so we can easily map back to it.
+     val constantMetadataColumns = mutable.Buffer.empty[Attribute]
+     val generatedMetadataColumns = mutable.Buffer.empty[Attribute]
+     val metadataColumnsByName = mutable.Map.empty[String, Attribute]
      metadataStructOpt.foreach { metadataStruct =>
-       metadataStruct.dataType.asInstanceOf[StructType].fields.foreach { field =>
-         field.name match {
-           case FileFormat.ROW_INDEX =>
-             if ((readDataColumns ++ partitionColumns).map(_.name.toLowerCase(Locale.ROOT))
-               .contains(FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME)) {
-               throw new AnalysisException(FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME +
-                 " is a reserved column name that cannot be read in combination with " +
-                 s"${FileFormat.METADATA_NAME}.${FileFormat.ROW_INDEX} column.")
-             }
-             generatedMetadataColumns +=
[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit
dongjoon-hyun commented on PR #40462: URL: https://github.com/apache/spark/pull/40462#issuecomment-1485539305 `HINT` is a part of the `SELECT` clause. https://github.com/apache/spark/blob/a3d9e0ae0f95a55766078da5d0bf0f74f3c3cfc3/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L530-L532 That's why `LIMIT` is the root of the query. ``` scala> sql("SELECT /*+ REBALANCE */ * FROM t WHERE id > 1 LIMIT 5").explain(true) == Parsed Logical Plan == 'GlobalLimit 5 +- 'LocalLimit 5 +- 'UnresolvedHint REBALANCE +- 'Project [*] +- 'Filter ('id > 1) +- 'UnresolvedRelation [t], [], false ```
[GitHub] [spark] LuciferYang commented on pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`
LuciferYang commented on PR #40566: URL: https://github.com/apache/spark/pull/40566#issuecomment-1485485240 Thanks very much @dongjoon-hyun
[GitHub] [spark] LuciferYang commented on pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`
LuciferYang commented on PR #40560: URL: https://github.com/apache/spark/pull/40560#issuecomment-1485482919 Thanks @dongjoon-hyun @HyukjinKwon
[GitHub] [spark] dongjoon-hyun closed pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`
dongjoon-hyun closed pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin` URL: https://github.com/apache/spark/pull/40566
[GitHub] [spark] dongjoon-hyun commented on pull request #40566: [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin`
dongjoon-hyun commented on PR #40566: URL: https://github.com/apache/spark/pull/40566#issuecomment-1485469907 I verified this manually via Maven. Merged to master/3.4/3.3/3.2.
[GitHub] [spark] dongjoon-hyun closed pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]`
dongjoon-hyun closed pull request #40560: [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]` URL: https://github.com/apache/spark/pull/40560
[GitHub] [spark] LuciferYang commented on a diff in pull request #40566: [SPARK-42934][BUILD] Move test property `spark.hadoop.hadoop.security.key.provider.path` from `maven-surefire-plugin` to `scala
LuciferYang commented on code in PR #40566: URL: https://github.com/apache/spark/pull/40566#discussion_r1149480752 ## pom.xml: ## @@ -2970,7 +2970,6 @@ false true true - test:/// Review Comment: OK ~
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40566: [SPARK-42934][BUILD] Move test property `spark.hadoop.hadoop.security.key.provider.path` from `maven-surefire-plugin` to `sca
dongjoon-hyun commented on code in PR #40566: URL: https://github.com/apache/spark/pull/40566#discussion_r1149478852 ## pom.xml: ## @@ -2970,7 +2970,6 @@ false true true - test:/// Review Comment: To be safe, shall we keep this, @LuciferYang? We have `spark.ui.enabled` in both places.
[GitHub] [spark] dongjoon-hyun commented on pull request #40566: [SPARK-42934][BUILD] Move test property `spark.hadoop.hadoop.security.key.provider.path` from `maven-surefire-plugin` to `scalatest-mav
dongjoon-hyun commented on PR #40566: URL: https://github.com/apache/spark/pull/40566#issuecomment-1485412215 Thank you for pinging me, @LuciferYang.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side
dongjoon-hyun commented on code in PR #40543: URL: https://github.com/apache/spark/pull/40543#discussion_r1149467878 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala: ## @@ -86,6 +86,8 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTes s"jdbc:oracle:thin:system/$oracle_password@//$ip:$port/xe" } + override val typeMapping: Map[DataType, DataType] = Map(StringType -> VarcharType(255)) Review Comment: Is `255` a DB-specific value? Where did this `255` come from? In MySQLIntegrationSuite, it seems to be `Int.MaxValue`.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40543: [SPARK-42916][SQL] JDBCTableCatalog Keeps Char/Varchar meta on the read-side
dongjoon-hyun commented on code in PR #40543: URL: https://github.com/apache/spark/pull/40543#discussion_r1149465747 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala: ## @@ -128,7 +128,7 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { "('abcd', 'efgh', 'ijkl', 'mnop', 'q')").executeUpdate() conn.prepareStatement("CREATE TABLE char_array_types (" + - "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 varchar(4)[], c4 bpchar[])" + "c0 char(4)[], c1 character(4)[], c2 character varying(4)[], c3 varchar(4)[], c4 bpchar(1)[])" Review Comment: Since these are non-trivial findings, could you add some comments, please, @yaooqinn?
[GitHub] [spark] dongjoon-hyun commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5
dongjoon-hyun commented on PR #39124: URL: https://github.com/apache/spark/pull/39124#issuecomment-1485388919 Ya, right. I forgot to say that. Thank you so much, @steveloughran and @sunchao too.
[GitHub] [spark] LuciferYang commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5
LuciferYang commented on PR #39124: URL: https://github.com/apache/spark/pull/39124#issuecomment-1485382860 Thanks @dongjoon-hyun @sunchao @steveloughran
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns
ryan-johnson-databricks commented on code in PR #40300: URL: https://github.com/apache/spark/pull/40300#discussion_r1149407329 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala: ## @@ -297,8 +297,14 @@ abstract class InMemoryBaseTable( InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema) override def pruneColumns(requiredSchema: StructType): Unit = { - val schemaNames = metadataColumnNames ++ tableSchema.map(_.name) - schema = StructType(requiredSchema.filter(f => schemaNames.contains(f.name))) + // The required schema could contain conflict-renamed metadata columns, so we need to match + // them by their logical (original) names, not their current names. + val schemaNames = tableSchema.map(_.name).toSet + val prunedFields = requiredSchema.filter { Review Comment: As far as I can tell, v2 intentionally puts the burden of pruning (and of metadata handling) on the datasource. So by design, every datasource _does_ have to implement the policy it wants to use. But that's not necessarily a bad thing -- a datasource that chooses not to implement this change simply won't gain the ability to rename metadata columns on conflict.
[GitHub] [spark] dongjoon-hyun closed pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5
dongjoon-hyun closed pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5 URL: https://github.com/apache/spark/pull/39124
[GitHub] [spark] clownxc commented on a diff in pull request #40400: [SPARK-41359][SQL] Use `PhysicalDataType` instead of DataType in UnsafeRow
clownxc commented on code in PR #40400: URL: https://github.com/apache/spark/pull/40400#discussion_r1149359986 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java: ## @@ -70,51 +66,25 @@ public static int calculateBitSetWidthInBytes(int numFields) { return ((numFields + 63)/ 64) * 8; } - /** - * Field types that can be updated in place in UnsafeRows (e.g. we support set() for these types) - */ - public static final Set mutableFieldTypes; - - // DecimalType, DayTimeIntervalType and YearMonthIntervalType are also mutable - static { -mutableFieldTypes = Collections.unmodifiableSet( - new HashSet<>( -Arrays.asList( - NullType, - BooleanType, - ByteType, - ShortType, - IntegerType, - LongType, - FloatType, - DoubleType, - DateType, - TimestampType, - TimestampNTZType -))); - } - public static boolean isFixedLength(DataType dt) { Review Comment: OK, let me do it.
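For readers following the refactor above: the review asks to replace the `HashSet`-membership check over logical types with a dispatch on physical types. A plain-Python model of that shape (type names only mirror Spark's; the `precision <= 18` fits-in-a-long cutoff is an assumption of this sketch, not a quote of Spark's code):

```python
# Hypothetical stand-ins for Spark's physical types; nothing here uses Spark.
from dataclasses import dataclass

class PhysicalDataType: ...
class PhysicalLongType(PhysicalDataType): ...
class PhysicalDoubleType(PhysicalDataType): ...
class PhysicalStringType(PhysicalDataType): ...

@dataclass
class PhysicalDecimalType(PhysicalDataType):
    precision: int
    scale: int

def is_fixed_length(dt: PhysicalDataType) -> bool:
    """Dispatch on the physical type instead of membership in a set of
    logical DataTypes; fixed-length values can be updated in place."""
    if isinstance(dt, PhysicalDecimalType):
        return dt.precision <= 18  # assumed cutoff: the value fits in one long
    return isinstance(dt, (PhysicalLongType, PhysicalDoubleType))
```

The dispatch is exhaustive over the (sealed, in Spark's Scala) hierarchy, so adding a new physical type forces a conscious decision about its fixed-length status.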
[GitHub] [spark] LuciferYang commented on a diff in pull request #40563: [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor `array_append` and `array_prepend` with `RuntimeReplaceable`
LuciferYang commented on code in PR #40563: URL: https://github.com/apache/spark/pull/40563#discussion_r1149348790 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala: ## @@ -1400,120 +1400,27 @@ case class ArrayContains(left: Expression, right: Expression) copy(left = newLeft, right = newRight) } -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = """ - _FUNC_(array, element) - Add the element at the beginning of the array passed as first - argument. Type of element should be the same as the type of the elements of the array. - Null element is also prepended to the array. But if the array passed is NULL - output is NULL -""", - examples = """ -Examples: - > SELECT _FUNC_(array('b', 'd', 'c', 'a'), 'd'); - ["d","b","d","c","a"] - > SELECT _FUNC_(array(1, 2, 3, null), null); - [null,1,2,3,null] - > SELECT _FUNC_(CAST(null as Array), 2); - NULL - """, - group = "array_funcs", - since = "3.5.0") -case class ArrayPrepend(left: Expression, right: Expression) - extends BinaryExpression -with ImplicitCastInputTypes -with ComplexTypeMergingExpression -with QueryErrorsBase { +trait ArrayInsertEnd extends RuntimeReplaceable Review Comment: Should we rename this trait? `ArrayInsertEnd` cannot describe `ArrayPrepend` well.
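For context on the `RuntimeReplaceable` refactor under review: rather than each expression carrying its own evaluation and codegen, `array_prepend`/`array_append` are rewritten into `array_insert` calls. A plain-Python model of that rewrite on ordinary lists (1-based positions as in SQL; the function names are this sketch's own, not Spark's API):

```python
def array_insert(arr, pos, elem):
    """Insert elem at 1-based position pos; a negative pos counts from the end."""
    idx = pos - 1 if pos > 0 else len(arr) + pos + 1
    return arr[:idx] + [elem] + arr[idx:]

# The RuntimeReplaceable rewrite: prepend/append become array_insert calls,
# so they need no evaluation logic of their own.
def array_prepend(arr, elem):
    return array_insert(arr, 1, elem)

def array_append(arr, elem):
    return array_insert(arr, len(arr) + 1, elem)
```

This is also why the trait name matters: the shared base describes "insert at a fixed end of the array", and `ArrayInsertEnd` only names one of the two ends.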
[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns
cloud-fan commented on code in PR #40300: URL: https://github.com/apache/spark/pull/40300#discussion_r1149347950 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala: ## @@ -297,8 +297,14 @@ abstract class InMemoryBaseTable( InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema) override def pruneColumns(requiredSchema: StructType): Unit = { - val schemaNames = metadataColumnNames ++ tableSchema.map(_.name) - schema = StructType(requiredSchema.filter(f => schemaNames.contains(f.name))) + // The required schema could contain conflict-renamed metadata columns, so we need to match + // them by their logical (original) names, not their current names. + val schemaNames = tableSchema.map(_.name).toSet + val prunedFields = requiredSchema.filter { Review Comment: Shall we move this handling to `V2ScanRelationPushDown`? The in-memory table is just a testing v2 source, and we can't expect all existing v2 sources to make the same change as here.
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns
ryan-johnson-databricks commented on code in PR #40300: URL: https://github.com/apache/spark/pull/40300#discussion_r1149336751 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala: ## @@ -297,8 +297,14 @@ abstract class InMemoryBaseTable( InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema) override def pruneColumns(requiredSchema: StructType): Unit = { - val schemaNames = metadataColumnNames ++ tableSchema.map(_.name) - schema = StructType(requiredSchema.filter(f => schemaNames.contains(f.name))) + // The required schema could contain conflict-renamed metadata columns, so we need to match + // them by their logical (original) names, not their current names. + val schemaNames = tableSchema.map(_.name).toSet + val prunedFields = requiredSchema.filter { Review Comment: Indeed the latter -- that's why this code needs to search for metadata columns by their logical names, and also why `createReaderFactory` needs to map them back to their logical names (because the internal code only knows about the logical names).
[GitHub] [spark] cloud-fan commented on pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns
cloud-fan commented on PR #40300: URL: https://github.com/apache/spark/pull/40300#issuecomment-1485216976 About https://github.com/apache/spark/pull/40300/files#r1129818813, I think if `SubqueryAlias` can't propagate metadata columns, then `df.metadataColumn` should not be able to get the column. What do you think? @ryan-johnson-databricks
[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns
cloud-fan commented on code in PR #40300: URL: https://github.com/apache/spark/pull/40300#discussion_r1149329725 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala: ## @@ -297,8 +297,14 @@ abstract class InMemoryBaseTable( InMemoryBatchScan(data.map(_.asInstanceOf[InputPartition]), schema, tableSchema) override def pruneColumns(requiredSchema: StructType): Unit = { - val schemaNames = metadataColumnNames ++ tableSchema.map(_.name) - schema = StructType(requiredSchema.filter(f => schemaNames.contains(f.name))) + // The required schema could contain conflict-renamed metadata columns, so we need to match + // them by their logical (original) names, not their current names. + val schemaNames = tableSchema.map(_.name).toSet + val prunedFields = requiredSchema.filter { Review Comment: Does `requiredSchema` contain the logical name or the physical name? I think it's the latter, as `V2ScanRelationPushDown` does not handle metadata column name mapping.
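To make the logical-vs-current naming question in this thread concrete, here is a plain-Python toy model (all names hypothetical; this is not Spark's API): pruning keeps a required column either because it is a real data column, or because its current, conflict-renamed name maps back to a known logical metadata column name.

```python
# The data schema already has a "_metadata" column, so the metadata column
# surfaces under a renamed current name ("__metadata" in this sketch).
table_columns = {"id", "_metadata"}                   # data columns (current names)
logical_metadata_name = {"__metadata": "_metadata"}   # current name -> logical name

def prune_columns(required):
    """Keep data columns, plus any column whose current name resolves back
    to a known logical metadata column name."""
    return [c for c in required
            if c in table_columns or c in logical_metadata_name]
```

The point of the discussion is where this mapping should live: in each datasource's `pruneColumns`, or centrally in `V2ScanRelationPushDown`.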
[GitHub] [spark] clownxc commented on a diff in pull request #40400: [SPARK-41359][SQL] Use `PhysicalDataType` instead of DataType in UnsafeRow
clownxc commented on code in PR #40400: URL: https://github.com/apache/spark/pull/40400#discussion_r1149320479 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala: ## @@ -19,48 +19,83 @@ package org.apache.spark.sql.catalyst.types import org.apache.spark.sql.types._ -sealed abstract class PhysicalDataType - -case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) extends PhysicalDataType - -class PhysicalBinaryType() extends PhysicalDataType +sealed abstract class PhysicalDataType{ + private[sql] def isPrimitive: Boolean +} + +case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) + extends PhysicalDataType { + override def isPrimitive: Boolean = false +} + +class PhysicalBinaryType() extends PhysicalDataType { + override def isPrimitive: Boolean = false +} case object PhysicalBinaryType extends PhysicalBinaryType -class PhysicalBooleanType() extends PhysicalDataType +class PhysicalBooleanType() extends PhysicalDataType { + override def isPrimitive: Boolean = true Review Comment: Thank you for the code review comments; I will try to modify the code as you suggest.
[GitHub] [spark] zhmin opened a new pull request, #40567: [SPARK-42935] [SQL] Add union required distribution push down
zhmin opened a new pull request, #40567: URL: https://github.com/apache/spark/pull/40567 ### What changes were proposed in this pull request? We introduce a new idea to optimize the exchange plan when a union Spark plan's output partitioning can't match the parent plan's required distribution. 1. First, introduce a new RDD that consists of parent RDDs with the same number of partitions. The ith partition corresponds to the ith partition of each parent RDD. 2. Then push the required distribution down to the union plan's children. If any child's output partitioning matches the required distribution, we can eliminate that child's shuffle operation. ### Why are the changes needed? The union plan does not take full advantage of its children's output partitionings when its output partitioning can't match the parent plan's required distribution. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? org.apache.spark.sql.execution.PlannerSuite org.apache.spark.sql.execution.exchange.UnionZipRDDSuite
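The zipped-union RDD proposed in step 1 can be modeled on plain collections: each parent contributes its ith partition to the ith output partition. A plain-Python sketch of the idea (not Spark's RDD API; `union_zip` is a name invented here):

```python
def union_zip(parents):
    """The ith output partition is the concatenation of the ith partition
    of every parent; all parents must have equal partition counts."""
    counts = {len(p) for p in parents}
    assert len(counts) <= 1, "parents must have the same number of partitions"
    n = len(parents[0]) if parents else 0
    return [[x for p in parents for x in p[i]] for i in range(n)]
```

Because a child whose partitioning already satisfies the required distribution keeps its partition boundaries, that child contributes its partitions directly and needs no shuffle.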
[GitHub] [spark] LuciferYang commented on pull request #40566: [SPARK-42934][SQL][TESTS] Move `spark.hadoop.hadoop.security.key.provider.path` from `systemPropertyVariables` of `maven-surefire-plugin`
LuciferYang commented on PR #40566: URL: https://github.com/apache/spark/pull/40566#issuecomment-1485155792 cc @dongjoon-hyun FYI
[GitHub] [spark] LuciferYang opened a new pull request, #40566: [SPARK-42934][SQL][TESTS] Move `spark.hadoop.hadoop.security.key.provider.path` from `systemPropertyVariables` of `maven-surefire-plugin
LuciferYang opened a new pull request, #40566: URL: https://github.com/apache/spark/pull/40566 ### What changes were proposed in this pull request? When testing `OrcEncryptionSuite` with Maven, all test suites are always skipped. So this PR moves `spark.hadoop.hadoop.security.key.provider.path` from `systemPropertyVariables` of `maven-surefire-plugin` to `systemProperties` of `scalatest-maven-plugin`, so that `OrcEncryptionSuite` can be tested with Maven. ### Why are the changes needed? Make `OrcEncryptionSuite` testable with Maven. ### Does this PR introduce _any_ user-facing change? No, just for Maven tests ### How was this patch tested? - Pass GitHub Actions - Manual testing: run ``` build/mvn clean install -pl sql/core -DskipTests -am build/mvn test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite ``` **Before** ``` Discovery starting. Discovery completed in 3 seconds, 218 milliseconds. Run starting. Expected test count is: 4 OrcEncryptionSuite: 21:57:58.344 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - Write and read an encrypted file !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider@5af5d76f doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:37) - Write and read an encrypted table !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider@5ad6cc21 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:65) - SPARK-35325: Write and read encrypted nested columns !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider@691124ee doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:116) - SPARK-35992: Write and read fully-encrypted columns with default masking !!! CANCELED !!!
[] was empty org.apache.orc.impl.NullKeyProvider@5403799b doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:166) 21:58:00.035 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) = Run completed in 5 seconds, 41 milliseconds. Total number of tests run: 0 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 0, canceled 4, ignored 0, pending 0 No tests were executed. ``` **After** ``` Discovery starting. Discovery completed in 3 seconds, 185 milliseconds. Run starting. Expected test count is: 4 OrcEncryptionSuite: 21:58:46.540 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - Write and read an encrypted file - Write and read an encrypted table - SPARK-35325: Write and read encrypted nested columns - SPARK-35992: Write and read fully-encrypted columns with default masking 21:58:51.933 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) = Run completed in 8 seconds, 708 milliseconds. Total number of tests run: 4 Suites: completed 2, aborted 0 Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
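The described fix can be sketched as a POM fragment: the system property moves out of `maven-surefire-plugin`'s `systemPropertyVariables` into `scalatest-maven-plugin`'s `systemProperties`, so suites executed by scalatest (such as `OrcEncryptionSuite`) see it. The property value below is a placeholder, not the actual value from Spark's pom.xml:

```xml
<!-- Sketch only: element placement follows scalatest-maven-plugin conventions,
     and the property value shown is illustrative. -->
<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <configuration>
    <systemProperties>
      <spark.hadoop.hadoop.security.key.provider.path>test:///</spark.hadoop.hadoop.security.key.provider.path>
    </systemProperties>
  </configuration>
</plugin>
```

The key point is that `maven-surefire-plugin` settings only apply to surefire-run tests, so properties needed by scalatest-run suites must be declared on the scalatest plugin itself.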
[GitHub] [spark] MaxGekk opened a new pull request, #40565: [WIP][SPARK-42873][SQL] Define Spark SQL types as keywords
MaxGekk opened a new pull request, #40565: URL: https://github.com/apache/spark/pull/40565 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] wangyum commented on a diff in pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.12.4
wangyum commented on code in PR #40555: URL: https://github.com/apache/spark/pull/40555#discussion_r1149215476 ## pom.xml: ## @@ -325,6 +325,17 @@ + Review Comment: Maven needs this to download Parquet 1.12.4.
[GitHub] [spark] wangyum commented on a diff in pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.12.4
wangyum commented on code in PR #40555: URL: https://github.com/apache/spark/pull/40555#discussion_r1149214805 ## project/SparkBuild.scala: ## @@ -307,7 +307,9 @@ object SparkBuild extends PomBuild { DefaultMavenRepository, Resolver.mavenLocal, Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns) -), +) ++ Seq( + "staging-releases-mirror" at "https://repository.apache.org/content/repositories/staging/", Review Comment: sbt needs this to download Parquet 1.12.4.
[GitHub] [spark] LuciferYang commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5
LuciferYang commented on PR #39124: URL: https://github.com/apache/spark/pull/39124#issuecomment-1485027282 > what version of jettison has come in from hadoop-common? > > HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports. 1.5.3
[GitHub] [spark] beliefer commented on pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit
beliefer commented on PR #40355: URL: https://github.com/apache/spark/pull/40355#issuecomment-1484986064 ping @hvanhovell
[GitHub] [spark] beliefer commented on pull request #40528: [SPARK-42584][CONNECT] Improve output of Column.explain
beliefer commented on PR #40528: URL: https://github.com/apache/spark/pull/40528#issuecomment-1484985770 ping @hvanhovell
[GitHub] [spark] steveloughran commented on pull request #39124: [SPARK-42913][BUILD] Upgrade Hadoop to 3.3.5
steveloughran commented on PR #39124: URL: https://github.com/apache/spark/pull/39124#issuecomment-1484958857 what version of jettison has come in from hadoop-common? HADOOP-18676 has gone in this weekend to exclude transitive jettison dependencies which don't get into a hadoop tarball, but will come in from pom imports.
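For context, excluding a transitive dependency such as jettison from a Hadoop artifact in a Maven POM looks roughly like this (the surrounding dependency block is illustrative, not a quote from Spark's pom.xml):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- keep the transitive jettison out; pin a patched version explicitly elsewhere if needed -->
    <exclusion>
      <groupId>org.codehaus.jettison</groupId>
      <artifactId>jettison</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

This is what HADOOP-18676 does on the Hadoop side for consumers of its POMs, so downstream projects no longer pick jettison up transitively.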
[GitHub] [spark] jaceklaskowski commented on a diff in pull request #39907: [SPARK-42359][SQL] Support row skipping when reading CSV files
jaceklaskowski commented on code in PR #39907: URL: https://github.com/apache/spark/pull/39907#discussion_r1149152549 ## docs/sql-data-sources-csv.md: ## @@ -102,6 +102,12 @@ Data source options of CSV can be set via: For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option. read/write + +skipLines +0 +Sets a number of non-empty, uncommented lines to skip before parsing each of the CSV files. If the header option is set to true, the first line after the number of skipLines will be taken as the header. Review Comment: nit: s/a number/the number + "before parsing CSV files" ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala: ## @@ -25,21 +25,33 @@ import org.apache.spark.sql.functions._ object CSVUtils { /** - * Filter ignorable rows for CSV dataset (lines empty and starting with `comment`). - * This is currently being used in CSV schema inference. + * Filter blank lines, remove comments, and skip specified rows from a CSV iterator. Then blank + * entries (and comments if set) are removed. This is currently being used in CSV schema + * inference. */ - def filterCommentAndEmpty(lines: Dataset[String], options: CSVOptions): Dataset[String] = { + def filterUnwantedLines(lines: Dataset[String], options: CSVOptions): Dataset[String] = { // Note that this was separately made by SPARK-18362. Logically, this should be the same -// with the one below, `filterCommentAndEmpty` but execution path is different. One of them +// with the one below, `filterUnwantedLines` but execution path is different. One of them // might have to be removed in the near future if possible. 
import lines.sqlContext.implicits._ val aliased = lines.toDF("value") val nonEmptyLines = aliased.filter(length(trim($"value")) > 0) -if (options.isCommentSet) { - nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String] -} else { - nonEmptyLines.as[String] +val commentFilteredLines = { + if (options.isCommentSet) { + nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String] + } else { +nonEmptyLines.as[String] + } } +// Note that unlike actual CSV reading path, it simply filters the given skipped lines. +// Therefore, this skips the line same with the skipped lines if exists. +val linesToSkip = commentFilteredLines.head(options.skipLines) +commentFilteredLines.rdd Review Comment: Is `.rdd` required here? `Dataset.mapPartitions` should give us what we want, shouldn't it? ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala: ## @@ -25,21 +25,33 @@ import org.apache.spark.sql.functions._ object CSVUtils { /** - * Filter ignorable rows for CSV dataset (lines empty and starting with `comment`). - * This is currently being used in CSV schema inference. + * Filter blank lines, remove comments, and skip specified rows from a CSV iterator. Then blank + * entries (and comments if set) are removed. This is currently being used in CSV schema + * inference. */ - def filterCommentAndEmpty(lines: Dataset[String], options: CSVOptions): Dataset[String] = { + def filterUnwantedLines(lines: Dataset[String], options: CSVOptions): Dataset[String] = { // Note that this was separately made by SPARK-18362. Logically, this should be the same -// with the one below, `filterCommentAndEmpty` but execution path is different. One of them +// with the one below, `filterUnwantedLines` but execution path is different. One of them // might have to be removed in the near future if possible. 
import lines.sqlContext.implicits._ val aliased = lines.toDF("value") val nonEmptyLines = aliased.filter(length(trim($"value")) > 0) -if (options.isCommentSet) { - nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String] -} else { - nonEmptyLines.as[String] +val commentFilteredLines = { + if (options.isCommentSet) { + nonEmptyLines.filter(!$"value".startsWith(options.comment.toString)).as[String] + } else { +nonEmptyLines.as[String] + } } +// Note that unlike actual CSV reading path, it simply filters the given skipped lines. +// Therefore, this skips the line same with the skipped lines if exists. +val linesToSkip = commentFilteredLines.head(options.skipLines) +commentFilteredLines.rdd + .mapPartitions { iter => +iter.filterNot(linesToSkip.contains(_)) Review Comment: `contains` should be enough. Moreover, the following should work too: ```
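The `skipLines` semantics under review can be sketched in plain Python (the function name and signature are illustrative; the real implementation operates on Spark Datasets, not iterators): blank lines and comment lines are filtered first, then the next `skip_lines` remaining lines are dropped before parsing begins.

```python
def filter_unwanted_lines(lines, skip_lines=0, comment=None):
    """Drop blank lines, comment lines, then the first `skip_lines`
    remaining lines -- mirroring the behavior discussed above."""
    skipped = 0
    for line in lines:
        if not line.strip():
            continue  # blank line: never counts toward skip_lines
        if comment is not None and line.startswith(comment):
            continue  # comment line: never counts toward skip_lines
        if skipped < skip_lines:
            skipped += 1  # consume one of the lines to be skipped
            continue
        yield line

# Two junk lines precede the header; comments and blanks don't count.
rows = ["# generated file", "", "junk1", "junk2", "a,b,c", "1,2,3"]
print(list(filter_unwanted_lines(rows, skip_lines=2, comment="#")))
# ['a,b,c', '1,2,3']
```

Note this also illustrates the caveat in the code comment above: skipping is positional, so the schema-inference path that filters by line *content* can drop extra lines that happen to equal a skipped line.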
[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40474: [SPARK-42849] [WIP] [SQL] Session Variables
jaceklaskowski commented on code in PR #40474: URL: https://github.com/apache/spark/pull/40474#discussion_r1149130887 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala: ## @@ -582,9 +583,29 @@ class SparkSqlAstBuilder extends AstBuilder { } override def visitSetVariable(ctx: SetVariableContext): LogicalPlan = withOrigin(ctx) { -val variableName = visitMultipartIdentifier(ctx.multipartIdentifier) -val expression = ctx.expression().toString -SetVariableCommand(VariableIdentifier(variableName.last), expression) +if (ctx.assignmentList().isEmpty) { + SetVariableCommand(Seq(VariableIdentifier("var")), "7") Review Comment: 7?! What does this "magic number" represent? ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala: ## @@ -582,9 +583,29 @@ class SparkSqlAstBuilder extends AstBuilder { } override def visitSetVariable(ctx: SetVariableContext): LogicalPlan = withOrigin(ctx) { -val variableName = visitMultipartIdentifier(ctx.multipartIdentifier) -val expression = ctx.expression().toString -SetVariableCommand(VariableIdentifier(variableName.last), expression) +if (ctx.assignmentList().isEmpty) { + SetVariableCommand(Seq(VariableIdentifier("var")), "7") +} else { + val assignCtx = ctx.assignmentList() + val varList = assignCtx.assignment().asScala.map { +assign => { Review Comment: nit: no need for `{`. I'd also move `assign` to the line above (as more Scala-idiomatic IMHO) ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/variables.scala: ## @@ -119,13 +116,45 @@ case class DropVariableCommand( /** * This command is for setting a SQL variable. 
*/ -case class SetVariableCommand(identifier: VariableIdentifier, - newExpr: String) extends LeafRunnableCommand { +case class SetVariableCommand(identifierList: Seq[VariableIdentifier], + newQueryStr: String) extends LeafRunnableCommand { override def run(sparkSession: SparkSession): Seq[Row] = { +val parser = sparkSession.sessionState.sqlParser +val analyzer = sparkSession.sessionState.analyzer val catalog = sparkSession.sessionState.catalog -assert(identifier.database.isEmpty) -val varInfo = catalog.getTempVariable(identifier.variableName) +val varInfoList = identifierList.collect { case identifier: VariableIdentifier => Review Comment: nit: `varInfos` ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala: ## @@ -582,9 +583,29 @@ class SparkSqlAstBuilder extends AstBuilder { } override def visitSetVariable(ctx: SetVariableContext): LogicalPlan = withOrigin(ctx) { -val variableName = visitMultipartIdentifier(ctx.multipartIdentifier) -val expression = ctx.expression().toString -SetVariableCommand(VariableIdentifier(variableName.last), expression) +if (ctx.assignmentList().isEmpty) { + SetVariableCommand(Seq(VariableIdentifier("var")), "7") +} else { + val assignCtx = ctx.assignmentList() + val varList = assignCtx.assignment().asScala.map { +assign => { + val varname = visitMultipartIdentifier(assign.key) + VariableIdentifier(varname.head) +} + } + + val assignments = assignCtx.assignment() + val exprStrList = assignments.asScala.map { +assign => { Review Comment: nit: no need for `{`. I'd also move assign to the line above (as more Scala-idiomatic IMHO) ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/variables.scala: ## @@ -119,13 +116,45 @@ case class DropVariableCommand( /** * This command is for setting a SQL variable. 
*/ -case class SetVariableCommand(identifier: VariableIdentifier, - newExpr: String) extends LeafRunnableCommand { +case class SetVariableCommand(identifierList: Seq[VariableIdentifier], + newQueryStr: String) extends LeafRunnableCommand { override def run(sparkSession: SparkSession): Seq[Row] = { +val parser = sparkSession.sessionState.sqlParser +val analyzer = sparkSession.sessionState.analyzer val catalog = sparkSession.sessionState.catalog -assert(identifier.database.isEmpty) -val varInfo = catalog.getTempVariable(identifier.variableName) +val varInfoList = identifierList.collect { case identifier: VariableIdentifier => + assert(identifier.database.isEmpty) + val varInfo = catalog.getTempVariable(identifier.variableName) + if (varInfo.isEmpty) { +throw new NoSuchTempVariableException(identifier.variableName) + } + (identifier.variableName, varInfo.get._2) +} + +val varNames = varInfoList.collect { case variable => variable._1 } +val casts = varInfoList.collect { Review Comment: nit: `map` (not `collect`)
[GitHub] [spark] Hisoka-X opened a new pull request, #40564: [SPARK-42519] [Test] [Connect] Add more WriteTo tests after Scala Client session config is supported
Hisoka-X opened a new pull request, #40564: URL: https://github.com/apache/spark/pull/40564 ### What changes were proposed in this pull request? Add more WriteTo tests for Spark Connect Client ### Why are the changes needed? Improve test cases, remove some TODOs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Yes
[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.12.4
jaceklaskowski commented on code in PR #40555: URL: https://github.com/apache/spark/pull/40555#discussion_r1149120335 ## project/SparkBuild.scala: ## @@ -307,7 +307,9 @@ object SparkBuild extends PomBuild { DefaultMavenRepository, Resolver.mavenLocal, Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns) -), +) ++ Seq( + "staging-releases-mirror" at "https://repository.apache.org/content/repositories/staging/", Review Comment: Why is this required? Looks like a temporary change to me (to test out the change before parquet is widely available in the public repo). ## pom.xml: ## @@ -325,6 +325,17 @@ + Review Comment: Why is this required? ## pom.xml: ## @@ -2361,7 +2372,7 @@ ${hive.group} hive-service-rpc - + Review Comment: Rather than "upgrading" the version in the comment (that does not really say much) I'd reword it to the following (and get rid of the version): ``` parquet-hadoop-bundle:1.8.1 conflict with [the name of the artifact it conflicts with] ```