Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-04-12 Thread via GitHub
TakawaAkirayo commented on PR #45367: URL: https://github.com/apache/spark/pull/45367#issuecomment-2053512982 @mridulm @beliefer @LuciferYang Thanks for your review and guidance to improve the PR :-) -- This is an automated message from the Apache Git Service. To respond to the message,

[PR] [WIP] Upgrade postgresql driver [spark]

2024-04-12 Thread via GitHub
panbingkun opened a new pull request, #46038: URL: https://github.com/apache/spark/pull/46038 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

Re: [PR] [SPARK-44444][SQL] Enabled ANSI mode by default [spark]

2024-04-12 Thread via GitHub
mridulm commented on PR #46013: URL: https://github.com/apache/spark/pull/46013#issuecomment-2053492527 +CC @shardulm94

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-04-12 Thread via GitHub
mridulm commented on PR #45367: URL: https://github.com/apache/spark/pull/45367#issuecomment-2053492389 I have updated the description, and merged to master. Thanks for fixing this @TakawaAkirayo ! Thanks for the review @beliefer and @LuciferYang :-)

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-04-12 Thread via GitHub
mridulm closed pull request #45367: [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue URL: https://github.com/apache/spark/pull/45367

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-04-12 Thread via GitHub
mridulm commented on PR #45367: URL: https://github.com/apache/spark/pull/45367#issuecomment-2053490996 The test failures are unrelated to this PR.

Re: [PR] [SPARK-47812][CONNECT] Support Serialization of SparkSession for ForEachBatch worker [spark]

2024-04-12 Thread via GitHub
grundprinzip commented on PR #46002: URL: https://github.com/apache/spark/pull/46002#issuecomment-2053489832 Thank you @HyukjinKwon

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
ericm-db commented on PR #45932: URL: https://github.com/apache/spark/pull/45932#issuecomment-2053489012 @HeartSaVioR PTAL, thanks!

Re: [PR] [SPARK-47603][KUBERNETES][YARN] Resource managers: Migrate logWarn with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
panbingkun commented on PR #45957: URL: https://github.com/apache/spark/pull/45957#issuecomment-2053481060 > @panbingkun Thanks for the work. LGTM except for two comments. Updated, done.

Re: [PR] [SPARK-47603][KUBERNETES][YARN] Resource managers: Migrate logWarn with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
panbingkun commented on code in PR #45957: URL: https://github.com/apache/spark/pull/45957#discussion_r1563691923 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -210,10 +211,10 @@ class

Re: [PR] [WIP][SPARK-47757][SPARK-47756][CONNECT][PYTHON][TESTS] Make testing Spark Connect server having pyspark.core [spark]

2024-04-12 Thread via GitHub
HyukjinKwon commented on PR #46036: URL: https://github.com/apache/spark/pull/46036#issuecomment-2053258705 https://github.com/HyukjinKwon/spark/actions/runs/8670824482/job/23779286417

[PR] Spark 47233 client side listener 2 [spark]

2024-04-12 Thread via GitHub
WweiL opened a new pull request, #46037: URL: https://github.com/apache/spark/pull/46037 ### What changes were proposed in this pull request? Server and client side for the client side listener. The client should start by sending an `add_listener_bus_listener` RPC for the

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-04-12 Thread via GitHub
TakawaAkirayo commented on code in PR #45367: URL: https://github.com/apache/spark/pull/45367#discussion_r1563578933 ## core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala: ## @@ -176,6 +176,56 @@ class SparkListenerSuite extends SparkFunSuite with

Re: [PR] [SPARK-47812][CONNECT] Support Serialization of SparkSession for ForEachBatch worker [spark]

2024-04-12 Thread via GitHub
HyukjinKwon commented on code in PR #46002: URL: https://github.com/apache/spark/pull/46002#discussion_r1563534141 ## python/pyspark/sql/connect/streaming/readwriter.py: ## @@ -557,7 +557,7 @@ def foreach(self, f: Union[Callable[[Row], None], "SupportsProcess"]) -> "DataSt

Re: [PR] [SPARK-47812][CONNECT] Support Serialization of SparkSession for ForEachBatch worker [spark]

2024-04-12 Thread via GitHub
HyukjinKwon commented on code in PR #46002: URL: https://github.com/apache/spark/pull/46002#discussion_r1563534141 ## python/pyspark/sql/connect/streaming/readwriter.py: ## @@ -557,7 +557,7 @@ def foreach(self, f: Union[Callable[[Row], None], "SupportsProcess"]) -> "DataSt

Re: [PR] [SPARK-47812][CONNECT] Support Serialization of SparkSession for ForEachBatch worker [spark]

2024-04-12 Thread via GitHub
HyukjinKwon closed pull request #46002: [SPARK-47812][CONNECT] Support Serialization of SparkSession for ForEachBatch worker URL: https://github.com/apache/spark/pull/46002

Re: [PR] [SPARK-47812][CONNECT] Support Serialization of SparkSession for ForEachBatch worker [spark]

2024-04-12 Thread via GitHub
HyukjinKwon commented on PR #46002: URL: https://github.com/apache/spark/pull/46002#issuecomment-2052885568 Merged to master.

Re: [PR] [WIP][SPARK-47757][SPARK-47756][CONNECT][PYTHON][TESTS] Make testing Spark Connect server having pyspark.core [spark]

2024-04-12 Thread via GitHub
HyukjinKwon commented on PR #46036: URL: https://github.com/apache/spark/pull/46036#issuecomment-2052879562 https://github.com/HyukjinKwon/spark/actions/runs/8670017978/job/23777524704

[PR] [WIP][SPARK-47757][SPARK-47756][CONNECT][PYTHON][TESTS] Make testing Spark Connect server having pyspark.core [spark]

2024-04-12 Thread via GitHub
HyukjinKwon opened a new pull request, #46036: URL: https://github.com/apache/spark/pull/46036 ### What changes were proposed in this pull request? This PR proposes to testing PySpark Connect server to have `pyspark.core` package by running Python workers once (and they will be

Re: [PR] [MINOR][PYTHON] Enable parity test `test_different_group_key_cardinality` [spark]

2024-04-12 Thread via GitHub
HyukjinKwon commented on PR #46032: URL: https://github.com/apache/spark/pull/46032#issuecomment-2052780662 Merged to master.

Re: [PR] [MINOR][PYTHON] Enable parity test `test_different_group_key_cardinality` [spark]

2024-04-12 Thread via GitHub
HyukjinKwon closed pull request #46032: [MINOR][PYTHON] Enable parity test `test_different_group_key_cardinality` URL: https://github.com/apache/spark/pull/46032

Re: [PR] [SPARK-46574][BUILD] Upgrade maven plugin to latest version [spark]

2024-04-12 Thread via GitHub
github-actions[bot] commented on PR #43908: URL: https://github.com/apache/spark/pull/43908#issuecomment-205272 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

Re: [PR] [SPARK-46477][SQL] Add bucket info to SD in toHivePartition [spark]

2024-04-12 Thread via GitHub
github-actions[bot] closed pull request #1: [SPARK-46477][SQL] Add bucket info to SD in toHivePartition URL: https://github.com/apache/spark/pull/1

Re: [PR] [SPARK-39800][SQL][WIP] DataSourceV2: View Support [spark]

2024-04-12 Thread via GitHub
github-actions[bot] closed pull request #44197: [SPARK-39800][SQL][WIP] DataSourceV2: View Support URL: https://github.com/apache/spark/pull/44197

Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

2024-04-12 Thread via GitHub
github-actions[bot] commented on PR #44546: URL: https://github.com/apache/spark/pull/44546#issuecomment-2052721068 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

Re: [PR] [SPARK-46566][SQL] Session level config was not loaded when isolation is enable. [spark]

2024-04-12 Thread via GitHub
github-actions[bot] commented on PR #44572: URL: https://github.com/apache/spark/pull/44572#issuecomment-2052721053 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

Re: [PR] [SPARK-47591][SQL] Hive-thriftserver: Migrate logInfo with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on PR #45926: URL: https://github.com/apache/spark/pull/45926#issuecomment-2052711109 @itholic The Hive-thriftserver tests failed. Please check it.

Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on code in PR #45923: URL: https://github.com/apache/spark/pull/45923#discussion_r1563384223 ## sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala: ## @@ -218,7 +232,9 @@ private[thriftserver]

Re: [PR] [SPARK-47603][KUBERNETES][YARN] Resource managers: Migrate logWarn with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on PR #45957: URL: https://github.com/apache/spark/pull/45957#issuecomment-2052707701 @panbingkun Thanks for the work. LGTM except for two comments.

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
fanyue-xia commented on code in PR #45971: URL: https://github.com/apache/spark/pull/45971#discussion_r1563379719 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala: ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47603][KUBERNETES][YARN] Resource managers: Migrate logWarn with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on code in PR #45957: URL: https://github.com/apache/spark/pull/45957#discussion_r1563377336 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -282,7 +283,7 @@ class

Re: [PR] [SPARK-47603][KUBERNETES][YARN] Resource managers: Migrate logWarn with variables to structured logging framework [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on code in PR #45957: URL: https://github.com/apache/spark/pull/45957#discussion_r1563376076 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -210,10 +211,10 @@ class

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
fanyue-xia commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052704919 > > the seed might behave differently across runs/on different machines > > Ah I see, this indeed makes sense. > > In this case, I think we should fix the generator of

Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-12 Thread via GitHub
sahnib commented on PR #46035: URL: https://github.com/apache/spark/pull/46035#issuecomment-2052704693 @HeartSaVioR PTAL.

[PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-12 Thread via GitHub
sahnib opened a new pull request, #46035: URL: https://github.com/apache/spark/pull/46035 ### What changes were proposed in this pull request? Streaming queries with Union of 2 data streams followed by an Aggregate (groupBy) can produce incorrect results if the grouping
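[A toy, non-Spark illustration of the hazard the PR description alludes to, using made-up data: if an optimizer treats a per-branch constant as globally foldable and propagates it across a union, a downstream group-by collapses rows from different branches. This is only a sketch of the failure mode, not the actual optimizer rule.]

```python
from collections import defaultdict

# Hypothetical rows: each union branch tags its rows with a literal source column.
branch_a = [("a", 1), ("a", 2)]
branch_b = [("b", 3)]
union = branch_a + branch_b

def group_sums(rows):
    # Group-by-key aggregation: sum values per key.
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)

# Correct plan: group by the actual per-row key.
correct = group_sums(union)  # {"a": 3, "b": 3}

# Buggy plan: foldable propagation replaced the key with branch A's literal
# everywhere, so branch B's row is mis-grouped under "a".
folded = group_sums([("a", value) for _, value in union])  # {"a": 6}
```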

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563370002 ## python/pyspark/sql/worker/plan_data_source_read.py: ## @@ -51,6 +52,71 @@ ) +def records_to_arrow_batches( +output_iter: Iterator[Tuple], +

Re: [PR] Operator 1.0.0-alpha [spark-kubernetes-operator]

2024-04-12 Thread via GitHub
margorczynski commented on PR #2: URL: https://github.com/apache/spark-kubernetes-operator/pull/2#issuecomment-2052702949 Hey, great stuff. Could you tell me, if you find a moment, how this relates to the https://github.com/kubeflow/spark-operator/tree/master operator? Is the approach

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
fanyue-xia commented on code in PR #45971: URL: https://github.com/apache/spark/pull/45971#discussion_r1563367019 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala: ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
fanyue-xia commented on code in PR #45971: URL: https://github.com/apache/spark/pull/45971#discussion_r1563365767 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala: ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-12 Thread via GitHub
kelvinjian-db commented on PR #46034: URL: https://github.com/apache/spark/pull/46034#issuecomment-2052701361 cc @cloud-fan @jchen5

[PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-12 Thread via GitHub
kelvinjian-db opened a new pull request, #46034: URL: https://github.com/apache/spark/pull/46034 ### What changes were proposed in this pull request? - Fixes a bug where `RewriteWithExpression` can rewrite an `Aggregate` into an invalid one. The fix is done by separating

Re: [PR] [SPARK-44444][SQL] Enabled ANSI mode by default [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on PR #46013: URL: https://github.com/apache/spark/pull/46013#issuecomment-2052657458 @yaooqinn enabling the ANSI SQL mode does address certain unreasonable SQL behaviors, such as integer overflow and division by zero, which could potentially disrupt users'
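[A plain-Python sketch of the behavior change under discussion, not Spark's implementation: with ANSI mode off, a division by zero yields NULL, while ANSI mode raises an error. The function name and error string here are illustrative only.]

```python
def ansi_divide(a, b, ansi_mode=True):
    # Sketch of the semantic difference only; in Spark the behavior is
    # governed by the spark.sql.ansi.enabled configuration.
    if b == 0:
        if ansi_mode:
            raise ArithmeticError("DIVIDE_BY_ZERO")
        return None  # legacy (non-ANSI) mode returns NULL
    return a / b
```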

Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1563258451 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -1609,6 +1609,19 @@ object SQLConf {

Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1563257890 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -1609,6 +1609,19 @@ object SQLConf {

Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-12 Thread via GitHub
gengliangwang commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1563255189 ## sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala: ## @@ -204,6 +215,8 @@ class CacheManager extends Logging with

Re: [PR] [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue [spark]

2024-04-12 Thread via GitHub
mridulm commented on code in PR #45367: URL: https://github.com/apache/spark/pull/45367#discussion_r1563152817 ## core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala: ## @@ -176,6 +176,56 @@ class SparkListenerSuite extends SparkFunSuite with

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563133283 ## python/pyspark/sql/datasource.py: ## @@ -469,6 +501,192 @@ def stop(self) -> None: ... +class SimpleInputPartition(InputPartition): +def

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563131511 ## python/pyspark/sql/datasource.py: ## @@ -469,6 +501,192 @@ def stop(self) -> None: ... +class SimpleInputPartition(InputPartition): +def

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563131182 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala: ## @@ -164,7 +175,20 @@ class PythonStreamingSourceRunner(

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563130456 ## python/pyspark/sql/streaming/python_streaming_source_runner.py: ## @@ -76,6 +97,19 @@ def commit_func(reader: DataSourceStreamReader, infile: IO, outfile:

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563129813 ## python/pyspark/sql/datasource.py: ## @@ -469,6 +501,192 @@ def stop(self) -> None: ... +class SimpleInputPartition(InputPartition): +def

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
WweiL commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052419964 > the seed might behave differently across runs/on different machines Ah I see, this indeed makes sense. In this case, I think we should fix the generator of rows. It's okay
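[One common way around the concern raised above (seeded generation differing across environments) is to persist the generated rows once and verify a byte-stable digest, rather than regenerating from a random seed at test time. A minimal sketch with hypothetical rows; this is not the suite's actual mechanism.]

```python
import hashlib
import json

# Hypothetical fixed test rows, stored once (e.g. in a golden file)
# instead of being regenerated from a random seed on every run.
rows = [[i, i * 2] for i in range(5)]

def stable_digest(data):
    # Compact, key-sorted JSON gives a byte-stable payload, so the
    # sha256 digest is identical across machines and runs.
    payload = json.dumps(data, separators=(",", ":"), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

digest = stable_digest(rows)
```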

Re: [PR] [BACKPORT][SPARK-42369][CORE] Fix constructor for java.nio.DirectByteBuffer (#39909) [spark]

2024-04-12 Thread via GitHub
ayinresh closed pull request #46033: [BACKPORT][SPARK-42369][CORE] Fix constructor for java.nio.DirectByteBuffer (#39909) URL: https://github.com/apache/spark/pull/46033

[PR] [BACKPORT][SPARK-42369][CORE] Fix constructor for java.nio.DirectByteBuffer (#39909) [spark]

2024-04-12 Thread via GitHub
ayinresh opened a new pull request, #46033: URL: https://github.com/apache/spark/pull/46033 ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/39909 to 3.4. ### Why are the changes needed? It's required to

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
anishshri-db commented on code in PR #45932: URL: https://github.com/apache/spark/pull/45932#discussion_r1563064461 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithValueStateTTLSuite.scala: ## @@ -171,203 +160,15 @@ case class

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
anishshri-db commented on code in PR #45932: URL: https://github.com/apache/spark/pull/45932#discussion_r1563063876 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithListStateTTLSuite.scala: ## @@ -0,0 +1,349 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
anishshri-db commented on code in PR #45932: URL: https://github.com/apache/spark/pull/45932#discussion_r1563050397 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ListStateImplWithTTL.scala: ## @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-44444][SQL] Enabled ANSI mode by default [spark]

2024-04-12 Thread via GitHub
amaliujia commented on PR #46013: URL: https://github.com/apache/spark/pull/46013#issuecomment-2052308167 Will this be included into any maintenance release that is cut before Spark 4? I assume no?

Re: [PR] [SPARK-47318][CORE][3.5] Adds HKDF round to AuthEngine key derivation to follow standard KEX practices [spark]

2024-04-12 Thread via GitHub
mridulm commented on PR #46014: URL: https://github.com/apache/spark/pull/46014#issuecomment-2052306349 Sounds good to me for docs @dongjoon-hyun - we will have to forward port that to master as well. And I guess we leave the config as 4.0 in code ?

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563028452 ## python/pyspark/sql/datasource.py: ## @@ -469,6 +501,192 @@ def stop(self) -> None: ... +class SimpleInputPartition(InputPartition): +def

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
chaoqin-li1123 commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1563028108 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala: ## @@ -199,4 +223,30 @@ class PythonStreamingSourceRunner(

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
fanyue-xia commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052224756 > Thanks for the effort! This really requires some deep understanding of spark internals... > > There is still one important concern, that the golden file size is too big. I

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
WweiL commented on code in PR #45971: URL: https://github.com/apache/spark/pull/45971#discussion_r1562953554 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala: ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
fanyue-xia commented on code in PR #45971: URL: https://github.com/apache/spark/pull/45971#discussion_r1562948068 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala: ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
WweiL commented on code in PR #45971: URL: https://github.com/apache/spark/pull/45971#discussion_r1562849582 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala: ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software

Re: [PR] [SPARK-47788] [Structured Streaming] Ensure the same hash partitioning for streaming stateful ops [spark]

2024-04-12 Thread via GitHub
WweiL commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052183431 Thanks for the effort! This really requires some deep understanding of spark internals... There is still one important concern, that the golden file size is too big. I looked a bit,

Re: [PR] [SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source [spark]

2024-04-12 Thread via GitHub
sahnib commented on code in PR #45977: URL: https://github.com/apache/spark/pull/45977#discussion_r1562866326 ## python/pyspark/sql/datasource.py: ## @@ -469,6 +501,192 @@ def stop(self) -> None: ... +class SimpleInputPartition(InputPartition): +def

Re: [PR] [SPARK-47318][CORE][3.5] Adds HKDF round to AuthEngine key derivation to follow standard KEX practices [spark]

2024-04-12 Thread via GitHub
dongjoon-hyun commented on PR #46014: URL: https://github.com/apache/spark/pull/46014#issuecomment-2052103818 We can follow the Apache Spark Security page convention. - https://spark.apache.org/security.html > 3.2.2, or 3.3.1 or later In this case, maybe, `3.4.3, or 3.5.2 or

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
ericm-db commented on code in PR #45932: URL: https://github.com/apache/spark/pull/45932#discussion_r1562819807 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithValueStateTTLSuite.scala: ## @@ -399,7 +205,7 @@ class TransformWithValueStateTTLSuite

Re: [PR] [SPARK-46935][DOCS] Consolidate error documentation [spark]

2024-04-12 Thread via GitHub
nchammas commented on PR #44971: URL: https://github.com/apache/spark/pull/44971#issuecomment-2052098166 > The discussion on [SPARK-46810](https://issues.apache.org/jira/browse/SPARK-46810) is not moving forward, unfortunately. This discussion has since been resolved in favor of

Re: [PR] [SPARK-47825][DSTREAMS][3.5] Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated [spark]

2024-04-12 Thread via GitHub
dongjoon-hyun commented on code in PR #46019: URL: https://github.com/apache/spark/pull/46019#discussion_r1562817518 ## core/src/main/scala/org/apache/spark/api/python/WriteInputFormatTestDataGenerator.scala: ## @@ -104,6 +105,7 @@ private[python] class

Re: [PR] [SPARK-47825][DSTREAMS][3.5] Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated [spark]

2024-04-12 Thread via GitHub
dongjoon-hyun commented on code in PR #46019: URL: https://github.com/apache/spark/pull/46019#discussion_r1562817518 ## core/src/main/scala/org/apache/spark/api/python/WriteInputFormatTestDataGenerator.scala: ## @@ -104,6 +105,7 @@ private[python] class

Re: [PR] [SPARK-47825][DSTREAMS][3.5] Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated [spark]

2024-04-12 Thread via GitHub
dongjoon-hyun commented on code in PR #46019: URL: https://github.com/apache/spark/pull/46019#discussion_r1562814727 ## core/src/main/scala/org/apache/spark/api/python/WriteInputFormatTestDataGenerator.scala: ## @@ -104,6 +105,7 @@ private[python] class

Re: [PR] [SPARK-47825][DSTREAMS][3.5] Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated [spark]

2024-04-12 Thread via GitHub
dongjoon-hyun commented on code in PR #46019: URL: https://github.com/apache/spark/pull/46019#discussion_r1562814727 ## core/src/main/scala/org/apache/spark/api/python/WriteInputFormatTestDataGenerator.scala: ## @@ -104,6 +105,7 @@ private[python] class

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
anishshri-db commented on code in PR #45932: URL: https://github.com/apache/spark/pull/45932#discussion_r1562801897 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ListStateImplWithTTL.scala: ## @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software

Re: [PR] Operator 1.0.0-alpha [spark-kubernetes-operator]

2024-04-12 Thread via GitHub
csviri commented on code in PR #2: URL: https://github.com/apache/spark-kubernetes-operator/pull/2#discussion_r1562687820 ## spark-operator/src/main/java/org/apache/spark/kubernetes/operator/health/SentinelManager.java: ## @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache

Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-12 Thread via GitHub
sahnib commented on code in PR #45932: URL: https://github.com/apache/spark/pull/45932#discussion_r1562685312 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ListStateImplWithTTL.scala: ## @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] [SPARK-47821][SQL] Implement is_variant_null expression [spark]

2024-04-12 Thread via GitHub
harshmotw-db commented on code in PR #46011: URL: https://github.com/apache/spark/pull/46011#discussion_r1562704219 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala: ## @@ -41,4 +41,15 @@ object

Re: [PR] [SPARK-47822][SQL] Prohibit Hash Expressions from hashing the Variant Data Type [spark]

2024-04-12 Thread via GitHub
harshmotw-db commented on PR #46017: URL: https://github.com/apache/spark/pull/46017#issuecomment-2051963610 > Do we support variant as the join keys? HashJoin will hash the join key values as well. We currently do not since we haven't implemented the `=` operator on variant.

Re: [PR] [SPARK-47829] Text Datasource supports Zstd compression codec [spark]

2024-04-12 Thread via GitHub
yaooqinn commented on PR #46026: URL: https://github.com/apache/spark/pull/46026#issuecomment-2051928714 Is the Hadoop native zstd library still missing?

Re: [PR] [SPARK-47357][SQL] Add support for Upper, Lower, InitCap (all collations) [spark]

2024-04-12 Thread via GitHub
cloud-fan commented on code in PR #46008: URL: https://github.com/apache/spark/pull/46008#discussion_r1562653009 ## sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala: ## @@ -163,6 +165,45 @@ class CollationStringExpressionsSuite }) } +

[PR] [MINOR][PYTHON] Enable parity test `test_different_group_key_cardinality` [spark]

2024-04-12 Thread via GitHub
zhengruifeng opened a new pull request, #46032: URL: https://github.com/apache/spark/pull/46032 ### What changes were proposed in this pull request? Enable parity test `test_different_group_key_cardinality` by triggering the analysis ### Why are the changes needed? for test

Re: [PR] [SPARK-47765][SQL] Add SET COLLATION to parser rules [spark]

2024-04-12 Thread via GitHub
cloud-fan closed pull request #45946: [SPARK-47765][SQL] Add SET COLLATION to parser rules URL: https://github.com/apache/spark/pull/45946

Re: [PR] [SPARK-47765][SQL] Add SET COLLATION to parser rules [spark]

2024-04-12 Thread via GitHub
cloud-fan commented on PR #45946: URL: https://github.com/apache/spark/pull/45946#issuecomment-2051862860 thanks, merging to master!

Re: [PR] [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [spark]

2024-04-12 Thread via GitHub
zhengruifeng commented on code in PR #46012: URL: https://github.com/apache/spark/pull/46012#discussion_r1562547428 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala: ## @@ -381,6 +405,53 @@ case class SessionHolder(userId:

Re: [PR] [SPARK-29336][SQL] Fix the implementation of QuantileSummaries.merge (guarantee that the relativeError will be respected) [spark]

2024-04-12 Thread via GitHub
tanelk commented on PR #26029: URL: https://github.com/apache/spark/pull/26029#issuecomment-2051754913 Hello, I know this is an "ancient" PR, but it seems like it caused a severe performance regression. I made a jira issue with it https://issues.apache.org/jira/browse/SPARK-47836 I'll

[PR] Refactor of collation aware string functions [spark]

2024-04-12 Thread via GitHub
stefankandic opened a new pull request, #46031: URL: https://github.com/apache/spark/pull/46031 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[PR] [SPARK-47835][SHUFFLE] Remove switch for remoteReadNioBufferConversion [spark]

2024-04-12 Thread via GitHub
pan3793 opened a new pull request, #46030: URL: https://github.com/apache/spark/pull/46030 ### What changes were proposed in this pull request? This PR logically reverts https://github.com/apache/spark/commit/2c82745686f4456c4d5c84040a431dcb5b6cb60b, to allow disable

Re: [PR] [SPARK-47819][CONNECT] Use asynchronous callback for execution cleanup [spark]

2024-04-12 Thread via GitHub
hvanhovell commented on PR #46027: URL: https://github.com/apache/spark/pull/46027#issuecomment-2051701249 @vicennial @xi-db should we also fix this in 3.5?

Re: [PR] [SPARK-47819][CONNECT] Use asynchronous callback for execution cleanup [spark]

2024-04-12 Thread via GitHub
hvanhovell closed pull request #46027: [SPARK-47819][CONNECT] Use asynchronous callback for execution cleanup URL: https://github.com/apache/spark/pull/46027

Re: [PR] [SPARK-47357][SQL] Add support for Upper, Lower, InitCap (all collations) [spark]

2024-04-12 Thread via GitHub
uros-db commented on code in PR #46008: URL: https://github.com/apache/spark/pull/46008#discussion_r1562392631 ## sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala: ## @@ -163,6 +165,45 @@ class CollationStringExpressionsSuite }) } +

Re: [PR] [SPARK-47357][SQL] Add support for Upper, Lower, InitCap (all collations) [spark]

2024-04-12 Thread via GitHub
mihailom-db commented on code in PR #46008: URL: https://github.com/apache/spark/pull/46008#discussion_r1562381738 ## sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala: ## @@ -163,6 +165,45 @@ class CollationStringExpressionsSuite }) }

Re: [PR] [SPARK-47357][SQL] Add support for Upper, Lower, InitCap (all collations) [spark]

2024-04-12 Thread via GitHub
mihailom-db commented on code in PR #46008: URL: https://github.com/apache/spark/pull/46008#discussion_r1562379501 ## sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala: ## @@ -89,6 +89,45 @@ class CollationStringExpressionsSuite

Re: [PR] [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary [spark]

2024-04-12 Thread via GitHub
pan3793 commented on code in PR #25899: URL: https://github.com/apache/spark/pull/25899#discussion_r1562356543 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala: ## @@ -734,30 +735,52 @@ object DataSource extends Logging { * Checks and

[PR] [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits` [spark]

2024-04-12 Thread via GitHub
LuciferYang opened a new pull request, #46029: URL: https://github.com/apache/spark/pull/46029 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

Re: [PR] [SPARK-47833][SQL][CORE] Supply caller stackstrace for checkAndGlobPathIfNecessary AnalysisException [spark]

2024-04-12 Thread via GitHub
pan3793 commented on PR #46028: URL: https://github.com/apache/spark/pull/46028#issuecomment-2051465166 cc @srowen @viirya @dongjoon-hyun @yaooqinn

[PR] [SPARK-47833][SQL][CORE] Supply caller stackstrace for checkAndGlobPathIfNecessary AnalysisException [spark]

2024-04-12 Thread via GitHub
pan3793 opened a new pull request, #46028: URL: https://github.com/apache/spark/pull/46028 ### What changes were proposed in this pull request? SPARK-29089 parallelized `checkAndGlobPathIfNecessary` by leveraging fork join pools, it also introduced a side effect, the reported
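The stack-trace side effect described here is generic to work handed off to a pool: an exception raised on a fork-join worker carries only that worker's frames, so the user's call site disappears from the report. One common remedy, sketched below in plain Java with illustrative names (this is not Spark's actual fix), is to capture the caller's stack trace before submitting and attach it to any failure:

```java
import java.util.concurrent.*;

public class CallerStackTrace {
    // Runs the task on the pool; if it fails, rethrows with the caller's
    // stack trace (captured before submission) attached as suppressed.
    public static <T> T callWithCallerTrace(ExecutorService pool, Callable<T> task)
            throws Exception {
        Exception callerSite = new Exception("submitted from");  // captures caller frames now
        try {
            return pool.submit(task).get();
        } catch (ExecutionException e) {
            RuntimeException wrapped =
                new RuntimeException(e.getCause().getMessage(), e.getCause());
            wrapped.addSuppressed(callerSite);  // keep the submitting thread's frames
            throw wrapped;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            callWithCallerTrace(pool, () -> { throw new IllegalStateException("bad path"); });
        } catch (RuntimeException e) {
            // the suppressed exception records where the work was submitted
            System.out.println(e.getSuppressed().length);  // 1
        } finally {
            pool.shutdown();
        }
    }
}
```

Attaching the capture as a suppressed exception (rather than splicing frames) keeps both the worker's trace and the caller's trace intact in the printed report.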

Re: [PR] [SPARK-47819][CONNECT] Use asynchronous callback for execution cleanup [spark]

2024-04-12 Thread via GitHub
vicennial commented on PR #46027: URL: https://github.com/apache/spark/pull/46027#issuecomment-2051417438 cc @HyukjinKwon @hvanhovell
