[GitHub] [spark] WweiL commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Add DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1193358986 ## connector/connect/common/src/main/protobuf/spark/connect/commands.proto: ## @@ -209,6 +209,15 @@ message WriteStreamOperationStart { string path = 11;

[GitHub] [spark] LuciferYang commented on pull request #41122: [SPARK-43436][BUILD] Upgrade rocksdbjni to 8.1.1.1

2023-05-15 Thread via GitHub
LuciferYang commented on PR #41122: URL: https://github.com/apache/spark/pull/41122#issuecomment-1547239087 @anishshri-db I need to update the results of `StateStoreBasicOperationsBenchmark` when upgrading rocksdbjni; we can run `StateStoreBasicOperationsBenchmark` with GA as follows:

[GitHub] [spark] rangadi commented on a diff in pull request #41039: [SPARK-43360][SS][CONNECT] Scala client StreamingQueryManager

2023-05-15 Thread via GitHub
rangadi commented on code in PR #41039: URL: https://github.com/apache/spark/pull/41039#discussion_r1193360679 ## connector/connect/common/src/main/protobuf/spark/connect/commands.proto: ## @@ -360,7 +360,7 @@ message StreamingQueryManagerCommandResult { // (Required) The

[GitHub] [spark] anishshri-db commented on pull request #41122: [SPARK-43436][BUILD] Upgrade rocksdbjni to 8.1.1.1

2023-05-15 Thread via GitHub
anishshri-db commented on PR #41122: URL: https://github.com/apache/spark/pull/41122#issuecomment-1547242180 @LuciferYang - ok cool thanks. I'll update the b/mark and try running this workflow. -- This is an automated message from the Apache Git Service. To respond to the message, please

[GitHub] [spark] itholic commented on a diff in pull request #41149: [SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame

2023-05-15 Thread via GitHub
itholic commented on code in PR #41149: URL: https://github.com/apache/spark/pull/41149#discussion_r1193369528 ## python/pyspark/sql/pandas/serializers.py: ## @@ -186,6 +186,65 @@ def arrow_to_pandas(self, arrow_column): else: return s +def

[GitHub] [spark] panbingkun commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

2023-05-15 Thread via GitHub
panbingkun commented on PR #41169: URL: https://github.com/apache/spark/pull/41169#issuecomment-1547342279 cc @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #41150: [SPARK-43482][SS] Expand QueryTerminatedEvent to contain error class if it exists in exception

2023-05-15 Thread via GitHub
HeartSaVioR commented on code in PR #41150: URL: https://github.com/apache/spark/pull/41150#discussion_r1193464953 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala: ## @@ -154,11 +154,22 @@ object StreamingQueryListener { * @param runId

[GitHub] [spark] turboFei commented on a diff in pull request #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.

2023-05-15 Thread via GitHub
turboFei commented on code in PR #22911: URL: https://github.com/apache/spark/pull/22911#discussion_r1193384703 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfExecutorFeatureStep.scala: ## @@ -1,40 +0,0 @@ -/* - * Licensed to

[GitHub] [spark] eejbyfeldt commented on a diff in pull request #38428: [SPARK-40912][CORE]Overhead of Exceptions in KryoDeserializationStream

2023-05-15 Thread via GitHub
eejbyfeldt commented on code in PR #38428: URL: https://github.com/apache/spark/pull/38428#discussion_r1193411810 ## core/src/test/scala/org/apache/spark/serializer/KryoIteratorBenchmark.scala: ## @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] [spark] panbingkun opened a new pull request, #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

2023-05-15 Thread via GitHub
panbingkun opened a new pull request, #41169: URL: https://github.com/apache/spark/pull/41169 ### What changes were proposed in this pull request? The pr aims to add a max distance argument to the levenshtein() function. ### Why are the changes needed? Currently, Spark's
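
The proposal is easy to illustrate with a minimal sketch (not Spark's implementation; the `-1` sentinel for "distance exceeds the bound" is an assumption made here for illustration): a row-by-row dynamic-programming Levenshtein that bails out as soon as every cell in the current row already exceeds the maximum distance, so long inputs that are clearly too far apart cost far less work.

```python
def levenshtein_bounded(a: str, b: str, max_dist: int) -> int:
    """Levenshtein distance with an early exit: returns -1 when the true
    distance exceeds max_dist. Illustrative sketch only, not Spark's code."""
    if abs(len(a) - len(b)) > max_dist:
        return -1  # length difference alone already exceeds the bound
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        if min(cur) > max_dist:
            return -1  # every alignment already costs more than max_dist
        prev = cur
    return prev[-1] if prev[-1] <= max_dist else -1

print(levenshtein_bounded("kitten", "sitting", 3))  # 3
print(levenshtein_bounded("kitten", "sitting", 2))  # -1
```

The early-exit check is sound because values in a Levenshtein DP row never decrease as rows advance, so once the whole row exceeds the bound the final answer must too.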

[GitHub] [spark] WweiL commented on a diff in pull request #41039: [SPARK-43360][SS][CONNECT] Scala client StreamingQueryManager

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41039: URL: https://github.com/apache/spark/pull/41039#discussion_r1193425714 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala: ## @@ -131,6 +131,7 @@ object

[GitHub] [spark] advancedxy commented on pull request #41168: [SPARK-43454] support substitution for SparkConf's get and getAllWithPrefix

2023-05-15 Thread via GitHub
advancedxy commented on PR #41168: URL: https://github.com/apache/spark/pull/41168#issuecomment-1547370686 The CI should pass. Would you mind taking a look, @cloud-fan @vanzin? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please

[GitHub] [spark] zhengruifeng commented on pull request #41167: [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot

2023-05-15 Thread via GitHub
zhengruifeng commented on PR #41167: URL: https://github.com/apache/spark/pull/41167#issuecomment-1547442342 thanks @HyukjinKwon for reviews, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Add DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
HeartSaVioR commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1193370059 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala: ## @@ -534,6 +552,8 @@ final class DataStreamWriter[T] private[sql](ds:

[GitHub] [spark] HeartSaVioR commented on pull request #40892: [SPARK-43128][CONNECT][SS] Make `recentProgress` and `lastProgress` return `StreamingQueryProgress` consistent with the native Scala Api

2023-05-15 Thread via GitHub
HeartSaVioR commented on PR #40892: URL: https://github.com/apache/spark/pull/40892#issuecomment-1547265202 I'll leave this to you for self-merging, so that you can test your new permission. Congrats again :) -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [spark] LuciferYang commented on pull request #40892: [SPARK-43128][CONNECT][SS] Make `recentProgress` and `lastProgress` return `StreamingQueryProgress` consistent with the native Scala Api

2023-05-15 Thread via GitHub
LuciferYang commented on PR #40892: URL: https://github.com/apache/spark/pull/40892#issuecomment-1547269299 > I'll leave this to you for self-merging, so that you can test your new permission. Congrats again :) Thanks @HeartSaVioR :) -- This is an automated message from the Apache

[GitHub] [spark] Stove-hust commented on pull request #40412: [SPARK-42784] should still create subDir when the number of subDir in merge dir is less than conf

2023-05-15 Thread via GitHub
Stove-hust commented on PR #40412: URL: https://github.com/apache/spark/pull/40412#issuecomment-1547331364 @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] nija-at commented on a diff in pull request #41138: [SPARK-43457][CONNECT][PYTHON] Augument user agent with OS, Python and Spark versions

2023-05-15 Thread via GitHub
nija-at commented on code in PR #41138: URL: https://github.com/apache/spark/pull/41138#discussion_r1193456123 ## python/pyspark/sql/connect/client.py: ## @@ -299,7 +301,12 @@ def userAgent(self) -> str: raise SparkConnectException( f"'user_agent'
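
The idea in this diff — appending environment facts to the Connect client's user agent — can be sketched as follows (the function name and token format are illustrative assumptions, not pyspark's actual code):

```python
import platform

def augmented_user_agent(base: str = "_SPARK_CONNECT_PYTHON") -> str:
    # Hypothetical sketch: append OS and Python-version tokens to a base
    # user-agent string, in the spirit of the PR. Token names are assumed.
    tokens = [
        base,
        f"os/{platform.system().lower()}",
        f"python/{platform.python_version()}",
    ]
    return " ".join(tokens)

print(augmented_user_agent())
```

Emitting each fact as a `key/value` token keeps the string trivially parseable on the server side without a structured header.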

[GitHub] [spark] Hisoka-X commented on pull request #41111: [SPARK-39420][SQL] Support `ANALYZE TABLE` on Datasource V2 tables

2023-05-15 Thread via GitHub
Hisoka-X commented on PR #41111: URL: https://github.com/apache/spark/pull/41111#issuecomment-1547474099 cc @MaxGekk @cloud-fan @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] LuciferYang commented on pull request #41122: [SPARK-43436][BUILD] Upgrade rocksdbjni to 8.1.1.1

2023-05-15 Thread via GitHub
LuciferYang commented on PR #41122: URL: https://github.com/apache/spark/pull/41122#issuecomment-1547242903 Thanks @anishshri-db -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] viirya commented on a diff in pull request #41150: [SPARK-43482][SS] Expand QueryTerminatedEvent to contain error class if it exists in exception

2023-05-15 Thread via GitHub
viirya commented on code in PR #41150: URL: https://github.com/apache/spark/pull/41150#discussion_r1193388602 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala: ## @@ -154,11 +154,22 @@ object StreamingQueryListener { * @param runId A

[GitHub] [spark] turboFei commented on pull request #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.

2023-05-15 Thread via GitHub
turboFei commented on PR #22911: URL: https://github.com/apache/spark/pull/22911#issuecomment-1547294539 > The main two things that don't need to happen in executors anymore are: > adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #41150: [SPARK-43482][SS] Expand QueryTerminatedEvent to contain error class if it exists in exception

2023-05-15 Thread via GitHub
HeartSaVioR commented on code in PR #41150: URL: https://github.com/apache/spark/pull/41150#discussion_r1193454180 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala: ## @@ -154,11 +154,22 @@ object StreamingQueryListener { * @param runId

[GitHub] [spark] zhengruifeng closed pull request #41167: [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot

2023-05-15 Thread via GitHub
zhengruifeng closed pull request #41167: [SPARK-43500][PYTHON][TESTS] Test `DataFrame.drop` with empty column list and names containing dot URL: https://github.com/apache/spark/pull/41167 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] cloud-fan commented on a diff in pull request #41151: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause.

2023-05-15 Thread via GitHub
cloud-fan commented on code in PR #41151: URL: https://github.com/apache/spark/pull/41151#discussion_r1193569101 ## docs/sql-ref-syntax-qry-select-limit.md: ## @@ -91,7 +91,21 @@ SELECT name, age FROM person ORDER BY name LIMIT length('SPARK'); -- A non-foldable expression

[GitHub] [spark] panbingkun commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

2023-05-15 Thread via GitHub
panbingkun commented on PR #41169: URL: https://github.com/apache/spark/pull/41169#issuecomment-1547530123 > Maybe we should keep connect client synchronized with this change, or at least add an `exclude` entry in `CheckConnectJvmClientCompatibility` I will add `exclude` entry in

[GitHub] [spark] MaxGekk closed pull request #41143: [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions

2023-05-15 Thread via GitHub
MaxGekk closed pull request #41143: [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions URL: https://github.com/apache/spark/pull/41143 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] cloud-fan commented on a diff in pull request #41151: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause.

2023-05-15 Thread via GitHub
cloud-fan commented on code in PR #41151: URL: https://github.com/apache/spark/pull/41151#discussion_r1193569941 ## docs/sql-ref-syntax-qry-select-limit.md: ## @@ -91,7 +91,21 @@ SELECT name, age FROM person ORDER BY name LIMIT length('SPARK'); -- A non-foldable expression

[GitHub] [spark] MaxGekk commented on pull request #41143: [SPARK-43485][SQL] Fix the error message for the `unit` argument of the datetime add/diff functions

2023-05-15 Thread via GitHub
MaxGekk commented on PR #41143: URL: https://github.com/apache/spark/pull/41143#issuecomment-1547632273 I do believe the failed test suite `HealthTrackerIntegrationSuite` is not related to my changes. Though, I have checked it locally. Merging to master. Thank you, @HyukjinKwon and

[GitHub] [spark] panbingkun opened a new pull request, #41170: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread via GitHub
panbingkun opened a new pull request, #41170: URL: https://github.com/apache/spark/pull/41170 ### What changes were proposed in this pull request? The pr aims to remove redundant character escape "\\" and add UT for SparkHadoopUtil.substituteHadoopVariables. ### Why are the

[GitHub] [spark] LuciferYang commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

2023-05-15 Thread via GitHub
LuciferYang commented on PR #41169: URL: https://github.com/apache/spark/pull/41169#issuecomment-1547697865 @panbingkun run `ProtoToParsedPlanTestSuite:` with this pr, `function_levenshtein` failed as follows: ``` [info] - function_levenshtein *** FAILED *** (4 milliseconds) [info]

[GitHub] [spark] DHKold commented on pull request #40491: [SPARK-41006][K8S] Generate new ConfigMap names for each run

2023-05-15 Thread via GitHub
DHKold commented on PR #40491: URL: https://github.com/apache/spark/pull/40491#issuecomment-1547710656 Hi, sorry, I was away for some time. What remains to do: - add some unit tests for the proposed changes (I'll try to do that this week) - check if the configmap for the driver should

[GitHub] [spark] bozhang2820 commented on pull request #41140: [SPARK-38469][CORE] Use error class in org.apache.spark.network

2023-05-15 Thread via GitHub
bozhang2820 commented on PR #41140: URL: https://github.com/apache/spark/pull/41140#issuecomment-1547714691 CI test `single listener, check trigger events are generated correctly` failed, which should be irrelevant to this change? -- This is an automated message from the Apache Git

[GitHub] [spark] jakubhava commented on pull request #40491: [SPARK-41006][K8S] Generate new ConfigMap names for each run

2023-05-15 Thread via GitHub
jakubhava commented on PR #40491: URL: https://github.com/apache/spark/pull/40491#issuecomment-1547714727 On our side we duplicated the code, also because the driver config map is hard-coded as well. We are using the same launcher to start multiple Spark sessions, hence it would lead to

[GitHub] [spark] cloud-fan commented on a diff in pull request #41142: [SPARK-43302][SQL][FOLLOWUP] Code cleanup for PythonUDAF

2023-05-15 Thread via GitHub
cloud-fan commented on code in PR #41142: URL: https://github.com/apache/spark/pull/41142#discussion_r1193816925 ## python/pyspark/sql/udf.py: ## @@ -439,13 +439,23 @@ def func(*args: Any, **kwargs: Any) -> Any: func.__signature__ = inspect.signature(f) #
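
The diff hunk touches the line `func.__signature__ = inspect.signature(f)`. As a standalone Python pattern (illustrative, not Spark's actual wrapper), setting `__signature__` on a wrapper makes introspection tools report the wrapped function's real parameters:

```python
import functools
import inspect

def keep_signature(f):
    # Generic decorator sketch: forward calls to f, and expose f's signature
    # on the wrapper so help() and inspect.signature() show real parameters.
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    functools.update_wrapper(wrapper, f)
    wrapper.__signature__ = inspect.signature(f)
    return wrapper

@keep_signature
def add(x: int, y: int) -> int:
    return x + y

print(inspect.signature(add))  # (x: int, y: int) -> int
```

`functools.update_wrapper` copies `__name__`, `__doc__`, and `__wrapped__`; assigning `__signature__` explicitly pins the reported signature even for tools that do not follow `__wrapped__`.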

[GitHub] [spark] LuciferYang commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

2023-05-15 Thread via GitHub
LuciferYang commented on code in PR #41169: URL: https://github.com/apache/spark/pull/41169#discussion_r1193717481 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java: ## @@ -37,7 +37,6 @@ import static org.apache.spark.unsafe.Platform.*; - Review

[GitHub] [spark] panbingkun commented on pull request #41171: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread via GitHub
panbingkun commented on PR #41171: URL: https://github.com/apache/spark/pull/41171#issuecomment-1547728854 cc @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] cloud-fan commented on pull request #41151: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause.

2023-05-15 Thread via GitHub
cloud-fan commented on PR #41151: URL: https://github.com/apache/spark/pull/41151#issuecomment-1547787139 thanks, merging to master/3.4! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] LuciferYang commented on pull request #41171: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread via GitHub
LuciferYang commented on PR #41171: URL: https://github.com/apache/spark/pull/41171#issuecomment-1547794345 Are there any other similar cases -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] panbingkun commented on pull request #41171: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread via GitHub
panbingkun commented on PR #41171: URL: https://github.com/apache/spark/pull/41171#issuecomment-1547798463 > Are there any other similar cases No further cases have been found so far. -- This is an automated message from the Apache Git Service. To respond to the message, please log

[GitHub] [spark] cloud-fan closed pull request #41151: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause.

2023-05-15 Thread via GitHub
cloud-fan closed pull request #41151: [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause. URL: https://github.com/apache/spark/pull/41151 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] LuciferYang commented on pull request #41166: [SPARK-40189][SQL] Add `json_array_get` function

2023-05-15 Thread via GitHub
LuciferYang commented on PR #41166: URL: https://github.com/apache/spark/pull/41166#issuecomment-1547821409 (screenshot: https://github.com/apache/spark/assets/1475305/342d7c37-95cb-4760-9829-5cdc4a4ebfc9) It seems that Presto does not recommend using this function, and it may be removed in

[GitHub] [spark] Hisoka-X commented on pull request #41166: [SPARK-40189][SQL] Add `json_array_get` function

2023-05-15 Thread via GitHub
Hisoka-X commented on PR #41166: URL: https://github.com/apache/spark/pull/41166#issuecomment-1547829158 > (screenshot: https://user-images.githubusercontent.com/1475305/238357206-342d7c37-95cb-4760-9829-5cdc4a4ebfc9.png) > It seems that Presto does not recommend using this function, and it may be

[GitHub] [spark] HeartSaVioR commented on pull request #40959: [CONNECT][SS]Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark Connect

2023-05-15 Thread via GitHub
HeartSaVioR commented on PR #40959: URL: https://github.com/apache/spark/pull/40959#issuecomment-1547833963 Could you please file a new JIRA ticket, or add JIRA ticket number as the prefix of PR title? You can see examples https://github.com/apache/spark/pulls. -- This is an automated

[GitHub] [spark] whutpencil commented on pull request #38146: [SPARK-40687][SQL] Support data masking built-in function 'mask'

2023-05-15 Thread via GitHub
whutpencil commented on PR #38146: URL: https://github.com/apache/spark/pull/38146#issuecomment-1547840396 @vinodkc In the Hive source code, the types supported by `mask` have this annotation: value - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING,

[GitHub] [spark] panbingkun commented on a diff in pull request #41170: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread via GitHub
panbingkun commented on code in PR #41170: URL: https://github.com/apache/spark/pull/41170#discussion_r1193924645 ## core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala: ## @@ -123,6 +123,58 @@ class SparkHadoopUtilSuite extends SparkFunSuite {

[GitHub] [spark] nija-at commented on a diff in pull request #41013: [SPARK-43509][CONNECT] Support Creating multiple Spark Connect sessions

2023-05-15 Thread via GitHub
nija-at commented on code in PR #41013: URL: https://github.com/apache/spark/pull/41013#discussion_r1193998391 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -3272,6 +3272,21 @@ def test_error_stack_trace(self): ) spark.stop() +def

[GitHub] [spark] sweisdb commented on a diff in pull request #40970: [SPARK-43290][SQL] Adds IV and AAD support to aes_encrypt/aes_decrypt

2023-05-15 Thread via GitHub
sweisdb commented on code in PR #40970: URL: https://github.com/apache/spark/pull/40970#discussion_r1194059698 ## core/src/main/resources/error/error-classes.json: ## @@ -1051,6 +1051,16 @@ "expects a binary value with 16, 24 or 32 bytes, but got bytes." ]

[GitHub] [spark] rangadi commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Add DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
rangadi commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1194147496 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala: ## @@ -455,6 +465,16 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T])

[GitHub] [spark] panbingkun commented on a diff in pull request #41170: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread via GitHub
panbingkun commented on code in PR #41170: URL: https://github.com/apache/spark/pull/41170#discussion_r1193922765 ## core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala: ## @@ -247,7 +247,7 @@ private[spark] class SparkHadoopUtil extends Logging { if

[GitHub] [spark] nija-at commented on a diff in pull request #41013: [SPARK-43509][CONNECT] Support Creating multiple Spark Connect sessions

2023-05-15 Thread via GitHub
nija-at commented on code in PR #41013: URL: https://github.com/apache/spark/pull/41013#discussion_r1193998391 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -3272,6 +3272,21 @@ def test_error_stack_trace(self): ) spark.stop() +def

[GitHub] [spark] WweiL commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Add DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1194200121 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala: ## @@ -534,6 +554,8 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {

[GitHub] [spark] chenhao-db commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

2023-05-15 Thread via GitHub
chenhao-db commented on PR #41169: URL: https://github.com/apache/spark/pull/41169#issuecomment-1548233629 I am wondering whether it is better to follow PostgreSQL's semantics: > If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct

[GitHub] [spark] srowen closed pull request #41171: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread via GitHub
srowen closed pull request #41171: [SPARK-43508][DOC] Replace the link related to hadoop version 2 with hadoop version 3 URL: https://github.com/apache/spark/pull/41171 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] amaliujia commented on a diff in pull request #41013: [SPARK-43509][CONNECT] Support Creating multiple Spark Connect sessions

2023-05-15 Thread via GitHub
amaliujia commented on code in PR #41013: URL: https://github.com/apache/spark/pull/41013#discussion_r1194102518 ## python/pyspark/sql/session.py: ## @@ -394,6 +394,36 @@ def enableHiveSupport(self) -> "SparkSession.Builder": """ return

[GitHub] [spark] WweiL commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Add DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1194154060 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala: ## @@ -455,6 +465,16 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {

[GitHub] [spark] rangadi commented on a diff in pull request #41075: [SPARK-43361][PROTOBUF] spark-protobuf: allow serde with enum as ints

2023-05-15 Thread via GitHub
rangadi commented on code in PR #41075: URL: https://github.com/apache/spark/pull/41075#discussion_r1194138082 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -2812,4 +2813,18 @@ private[sql] object QueryExecutionErrors extends

[GitHub] [spark] panbingkun commented on a diff in pull request #41170: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread via GitHub
panbingkun commented on code in PR #41170: URL: https://github.com/apache/spark/pull/41170#discussion_r1193920052 ## core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala: ## @@ -247,7 +247,7 @@ private[spark] class SparkHadoopUtil extends Logging { if

[GitHub] [spark] wangyum commented on pull request #41141: [SPARK-43461][BUILD] Skip compiling useless files when making distribution

2023-05-15 Thread via GitHub
wangyum commented on PR #41141: URL: https://github.com/apache/spark/pull/41141#issuecomment-1548013401 cc @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] hvanhovell commented on pull request #40796: [SPARK-43223][Connect] Typed agg, reduce functions, RelationalGroupedDataset#as

2023-05-15 Thread via GitHub
hvanhovell commented on PR #40796: URL: https://github.com/apache/spark/pull/40796#issuecomment-1548042425 Merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] dongjoon-hyun closed pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

2023-05-15 Thread via GitHub
dongjoon-hyun closed pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect` URL: https://github.com/apache/spark/pull/41127 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] dongjoon-hyun commented on pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

2023-05-15 Thread via GitHub
dongjoon-hyun commented on PR #41127: URL: https://github.com/apache/spark/pull/41127#issuecomment-1548298138 Thank you so much, @zhengruifeng , @HyukjinKwon , @xinrong-meng , @itholic ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[GitHub] [spark] warrenzhu25 commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map output when fetch failed

2023-05-15 Thread via GitHub
warrenzhu25 commented on PR #41083: URL: https://github.com/apache/spark/pull/41083#issuecomment-1548009051 > How are you observing recoverable fetch failures? I have seen 2 cases when the target executor has busy shuffle fetch and upload due to shuffle migration: 1. All Netty

[GitHub] [spark] hvanhovell closed pull request #40796: [SPARK-43223][Connect] Typed agg, reduce functions, RelationalGroupedDataset#as

2023-05-15 Thread via GitHub
hvanhovell closed pull request #40796: [SPARK-43223][Connect] Typed agg, reduce functions, RelationalGroupedDataset#as URL: https://github.com/apache/spark/pull/40796 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] manuzhang opened a new pull request, #41173: [SPARK-43510][YARN] Fix YarnAllocator internal state when adding running executor after processing completed containers

2023-05-15 Thread via GitHub
manuzhang opened a new pull request, #41173: URL: https://github.com/apache/spark/pull/41173 ### What changes were proposed in this pull request? Keep track of completed container ids in YarnAllocator and don't update internal state of a container if it's already completed.
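
The guard this PR describes — remembering completed container ids so that a late "running" update cannot resurrect a finished container — can be sketched like this (class and method names are illustrative assumptions, not YarnAllocator's actual fields):

```python
class AllocatorStateSketch:
    """Minimal sketch of the fix described in the PR: once a container id is
    recorded as completed, a stale 'running' notification for it is ignored
    instead of corrupting the running-executor bookkeeping. Illustrative only."""

    def __init__(self):
        self.running = set()
        self.completed = set()

    def on_container_completed(self, cid: str) -> None:
        # Record completion and drop any running-state entry for this id.
        self.completed.add(cid)
        self.running.discard(cid)

    def on_container_running(self, cid: str) -> bool:
        # Ignore out-of-order updates for containers that already finished.
        if cid in self.completed:
            return False  # stale update: container already completed
        self.running.add(cid)
        return True

state = AllocatorStateSketch()
state.on_container_completed("container_1")
print(state.on_container_running("container_1"))  # False
print(state.on_container_running("container_2"))  # True
```

The key design point is ordering-independence: whichever of the two events arrives first, the completed set wins, so internal state never counts a finished container as running.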

[GitHub] [spark] sunchao closed pull request #41164: [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread via GitHub
sunchao closed pull request #41164: [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile` URL: https://github.com/apache/spark/pull/41164

[GitHub] [spark] sunchao commented on pull request #41164: [SPARK-43494][CORE] Directly call `replicate()` for `HdfsDataOutputStreamBuilder` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread via GitHub
sunchao commented on PR #41164: URL: https://github.com/apache/spark/pull/41164#issuecomment-1548109195 Merged to master, thanks!

[GitHub] [spark] WweiL commented on a diff in pull request #41129: [SPARK-43133] Scala Client DataStreamWriter Foreach support

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41129: URL: https://github.com/apache/spark/pull/41129#discussion_r1194152295 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala: ## @@ -534,6 +566,11 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {

[GitHub] [spark] dtenedor commented on a diff in pull request #41007: [WIP][SPARK-43205] IDENTIFIER clause

2023-05-15 Thread via GitHub
dtenedor commented on code in PR #41007: URL: https://github.com/apache/spark/pull/41007#discussion_r1194198592 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -434,17 +434,31 @@ resource dmlStatementNoWith : insertInto query

[GitHub] [spark] bogao007 commented on a diff in pull request #41039: [SPARK-43360][SS][CONNECT] Scala client StreamingQueryManager

2023-05-15 Thread via GitHub
bogao007 commented on code in PR #41039: URL: https://github.com/apache/spark/pull/41039#discussion_r1194437164 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache

[GitHub] [spark] zhenlineo opened a new pull request, #41174: [SPARK-43415]Adding mapValues func before the agg exprs

2023-05-15 Thread via GitHub
zhenlineo opened a new pull request, #41174: URL: https://github.com/apache/spark/pull/41174 ### What changes were proposed in this pull request? The `KVGDS#agg` and `reduceGroups` were not able to chain mapValues functions directly with columns. This PR adds a new unresolved func

[GitHub] [spark] srowen closed pull request #41170: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread via GitHub
srowen closed pull request #41170: [MINOR] Remove redundant character escape "\\" and add UT URL: https://github.com/apache/spark/pull/41170

[GitHub] [spark] srowen commented on pull request #41170: [MINOR] Remove redundant character escape "\\" and add UT

2023-05-15 Thread via GitHub
srowen commented on PR #41170: URL: https://github.com/apache/spark/pull/41170#issuecomment-1548731573 Merged to master

[GitHub] [spark] WeichenXu123 opened a new pull request, #41176: [WIP] [SPARK-43516] [ML] Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator

2023-05-15 Thread via GitHub
WeichenXu123 opened a new pull request, #41176: URL: https://github.com/apache/spark/pull/41176 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] chaoqin-li1123 commented on pull request #41099: [SPARK-43421][SS] Implement Changelog based Checkpointing for RocksDB State Store Provider

2023-05-15 Thread via GitHub
chaoqin-li1123 commented on PR #41099: URL: https://github.com/apache/spark/pull/41099#issuecomment-1548781495 I have addressed most pending comments. Could you take a look? @HeartSaVioR

[GitHub] [spark] xinrong-meng commented on pull request #41149: [SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame

2023-05-15 Thread via GitHub
xinrong-meng commented on PR #41149: URL: https://github.com/apache/spark/pull/41149#issuecomment-1548403863 Does that refactoring still conform to UNSUPPORTED_DATA_TYPE_FOR_ARROW_VERSION?

[GitHub] [spark] gengliangwang commented on a diff in pull request #41007: [WIP][SPARK-43205] IDENTIFIER clause

2023-05-15 Thread via GitHub
gengliangwang commented on code in PR #41007: URL: https://github.com/apache/spark/pull/41007#discussion_r1194354914 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -434,17 +434,31 @@ resource dmlStatementNoWith : insertInto

[GitHub] [spark] srielau commented on a diff in pull request #41007: [WIP][SPARK-43205] IDENTIFIER clause

2023-05-15 Thread via GitHub
srielau commented on code in PR #41007: URL: https://github.com/apache/spark/pull/41007#discussion_r1194384090 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala: ## @@ -276,6 +279,138 @@ object UnresolvedAttribute { } } +/** + * Holds

[GitHub] [spark] jiangxb1987 commented on pull request #40690: [SPARK-43043][CORE] Improve the performance of MapOutputTracker.updateMapOutput

2023-05-15 Thread via GitHub
jiangxb1987 commented on PR #40690: URL: https://github.com/apache/spark/pull/40690#issuecomment-1548728418 @dongjoon-hyun I created https://issues.apache.org/jira/browse/SPARK-43515 as a followup task to add a micro-benchmark.

[GitHub] [spark] rangadi commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Python Client DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
rangadi commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1194458342 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -2385,6 +2385,13 @@ class SparkConnectPlanner(val

[GitHub] [spark] srielau commented on a diff in pull request #41007: [WIP][SPARK-43205] IDENTIFIER clause

2023-05-15 Thread via GitHub
srielau commented on code in PR #41007: URL: https://github.com/apache/spark/pull/41007#discussion_r1194378050 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -2108,6 +2123,18 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] WweiL commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Add DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1194355729 ## sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala: ## @@ -532,7 +547,10 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {

[GitHub] [spark] xinrong-meng commented on pull request #41160: [SPARK-41971][PYTHON][FOLLOWUP] Fix toPandas to support empty columns

2023-05-15 Thread via GitHub
xinrong-meng commented on PR #41160: URL: https://github.com/apache/spark/pull/41160#issuecomment-1548618044 Late LGTM, thank you!

[GitHub] [spark] zhengruifeng commented on pull request #40896: [SPARK-43229][ML][PYTHON][CONNECT] Introduce Barrier Python UDF

2023-05-15 Thread via GitHub
zhengruifeng commented on PR #40896: URL: https://github.com/apache/spark/pull/40896#issuecomment-1548778624 I think we (@HyukjinKwon @WeichenXu123 and I) have reached agreement that this PR (which introduces an `@barrier` annotation) is no longer needed, and we will keep current `barrier`

[GitHub] [spark] github-actions[bot] closed pull request #39502: [SPARK-41981][SQL] Merge percentile functions if possible

2023-05-15 Thread via GitHub
github-actions[bot] closed pull request #39502: [SPARK-41981][SQL] Merge percentile functions if possible URL: https://github.com/apache/spark/pull/39502

[GitHub] [spark] github-actions[bot] closed pull request #39381: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…

2023-05-15 Thread via GitHub
github-actions[bot] closed pull request #39381: [SPARK-41554] fix changing of Decimal scale when scale decreased by m… URL: https://github.com/apache/spark/pull/39381

[GitHub] [spark] github-actions[bot] closed pull request #39770: [WIP][SPARK-42206][CORE] Omit "Task Executor Metrics" field in eventlogs if values are all zero

2023-05-15 Thread via GitHub
github-actions[bot] closed pull request #39770: [WIP][SPARK-42206][CORE] Omit "Task Executor Metrics" field in eventlogs if values are all zero URL: https://github.com/apache/spark/pull/39770

[GitHub] [spark] WweiL commented on a diff in pull request #41026: [SPARK-43132] [SS] [CONNECT] Python Client DataStreamWriter foreach() API

2023-05-15 Thread via GitHub
WweiL commented on code in PR #41026: URL: https://github.com/apache/spark/pull/41026#discussion_r1194484668 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -40,7 +41,11 @@ import

[GitHub] [spark] srielau commented on a diff in pull request #41007: [WIP][SPARK-43205] IDENTIFIER clause

2023-05-15 Thread via GitHub
srielau commented on code in PR #41007: URL: https://github.com/apache/spark/pull/41007#discussion_r1194380150 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala: ## @@ -276,6 +279,138 @@ object UnresolvedAttribute { } } +/** + * Holds

[GitHub] [spark] xinrong-meng commented on pull request #41149: [SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame

2023-05-15 Thread via GitHub
xinrong-meng commented on PR #41149: URL: https://github.com/apache/spark/pull/41149#issuecomment-1548613143 Sorry I meant `UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION`. Do we have plans to remove the constraints? @ueshin

[GitHub] [spark] srielau commented on a diff in pull request #41007: [WIP][SPARK-43205] IDENTIFIER clause

2023-05-15 Thread via GitHub
srielau commented on code in PR #41007: URL: https://github.com/apache/spark/pull/41007#discussion_r1194386802 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala: ## @@ -276,6 +279,138 @@ object UnresolvedAttribute { } } +/** + * Holds

[GitHub] [spark] ueshin commented on pull request #41149: [SPARK-43473][PYTHON] Support struct type in createDataFrame from pandas DataFrame

2023-05-15 Thread via GitHub
ueshin commented on PR #41149: URL: https://github.com/apache/spark/pull/41149#issuecomment-1548658278 > Do we have plans to remove the constraints? I'm not sure if it's planned, but now we can remove the constraints.

[GitHub] [spark] zhengruifeng commented on pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

2023-05-15 Thread via GitHub
zhengruifeng commented on PR #41127: URL: https://github.com/apache/spark/pull/41127#issuecomment-1548749738 thank you all for the reviews, @dongjoon-hyun @HyukjinKwon @xinrong-meng @itholic
