[GitHub] [spark] rithwik-db commented on a diff in pull request #39369: [SPARK-41775][PYTHON][ML] Adding support for PyTorch functions

2023-01-20 Thread GitBox
rithwik-db commented on code in PR #39369: URL: https://github.com/apache/spark/pull/39369#discussion_r1082896210 ## python/pyspark/ml/torch/tests/test_distributor.py: ## @@ -349,6 +434,13 @@ def test_get_num_tasks_distributed(self) -> None:

[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39369: [SPARK-41775][PYTHON][ML] Adding support for PyTorch functions

2023-01-20 Thread GitBox
lu-wang-dl commented on code in PR #39369: URL: https://github.com/apache/spark/pull/39369#discussion_r1082894756 ## python/pyspark/ml/torch/tests/test_distributor.py: ## @@ -349,6 +434,13 @@ def test_get_num_tasks_distributed(self) -> None:

[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398718941 FYI, GitHub Action is currently on the old version but `jdk8u362` will be automatically applied in one or two weeks. I'll keep monitoring the version change.

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Better env var and a few bug fixes

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39677: URL: https://github.com/apache/spark/pull/39677#discussion_r1082842120 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -121,14 +117,14 @@ object

[GitHub] [spark] dongjoon-hyun commented on pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Better env var and a few bug fixes

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39677: URL: https://github.com/apache/spark/pull/39677#issuecomment-1398677709 Hi, @zhenlineo . When we reuse the same JIRA ID, we need to add `[FOLLOWUP]` to the PR title. I added it this time. -- This is an automated message from the Apache Git Service. To

[GitHub] [spark] dongjoon-hyun closed pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
dongjoon-hyun closed pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362 URL: https://github.com/apache/spark/pull/39671

[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398671949 Now, both Zulu and Adoptium (Temurin) are available. Thank you all. Merged to master.

[GitHub] [spark] srowen commented on pull request #39190: [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API

2023-01-20 Thread GitBox
srowen commented on PR #39190: URL: https://github.com/apache/spark/pull/39190#issuecomment-1398662080 FWIW, this part was last changed in https://issues.apache.org/jira/browse/SPARK-24415 to fix a different bug (CC @ankuriitg ) It might be worth re-running the simple example there to see

[GitHub] [spark] sunchao commented on pull request #39633: [SPARK-42038][SQL] SPJ: Support partially clustered distribution

2023-01-20 Thread GitBox
sunchao commented on PR #39633: URL: https://github.com/apache/spark/pull/39633#issuecomment-1398661965 @cloud-fan the idea is similar to skew join but for v2 sources, let me try to split the code into a separate rule following your idea.

[GitHub] [spark] zhenlineo commented on pull request #39677: [SPARK-42043][CONNECT][TEST] Better env var and a few bug fixes

2023-01-20 Thread GitBox
zhenlineo commented on PR #39677: URL: https://github.com/apache/spark/pull/39677#issuecomment-1398630857 cc @HyukjinKwon @LuciferYang Thanks for the review. I addressed the immediate fix in this PR. For other improvements, I've created tickets, and we can add them as follow-ups.

[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082775554 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache

[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082771233 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache

[GitHub] [spark] peter-toth commented on pull request #38038: [SPARK-42136] Refactor BroadcastHashJoinExec output partitioning calculation

2023-01-20 Thread GitBox
peter-toth commented on PR #38038: URL: https://github.com/apache/spark/pull/38038#issuecomment-1398625037 cc @cloud-fan

[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082763118 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache

[GitHub] [spark] zhenlineo opened a new pull request, #39677: [SPARK-42043][CONNECT][TEST] Better env var and a few bug fixes

2023-01-20 Thread GitBox
zhenlineo opened a new pull request, #39677: URL: https://github.com/apache/spark/pull/39677 ### What changes were proposed in this pull request? Use a better env var to find the spark home in E2E tests. Fixed the jar finding bug for RC builds. Use Nano instead of MS for

[GitHub] [spark] itholic commented on pull request #39505: [SPARK-41979][SQL] Add missing dots for error messages in error classes.

2023-01-20 Thread GitBox
itholic commented on PR #39505: URL: https://github.com/apache/spark/pull/39505#issuecomment-1398592475 Tests passed. cc @MaxGekk @cloud-fan

[GitHub] [spark] kuwii commented on pull request #39190: [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API

2023-01-20 Thread GitBox
kuwii commented on PR #39190: URL: https://github.com/apache/spark/pull/39190#issuecomment-1398579998 I'm not familiar with how Spark creates and runs jobs and stages for a query, but I think it may be related to this case. I can reproduce this locally using Spark on Yarn mode with this

[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082698957 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala: ## @@ -78,7 +78,7 @@ class SparkConnectClientSuite

[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082698009 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/util/Cleaner.scala: ## @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] srowen commented on a diff in pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown

2023-01-20 Thread GitBox
srowen commented on code in PR #39660: URL: https://github.com/apache/spark/pull/39660#discussion_r1082695284 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala: ## @@ -307,11 +307,12 @@ private[jdbc] class JDBCRDD( "" } +

[GitHub] [spark] tgravescs commented on a diff in pull request #39674: [DON'T MERGE] Test remove SPARK_USE_CONC_INCR_GC

2023-01-20 Thread GitBox
tgravescs commented on code in PR #39674: URL: https://github.com/apache/spark/pull/39674#discussion_r1082615175 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1005,26 +1005,6 @@ private[spark] class Client( val tmpDir = new

[GitHub] [spark] grundprinzip commented on a diff in pull request #39585: [SPARK-42124][PYTHON][CONNECT] Scalar Inline Python UDF in Spark Connect

2023-01-20 Thread GitBox
grundprinzip commented on code in PR #39585: URL: https://github.com/apache/spark/pull/39585#discussion_r1082600958 ## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ## @@ -217,6 +218,28 @@ message Expression { bool is_user_defined_function =

[GitHub] [spark] ggershinsky commented on pull request #39665: [SPARK-42114][SQL][TESTS] Add uniform parquet encryption test case

2023-01-20 Thread GitBox
ggershinsky commented on PR #39665: URL: https://github.com/apache/spark/pull/39665#issuecomment-1398455770 Thanks Dongjoon.

[GitHub] [spark] peter-toth commented on pull request #39676: [SPARK-42134][SQL] Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes

2023-01-20 Thread GitBox
peter-toth commented on PR #39676: URL: https://github.com/apache/spark/pull/39676#issuecomment-1398437534 cc @cloud-fan, @huaxingao

[GitHub] [spark] peter-toth opened a new pull request, #39676: [SPARK-42134][SQL] Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes

2023-01-20 Thread GitBox
peter-toth opened a new pull request, #39676: URL: https://github.com/apache/spark/pull/39676 ### What changes were proposed in this pull request? This is a small correctness fix to `DataSourceUtils.getPartitionFiltersAndDataFilters()` to handle filters without any referenced attributes

[GitHub] [spark] srowen commented on pull request #39190: [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API

2023-01-20 Thread GitBox
srowen commented on PR #39190: URL: https://github.com/apache/spark/pull/39190#issuecomment-1398393768 Yeah but do you know how it happens, or have a theory? Just want to see if the change seems to match with some theory of how it arises. Or does this change definitely change the output

[GitHub] [spark] LuciferYang commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
LuciferYang commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398388671 OK, there is plenty of time. I am fine with making this change.

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39675: [MINOR][DOCS] Update the doc of arrow & kubernetes

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39675: URL: https://github.com/apache/spark/pull/39675#discussion_r1082534577 ## docs/running-on-kubernetes.md: ## @@ -34,13 +34,13 @@ Please see [Spark Security](security.html) and the specific security sections in Images built from

[GitHub] [spark] LuciferYang commented on a diff in pull request #39674: [DON'T MERGE] Test remove SPARK_USE_CONC_INCR_GC

2023-01-20 Thread GitBox
LuciferYang commented on code in PR #39674: URL: https://github.com/apache/spark/pull/39674#discussion_r1082533145 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1005,26 +1005,6 @@ private[spark] class Client( val tmpDir = new

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39675: [MINOR][DOCS] Update the doc of arrow & kubernetes

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39675: URL: https://github.com/apache/spark/pull/39675#discussion_r1082532817 ## docs/running-on-kubernetes.md: ## @@ -34,13 +34,13 @@ Please see [Spark Security](security.html) and the specific security sections in Images built from

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39675: [MINOR][DOCS] Update the doc of arrow & kubernetes

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39675: URL: https://github.com/apache/spark/pull/39675#discussion_r1082529696 ## docs/index.md: ## @@ -45,7 +45,6 @@ Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. When using the Scala API, it is necessary for

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39674: [DON'T MERGE] Test remove SPARK_USE_CONC_INCR_GC

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39674: URL: https://github.com/apache/spark/pull/39674#discussion_r1082528460 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1005,26 +1005,6 @@ private[spark] class Client( val tmpDir = new

[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398367056 BTW, we didn't cut the branch yet and we still have one month for Apache Spark 3.4.0 release. I'm considering that time period for this decision, @LuciferYang . You are also

[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398362594 Timezone issues are inevitable; we need to adjust the code for them on a regular basis, @LuciferYang .

[GitHub] [spark] LuciferYang commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
LuciferYang commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398356049 @dongjoon-hyun Hmm... do you remember SPARK-40846? When we upgraded from 8u345 to 8u352 for GA testing, there were some time zone issues that needed to be solved by changing the code, so I

[GitHub] [spark] LuciferYang commented on pull request #39663: [SPARK-42129][BUILD] Upgrade rocksdbjni to 7.9.2

2023-01-20 Thread GitBox
LuciferYang commented on PR #39663: URL: https://github.com/apache/spark/pull/39663#issuecomment-1398346083 Thanks @dongjoon-hyun

[GitHub] [spark] LuciferYang commented on a diff in pull request #39674: [DON'T MERGE] Test remove SPARK_USE_CONC_INCR_GC

2023-01-20 Thread GitBox
LuciferYang commented on code in PR #39674: URL: https://github.com/apache/spark/pull/39674#discussion_r1082496069 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1005,26 +1005,6 @@ private[spark] class Client( val tmpDir = new

[GitHub] [spark] panbingkun opened a new pull request, #39675: [MINOR][DOCS] Update the doc of arrow & kubernetes

2023-01-20 Thread GitBox
panbingkun opened a new pull request, #39675: URL: https://github.com/apache/spark/pull/39675 ### What changes were proposed in this pull request? This PR aims to update the docs for Arrow and Kubernetes. ### Why are the changes needed?

[GitHub] [spark] LuciferYang opened a new pull request, #39674: [DON'T MERGE] Test remove SPARK_USE_CONC_INCR_GC

2023-01-20 Thread GitBox
LuciferYang opened a new pull request, #39674: URL: https://github.com/apache/spark/pull/39674 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] dongjoon-hyun closed pull request #39663: [SPARK-42129][BUILD] Upgrade rocksdbjni to 7.9.2

2023-01-20 Thread GitBox
dongjoon-hyun closed pull request #39663: [SPARK-42129][BUILD] Upgrade rocksdbjni to 7.9.2 URL: https://github.com/apache/spark/pull/39663

[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398319998 To @LuciferYang , I don't think this is a compatibility issue or a failure of any kind.

[GitHub] [spark] dongjoon-hyun commented on pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39541: URL: https://github.com/apache/spark/pull/39541#issuecomment-1398317390 As @HyukjinKwon pointed out, this causes a failure for RC and official release. - https://github.com/apache/spark/pull/39668#issuecomment-1398314758 ![Screenshot

[GitHub] [spark] LuciferYang commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
LuciferYang commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398316727 Could you use 8u362 to run full UTs offline to check compatibility? Thanks ~ @wangyum

[GitHub] [spark] dongjoon-hyun commented on pull request #39668: [WIP] Test 3.4.0 tagging

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39668: URL: https://github.com/apache/spark/pull/39668#issuecomment-1398314758 It seems that we have only one failure. ![Screenshot 2023-01-20 at 4 28 07

[GitHub] [spark] LuciferYang commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362

2023-01-20 Thread GitBox
LuciferYang commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398314217 One problem is that GA is still using Temurin 8u352 for build and test. We need to wait for a while before running GA tasks using 8u362.

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39369: [SPARK-41775][PYTHON][ML] Adding support for PyTorch functions

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39369: URL: https://github.com/apache/spark/pull/39369#discussion_r1082450887 ## python/pyspark/ml/torch/tests/test_distributor.py: ## @@ -224,8 +293,10 @@ def setUp(self) -> None: self.sc =

[GitHub] [spark] wecharyu commented on pull request #39115: [SPARK-41563][SQL] Support partition filter in MSCK REPAIR TABLE statement

2023-01-20 Thread GitBox
wecharyu commented on PR #39115: URL: https://github.com/apache/spark/pull/39115#issuecomment-1398308497 > Can you tune the config spark.sql.addPartitionInBatch.size? Setting it to a larger number can reduce the number of RPCs. It does not help in `RepairTableCommand`, when enable

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39369: [SPARK-41775][PYTHON][ML] Adding support for PyTorch functions

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39369: URL: https://github.com/apache/spark/pull/39369#discussion_r1082443370 ## python/pyspark/ml/torch/distributor.py: ## @@ -495,32 +546,119 @@ def set_gpus(context: "BarrierTaskContext") -> None: def _run_distributed_training(

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39299: [WIP][SPARK-41593][PYTHON][ML] Adding logging from executors

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39299: URL: https://github.com/apache/spark/pull/39299#discussion_r1082428873 ## python/pyspark/ml/torch/log_communication.py: ## @@ -0,0 +1,201 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor

[GitHub] [spark] LuciferYang commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
LuciferYang commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082425788 ## core/src/main/scala/org/apache/spark/status/protobuf/Utils.scala: ## @@ -17,10 +17,24 @@ package org.apache.spark.status.protobuf +import

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39299: [WIP][SPARK-41593][PYTHON][ML] Adding logging from executors

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39299: URL: https://github.com/apache/spark/pull/39299#discussion_r1082420774 ## python/pyspark/ml/torch/log_communication.py: ## @@ -0,0 +1,201 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39299: [WIP][SPARK-41593][PYTHON][ML] Adding logging from executors

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39299: URL: https://github.com/apache/spark/pull/39299#discussion_r1082420394 ## python/pyspark/ml/torch/log_communication.py: ## @@ -0,0 +1,201 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor

[GitHub] [spark] EnricoMi commented on pull request #39640: [SPARK-38591][SQL] Add flatMapSortedGroups and cogroupSorted

2023-01-20 Thread GitBox
EnricoMi commented on PR #39640: URL: https://github.com/apache/spark/pull/39640#issuecomment-1398282605 @cloud-fan following issue: `ds.groupByKey` adds key columns to the plan: ``` def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T] = { val withGroupingKey

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39299: [WIP][SPARK-41593][PYTHON][ML] Adding logging from executors

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39299: URL: https://github.com/apache/spark/pull/39299#discussion_r1082416985 ## python/pyspark/ml/torch/log_communication.py: ## @@ -0,0 +1,201 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #39299: [WIP][SPARK-41593][PYTHON][ML] Adding logging from executors

2023-01-20 Thread GitBox
WeichenXu123 commented on code in PR #39299: URL: https://github.com/apache/spark/pull/39299#discussion_r1082414835 ## python/pyspark/ml/torch/log_communication.py: ## @@ -0,0 +1,201 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor

[GitHub] [spark] wangyum commented on pull request #39671: [SPARK-40303][DOCS] Recommends users to use JDK 8u362 and later versions

2023-01-20 Thread GitBox
wangyum commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398267365 > Oh, does `Zulu` only have that released version, @wangyum ? > > * https://bugs.openjdk.org/browse/JDK-8296506 > > I cannot find docker image and Adoptium (Temurin) Java

[GitHub] [spark] kuwii commented on pull request #39190: [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API

2023-01-20 Thread GitBox
kuwii commented on PR #39190: URL: https://github.com/apache/spark/pull/39190#issuecomment-1398263751 @srowen We found this issue in some of our Spark applications. Here's the event log of an example, which can be loaded through the history server:

[GitHub] [spark] EnricoMi commented on pull request #39673: [SPARK-42132][SQL] Deduplicate attributes in groupByKey.cogroup

2023-01-20 Thread GitBox
EnricoMi commented on PR #39673: URL: https://github.com/apache/spark/pull/39673#issuecomment-1398246138 Ideally, `QueryPlan.rewriteAttrs` would not replace occurrences `id#0L#` with `id#13L` in all fields of `CoGroup`, but only in `rightDeserializer`, `rightGroup`, `rightAttr`,

[GitHub] [spark] EnricoMi opened a new pull request, #39673: [SPARK-42132][SQL] Deduplicate attributes in groupByKey.cogroup

2023-01-20 Thread GitBox
EnricoMi opened a new pull request, #39673: URL: https://github.com/apache/spark/pull/39673 ### What changes were proposed in this pull request? This deduplicates attributes that exist on both sides of a `CoGroup` by aliasing the occurrence on the right side. ### Why are the

[GitHub] [spark] vicennial opened a new pull request, #39672: [SPARK-42133] Add basic Dataset API methods to Spark Connect Scala Client

2023-01-20 Thread GitBox
vicennial opened a new pull request, #39672: URL: https://github.com/apache/spark/pull/39672 ### What changes were proposed in this pull request? Adds the following methods: - Dataframe API methods - project - filter - limit - SparkSession - range (and

[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Recommends users to use JDK 8u362 and later versions

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398234989 Oh, does `Zulu` only have that released version, @wangyum ? - https://bugs.openjdk.org/browse/JDK-8296506 I cannot find docker image and Adoptium (Temurin) Java yet. -

[GitHub] [spark] dongjoon-hyun commented on pull request #38376: [SPARK-40817][K8S] `spark.files` should preserve remote files

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #38376: URL: https://github.com/apache/spark/pull/38376#issuecomment-1398224853 Perfect, @antonipp !

[GitHub] [spark] wangyum opened a new pull request, #39671: [SPARK-40303][DOCS] Recommends users to use JDK 8u362 and later versions

2023-01-20 Thread GitBox
wangyum opened a new pull request, #39671: URL: https://github.com/apache/spark/pull/39671 ### What changes were proposed in this pull request? This PR updates the documentation to recommend that users use JDK 8u362 or later. ### Why are the changes needed? 8u362 fixed a

[GitHub] [spark] antonipp commented on pull request #38376: [SPARK-40817][K8S] `spark.files` should preserve remote files

2023-01-20 Thread GitBox
antonipp commented on PR #38376: URL: https://github.com/apache/spark/pull/38376#issuecomment-1398209302 Thank you for the reviews and for the merge! I am not 100% sure what the backport process is, but I opened 2 PRs (for 3.3 and 3.2) since I believe both are still supported based on

[GitHub] [spark] antonipp opened a new pull request, #39669: [SPARK-40817][K8S][3.3] `spark.files` should preserve remote files

2023-01-20 Thread GitBox
antonipp opened a new pull request, #39669: URL: https://github.com/apache/spark/pull/39669 ### What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/38376 to `branch-3.3` You can find a detailed description of the issue and an example

[GitHub] [spark] antonipp opened a new pull request, #39670: [SPARK-40817][K8S][3.2] `spark.files` should preserve remote files

2023-01-20 Thread GitBox
antonipp opened a new pull request, #39670: URL: https://github.com/apache/spark/pull/39670 ### What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/38376 to `branch-3.2` You can find a detailed description of the issue and an example

[GitHub] [spark] HyukjinKwon commented on pull request #39668: [WIP] Test 3.4.0 tagging

2023-01-20 Thread GitBox
HyukjinKwon commented on PR #39668: URL: https://github.com/apache/spark/pull/39668#issuecomment-1398189235 cc @xinrong-meng FYI

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082329957 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala: ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082329724 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala: ## @@ -78,7 +78,7 @@ class

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082329225 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082319733 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082327967 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082326873 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082326160 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] EnricoMi commented on pull request #39640: [SPARK-38591][SQL] Add flatMapSortedGroups and cogroupSorted

2023-01-20 Thread GitBox
EnricoMi commented on PR #39640: URL: https://github.com/apache/spark/pull/39640#issuecomment-1398179550 Thanks for your time! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] dongjoon-hyun commented on pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39541: URL: https://github.com/apache/spark/pull/39541#issuecomment-1398177490 BTW, while I was reviewing this PR, I felt the necessity to open an official PR to test any potential test cases on tagging. Here is the general PR to detect any `SNAPSHOT`

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082323765 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082319733 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082320316 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] dongjoon-hyun opened a new pull request, #39668: [WIP] Test 3.4.0 tagging

2023-01-20 Thread GitBox
dongjoon-hyun opened a new pull request, #39668: URL: https://github.com/apache/spark/pull/39668 This aims to test the possible test failures on Spark 3.4.0 RC tag. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-01-20 Thread GitBox
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082316930 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] dongjoon-hyun commented on pull request #39665: [SPARK-42114][SQL][TESTS] Add uniform parquet encryption test case

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39665: URL: https://github.com/apache/spark/pull/39665#issuecomment-1398167017 I fixed the `Affected Version` from 3.3.1 to 3.4.0 because this fails in `branch-3.3`. ``` [info] ParquetEncryptionSuite: [info] - SPARK-34990: Write and read an encrypted

[GitHub] [spark] dongjoon-hyun commented on pull request #39664: [SPARK-42114][SQL] Test of uniform parquet encryption

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39664: URL: https://github.com/apache/spark/pull/39664#issuecomment-1398159305 I merged the newer PR, @ggershinsky . :) - https://github.com/apache/spark/commit/e1c630a98c45ae07c43c8cf95979532b51bf59ec -- This is an automated message from the Apache Git

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082308565 ## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto: ## @@ -22,7 +22,12 @@ package org.apache.spark.status.protobuf; * Developer

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082308565 ## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto: ## @@ -22,7 +22,12 @@ package org.apache.spark.status.protobuf; * Developer

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
dongjoon-hyun commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082307793 ## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto: ## @@ -22,7 +22,12 @@ package org.apache.spark.status.protobuf; * Developer

[GitHub] [spark] dongjoon-hyun commented on pull request #39665: [SPARK-42114][SQL][TESTS] Add uniform parquet encryption test case

2023-01-20 Thread GitBox
dongjoon-hyun commented on PR #39665: URL: https://github.com/apache/spark/pull/39665#issuecomment-1398155059 BTW, please add `ggershin...@apple.com` to your GitHub profile as the secondary email. ``` $ git log -n1 commit e1c630a98c45ae07c43c8cf95979532b51bf59ec (HEAD -> master,

[GitHub] [spark] dongjoon-hyun closed pull request #39665: [SPARK-42114][SQL][TESTS] Add uniform parquet encryption test case

2023-01-20 Thread GitBox
dongjoon-hyun closed pull request #39665: [SPARK-42114][SQL][TESTS] Add uniform parquet encryption test case URL: https://github.com/apache/spark/pull/39665 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] gengliangwang commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
gengliangwang commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082286371 ## core/src/main/scala/org/apache/spark/status/protobuf/Utils.scala: ## @@ -17,10 +17,24 @@ package org.apache.spark.status.protobuf +import

[GitHub] [spark] gengliangwang commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
gengliangwang commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082251439 ## core/src/main/scala/org/apache/spark/status/protobuf/Utils.scala: ## @@ -17,10 +17,24 @@ package org.apache.spark.status.protobuf +import

[GitHub] [spark] zhengruifeng commented on pull request #39661: [SPARK-41884][CONNECT] Support naive tuple as a nested row

2023-01-20 Thread GitBox
zhengruifeng commented on PR #39661: URL: https://github.com/apache/spark/pull/39661#issuecomment-1398122520 LGTM, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] beliefer commented on a diff in pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown

2023-01-20 Thread GitBox
beliefer commented on code in PR #39660: URL: https://github.com/apache/spark/pull/39660#discussion_r1082270560 ## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala: ## @@ -544,6 +544,14 @@ abstract class JdbcDialect extends Serializable with Logging {

[GitHub] [spark] beliefer opened a new pull request, #39667: [SPARK-42131][SQL] Extract the function that construct the select statement for JDBC dialect.

2023-01-20 Thread GitBox
beliefer opened a new pull request, #39667: URL: https://github.com/apache/spark/pull/39667 ### What changes were proposed in this pull request? Currently, JDBCRDD uses fixed format for SELECT statement. ``` val sqlText = options.prepareQuery + s"SELECT $columnList FROM

[GitHub] [spark] sadikovi commented on pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown

2023-01-20 Thread GitBox
sadikovi commented on PR #39660: URL: https://github.com/apache/spark/pull/39660#issuecomment-1398098578 Thanks @dongjoon-hyun. I will address your comments soon-ish. @beliefer, yes, you are right. The documentation describes TOP (N) returning the N top rows when used together with

[GitHub] [spark] gengliangwang commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
gengliangwang commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082251439 ## core/src/main/scala/org/apache/spark/status/protobuf/Utils.scala: ## @@ -17,10 +17,24 @@ package org.apache.spark.status.protobuf +import

[GitHub] [spark] LuciferYang commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary

2023-01-20 Thread GitBox
LuciferYang commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082248889 ## core/src/main/scala/org/apache/spark/status/protobuf/Utils.scala: ## @@ -17,10 +17,24 @@ package org.apache.spark.status.protobuf +import

[GitHub] [spark] EnricoMi closed pull request #37551: [SPARK-38591][SQL] Add sortWithinGroups to KeyValueGroupedDataset

2023-01-20 Thread GitBox
EnricoMi closed pull request #37551: [SPARK-38591][SQL] Add sortWithinGroups to KeyValueGroupedDataset URL: https://github.com/apache/spark/pull/37551 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] EnricoMi commented on pull request #37551: [SPARK-38591][SQL] Add sortWithinGroups to KeyValueGroupedDataset

2023-01-20 Thread GitBox
EnricoMi commented on PR #37551: URL: https://github.com/apache/spark/pull/37551#issuecomment-1398081395 Closing as #39640 has been merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] LuciferYang commented on pull request #39642: [SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for `StreamingQueryProgressWrapper`

2023-01-20 Thread GitBox
LuciferYang commented on PR #39642: URL: https://github.com/apache/spark/pull/39642#issuecomment-1398075391 Will refactor after https://github.com/apache/spark/pull/39666 merged -- This is an automated message from the Apache Git Service. To respond to the message, please log on
