[GitHub] [spark] LuciferYang commented on a diff in pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
LuciferYang commented on code in PR #39682:
URL: https://github.com/apache/spark/pull/39682#discussion_r1083259379

## sql/core/src/test/scala/org/apache/spark/status/protobuf/sql/KVStoreProtobufSerializerSuite.scala:

@@ -48,6 +48,43 @@ class KVStoreProtobufSerializerSuite extends SparkFunSuite {
     assert(result.metricValues == input.metricValues)
   }

+  test("SQLExecutionUIData with null string fields") {

Review Comment:
[5bb20d9](https://github.com/apache/spark/pull/39682/commits/5bb20d90ce0e17a999423be3d6577b7835f1aaff) moved this into the previous test case with a new loop.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #39686: [SPARK-42143][UI] Handle null string values in RDDStorageInfo/RDDDataDistribution/RDDPartitionInfo
gengliangwang commented on PR #39686:
URL: https://github.com/apache/spark/pull/39686#issuecomment-1399199547

cc @LuciferYang
[GitHub] [spark] gengliangwang opened a new pull request, #39686: [SPARK-42143][UI] Handle null string values in RDDStorageInfo/RDDDataDistribution/RDDPartitionInfo
gengliangwang opened a new pull request, #39686:
URL: https://github.com/apache/spark/pull/39686

### What changes were proposed in this pull request?
Similar to https://github.com/apache/spark/pull/39666, this PR handles null string values in RDDStorageInfo/RDDDataDistribution/RDDPartitionInfo.

### Why are the changes needed?
Properly handle null string values in the protobuf serializer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UTs
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39637: [SPARK-41777][PYSPARK][ML] Integration testing for TorchDistributor
HyukjinKwon commented on code in PR #39637:
URL: https://github.com/apache/spark/pull/39637#discussion_r1083256527

## dev/requirements.txt:

@@ -59,3 +59,7 @@ googleapis-common-protos==1.56.4
 mypy-protobuf==3.3.0
 googleapis-common-protos-stubs==2.2.0
 grpc-stubs==1.24.11
+
+# TorchDistributor dependencies
+torch==1.13.1
+torchvision==0.14.1

Review Comment:
You actually don't need to pin this version. This is only for the development environment, and I would prefer to let other people test it with the latest versions.
[GitHub] [spark] HyukjinKwon closed pull request #39637: [SPARK-41777][PYSPARK][ML] Integration testing for TorchDistributor
HyukjinKwon closed pull request #39637: [SPARK-41777][PYSPARK][ML] Integration testing for TorchDistributor
URL: https://github.com/apache/spark/pull/39637
[GitHub] [spark] HyukjinKwon commented on pull request #39637: [SPARK-41777][PYSPARK][ML] Integration testing for TorchDistributor
HyukjinKwon commented on PR #39637:
URL: https://github.com/apache/spark/pull/39637#issuecomment-1399198148

Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #39299: [SPARK-41593][PYTHON][ML] Adding logging from executors
HyukjinKwon closed pull request #39299: [SPARK-41593][PYTHON][ML] Adding logging from executors
URL: https://github.com/apache/spark/pull/39299
[GitHub] [spark] HyukjinKwon commented on pull request #39299: [SPARK-41593][PYTHON][ML] Adding logging from executors
HyukjinKwon commented on PR #39299:
URL: https://github.com/apache/spark/pull/39299#issuecomment-1399197595

Merged to master.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
gengliangwang commented on code in PR #39682:
URL: https://github.com/apache/spark/pull/39682#discussion_r1083255761

## sql/core/src/test/scala/org/apache/spark/status/protobuf/sql/KVStoreProtobufSerializerSuite.scala:

@@ -48,6 +48,43 @@ class KVStoreProtobufSerializerSuite extends SparkFunSuite {
     assert(result.metricValues == input.metricValues)
   }

+  test("SQLExecutionUIData with null string fields") {

Review Comment:
Why don't we just update the previous test case with a new loop for null values? Or we can abstract the code for checking whether two SQLExecutionUIData instances are the same.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
gengliangwang commented on code in PR #39682:
URL: https://github.com/apache/spark/pull/39682#discussion_r1083255631

## sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLPlanMetricSerializer.scala:

@@ -19,18 +19,24 @@ package org.apache.spark.status.protobuf.sql

 import org.apache.spark.sql.execution.ui.SQLPlanMetric
 import org.apache.spark.status.protobuf.StoreTypes
+import org.apache.spark.status.protobuf.Utils._
+import org.apache.spark.util.Utils.weakIntern

 object SQLPlanMetricSerializer {

   def serialize(metric: SQLPlanMetric): StoreTypes.SQLPlanMetric = {
-    StoreTypes.SQLPlanMetric.newBuilder()
-      .setName(metric.name)
-      .setAccumulatorId(metric.accumulatorId)
-      .setMetricType(metric.metricType)
-      .build()
+    val builder = StoreTypes.SQLPlanMetric.newBuilder()
+    setStringField(metric.name, builder.setName)
+    builder.setAccumulatorId(metric.accumulatorId)
+    setStringField(metric.metricType, builder.setMetricType)
+    builder.build()
   }

   def deserialize(metrics: StoreTypes.SQLPlanMetric): SQLPlanMetric = {
-    SQLPlanMetric(metrics.getName, metrics.getAccumulatorId, metrics.getMetricType)
+    SQLPlanMetric(
+      name = getStringField(metrics.hasName, () => weakIntern(metrics.getName)),

Review Comment:
OK, you are adding weakIntern here.
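The diff above relies on `setStringField`/`getStringField` helpers from `org.apache.spark.status.protobuf.Utils`, which are not shown in this thread. A minimal self-contained Scala sketch of how such helpers plausibly behave (the object name and signatures here are assumptions for illustration, not the actual Spark source):

```scala
// Hypothetical stand-ins for the helpers referenced in the diff.
object NullSafeUtils {
  // Only invoke the builder setter when the value is non-null, because
  // generated protobuf builders throw NullPointerException on null input.
  def setStringField(input: String, f: String => Any): Unit = {
    if (input != null) f(input)
  }

  // Map protobuf's has/get pair back to a nullable String:
  // an unset optional field deserializes to null rather than "".
  def getStringField(condition: Boolean, result: () => String): String = {
    if (condition) result() else null
  }
}
```

With these helpers, a serializer never has to repeat the null check at every call site, and round-tripping preserves the distinction between a null field and an empty string.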
[GitHub] [spark] HyukjinKwon closed pull request #39628: [SPARK-40264][ML][DOCS] Supplement docstring in pyspark.ml.functions.predict_batch_udf
HyukjinKwon closed pull request #39628: [SPARK-40264][ML][DOCS] Supplement docstring in pyspark.ml.functions.predict_batch_udf
URL: https://github.com/apache/spark/pull/39628
[GitHub] [spark] HyukjinKwon commented on pull request #39628: [SPARK-40264][ML][DOCS] Supplement docstring in pyspark.ml.functions.predict_batch_udf
HyukjinKwon commented on PR #39628:
URL: https://github.com/apache/spark/pull/39628#issuecomment-1399197002

Merged to master.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39683: [SPARK-42144][CORE][SQL] Handle null string values in StageDataWrapper/StreamBlockData/StreamingQueryData
gengliangwang commented on code in PR #39683:
URL: https://github.com/apache/spark/pull/39683#discussion_r1083255313

## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto:

@@ -495,9 +495,10 @@ message RDDOperationGraphWrapper {
 }

 message StreamingQueryData {
-  string name = 1;
+  optional string name = 1;
+  // converted from uuid; id should not be null

Review Comment:
Let's simply convert all the string fields. We can even have tests to check whether there is any string field which is not optional.
[GitHub] [spark] gengliangwang commented on pull request #39685: [SPARK-42142][UI] Handle null string values in CachedQuantile/ExecutorSummary/PoolData
gengliangwang commented on PR #39685:
URL: https://github.com/apache/spark/pull/39685#issuecomment-1399196312

cc @LuciferYang
[GitHub] [spark] gengliangwang opened a new pull request, #39685: [SPARK-42142][UI] Handle null string values in CachedQuantile/ExecutorSummary/PoolData
gengliangwang opened a new pull request, #39685:
URL: https://github.com/apache/spark/pull/39685

### What changes were proposed in this pull request?
Similar to https://github.com/apache/spark/pull/39666, this PR handles null string values in CachedQuantile/ExecutorSummary/PoolData.

### Why are the changes needed?
Properly handle null string values in the protobuf serializer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UTs
[GitHub] [spark] LuciferYang commented on pull request #39684: [SPARK-42140][CORE] Handle null string values in ApplicationEnvironmentInfoWrapper/ApplicationInfoWrapper
LuciferYang commented on PR #39684:
URL: https://github.com/apache/spark/pull/39684#issuecomment-1399196254

cc @gengliangwang
[GitHub] [spark] LuciferYang opened a new pull request, #39684: [SPARK-42140][CORE] Handle null string values in ApplicationEnvironmentInfoWrapper/ApplicationInfoWrapper
LuciferYang opened a new pull request, #39684:
URL: https://github.com/apache/spark/pull/39684

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] HyukjinKwon closed pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Fix jar finding bug and use better env vars and time measurement
HyukjinKwon closed pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Fix jar finding bug and use better env vars and time measurement
URL: https://github.com/apache/spark/pull/39677
[GitHub] [spark] HyukjinKwon commented on pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Fix jar finding bug and use better env vars and time measurement
HyukjinKwon commented on PR #39677:
URL: https://github.com/apache/spark/pull/39677#issuecomment-1399194601

Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
HyukjinKwon closed pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
URL: https://github.com/apache/spark/pull/39550
[GitHub] [spark] HyukjinKwon commented on pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
HyukjinKwon commented on PR #39550:
URL: https://github.com/apache/spark/pull/39550#issuecomment-1399194512

Merged to master.
[GitHub] [spark] mridulm commented on pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
mridulm commented on PR #39682:
URL: https://github.com/apache/spark/pull/39682#issuecomment-1399193866

Thanks for clarifying - yeah, you have to use has/get, and for set either Optional(value).isPresent(set) or a null check.
[GitHub] [spark] gengliangwang closed pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
gengliangwang closed pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
URL: https://github.com/apache/spark/pull/39680
[GitHub] [spark] gengliangwang commented on pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
gengliangwang commented on PR #39680:
URL: https://github.com/apache/spark/pull/39680#issuecomment-1399192216

@dongjoon-hyun @LuciferYang Thanks for the review. Merging to master.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
gengliangwang commented on code in PR #39680:
URL: https://github.com/apache/spark/pull/39680#discussion_r1083252155

## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto:

@@ -83,10 +83,10 @@ message TaskDataWrapper {
   int64 launch_time = 5;
   int64 result_fetch_start = 6;
   int64 duration = 7;
-  string executor_id = 8;
-  string host = 9;
-  string status = 10;
-  string task_locality = 11;
+  optional string executor_id = 8;

Review Comment:
Yeah, that's complicated.
[GitHub] [spark] huaxingao commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator
huaxingao commented on PR #39678:
URL: https://github.com/apache/spark/pull/39678#issuecomment-1399184316

@RyanBerti Thanks for the great work! +1 for using the Apache DataSketches library. I am wondering if we can use [Theta Sketches](https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html), because they support union, intersection, and difference, while HLL sketches only support union. Trino recently added support for using Apache DataSketches' Theta sketch to calculate distinct values, and Apache Iceberg defines Apache DataSketches' Theta as the common data sketch for keeping track of distinct values in a table.
[GitHub] [spark] LuciferYang commented on a diff in pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
LuciferYang commented on code in PR #39680:
URL: https://github.com/apache/spark/pull/39680#discussion_r1083244804

## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto:

@@ -83,10 +83,10 @@ message TaskDataWrapper {
   int64 launch_time = 5;
   int64 result_fetch_start = 6;
   int64 duration = 7;
-  string executor_id = 8;
-  string host = 9;
-  string status = 10;
-  string task_locality = 11;
+  optional string executor_id = 8;

Review Comment:
Have we considered defining a message as follows?

```
message NullableString {
  bool is_null = 1;
  string value = 2;
}
```

In this way, we can distinguish the strings that are originally optional, but we need to record one more state.
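To make the trade-off of the proposed wrapper concrete, here is a Scala sketch modeling it as a plain case class rather than generated protobuf code (the names `NullableString`/`NullableStringCodec` are illustrative only; the PRs in this thread ultimately used proto3 `optional` fields instead):

```scala
// Models the proposed wrapper message and the extra is_null state it carries.
case class NullableString(isNull: Boolean, value: String)

object NullableStringCodec {
  // Serialize: a null Scala string becomes an explicit is_null marker,
  // so "" and null stay distinguishable on the wire.
  def toWrapper(s: String): NullableString =
    if (s == null) NullableString(isNull = true, value = "")
    else NullableString(isNull = false, value = s)

  // Deserialize: restore null exactly when the marker is set.
  def fromWrapper(w: NullableString): String =
    if (w.isNull) null else w.value
}
```

A proto3 `optional` field achieves the same presence tracking via the generated `hasX()` accessor, without a nested message per string.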
[GitHub] [spark] LuciferYang commented on pull request #39683: [SPARK-42144][CORE][SQL] Handle null string values in StageDataWrapper/StreamBlockData/StreamingQueryData
LuciferYang commented on PR #39683:
URL: https://github.com/apache/spark/pull/39683#issuecomment-1399179648

cc @gengliangwang
[GitHub] [spark] LuciferYang opened a new pull request, #39683: [SPARK-42144][CORE][SQL] Handle null string values in StageDataWrapper/StreamBlockData/StreamingQueryData
LuciferYang opened a new pull request, #39683:
URL: https://github.com/apache/spark/pull/39683

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] LuciferYang commented on pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
LuciferYang commented on PR #39682:
URL: https://github.com/apache/spark/pull/39682#issuecomment-1399169469

And there is another PR doing similar work: https://github.com/apache/spark/pull/39682 @mridulm
[GitHub] [spark] LuciferYang commented on pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
LuciferYang commented on PR #39682:
URL: https://github.com/apache/spark/pull/39682#issuecomment-1399169247

For the String type, `someProtoBufBuilder.setString(null)` will throw an NPE, so in https://github.com/apache/spark/pull/39666 @gengliangwang proposed:

- checking the nullability of all string fields to avoid NPEs
- using `optional string` in the protobuf definition of all string fields. If the deserialized result is None, then set the string field to null.
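The NPE described above can be reproduced without protobuf: generated Java builders reject null setter arguments. A self-contained Scala sketch, where `FakeBuilder` is a hypothetical stand-in mimicking that behavior:

```scala
// Mimics a generated protobuf builder: setters throw NullPointerException
// on null input, and an unset string defaults to "".
class FakeBuilder {
  private var name: String = ""
  def setName(v: String): FakeBuilder = {
    if (v == null) throw new NullPointerException("name must not be null")
    name = v
    this
  }
  def getName: String = name
}

// The guard pattern the PRs apply: only call the setter when non-null.
def setSafely(b: FakeBuilder, v: String): FakeBuilder = {
  Option(v).foreach(b.setName)
  b
}
```

The guard alone leaves null and "" indistinguishable after a round trip, which is why the PRs also switch the fields to `optional string` and use the `has` accessor on deserialization.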
[GitHub] [spark] mridulm commented on pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
mridulm commented on PR #39682:
URL: https://github.com/apache/spark/pull/39682#issuecomment-1399168603

I missed the discussion in #39666 ... can you tell me why this is an issue only for protobuf, @LuciferYang? If the field is null, it will be null for the regular case as well, right?
[GitHub] [spark] mridulm commented on pull request #39654: [MINOR][SHUFFLE] Include IOException in warning log of finalizeShuffleMerge
mridulm commented on PR #39654:
URL: https://github.com/apache/spark/pull/39654#issuecomment-1399168240

To clarify @tedyu, I am fine with including the exception message (ex.getMessage) in the warn message - not the entire stack trace - so there is value in the PR!
[GitHub] [spark] mridulm closed pull request #39645: [SPARK-41415][3.2] SASL Request Retries
mridulm closed pull request #39645: [SPARK-41415][3.2] SASL Request Retries
URL: https://github.com/apache/spark/pull/39645
[GitHub] [spark] mridulm commented on pull request #39645: [SPARK-41415][3.2] SASL Request Retries
mridulm commented on PR #39645:
URL: https://github.com/apache/spark/pull/39645#issuecomment-1399167106

Merged to 3.2. Thanks for fixing this, @akpatnam25!
[GitHub] [spark] LuciferYang commented on pull request #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SparkPlanGraphWrapper/SQLPlanMetric
LuciferYang commented on PR #39682:
URL: https://github.com/apache/spark/pull/39682#issuecomment-1399167087

cc @gengliangwang
[GitHub] [spark] mridulm commented on pull request #39644: [SPARK-41415][3.3] SASL Request Retries
mridulm commented on PR #39644: URL: https://github.com/apache/spark/pull/39644#issuecomment-1399166933 Closing, as the PR has been merged. Thanks for backporting this @akpatnam25 ! And thanks for all the help @dongjoon-hyun :-) (you are literally responding in seconds !)
[GitHub] [spark] mridulm closed pull request #39644: [SPARK-41415][3.3] SASL Request Retries
mridulm closed pull request #39644: [SPARK-41415][3.3] SASL Request Retries URL: https://github.com/apache/spark/pull/39644
[GitHub] [spark] mridulm commented on a diff in pull request #39645: [SPARK-41415][3.2] SASL Request Retries
mridulm commented on code in PR #39645: URL: https://github.com/apache/spark/pull/39645#discussion_r1083234871

## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java:

@@ -187,13 +196,29 @@ private synchronized void initiateRetry() {
   /**
    * Returns true if we should retry due a block transfer failure. We will retry if and only if
-   * the exception was an IOException and we haven't retried 'maxRetries' times already.
+   * the exception was an IOException or SaslTimeoutException and we haven't retried
+   * 'maxRetries' times already.
    */
   private synchronized boolean shouldRetry(Throwable e) {
     boolean isIOException = e instanceof IOException
-      || (e.getCause() != null && e.getCause() instanceof IOException);
+      || e.getCause() instanceof IOException;

Review Comment: This looks like the main change from 3.3/master related to this diff. This is fine.
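The simplification in this diff relies on a Java language detail worth spelling out: `instanceof` evaluates to `false` when its left operand is `null`, so the old `e.getCause() != null &&` guard was redundant. A minimal, self-contained sketch of the same check (not Spark code; the class name `RetryCheck` is made up for illustration):

```java
import java.io.IOException;

public class RetryCheck {
    // Same shape as the simplified condition in the diff above.
    // getCause() may return null, but `null instanceof IOException` is
    // simply false, so no explicit null check is needed.
    static boolean isIOException(Throwable e) {
        return e instanceof IOException
            || e.getCause() instanceof IOException;
    }

    public static void main(String[] args) {
        System.out.println(isIOException(new IOException("connection reset")));      // true: direct IOException
        System.out.println(isIOException(new RuntimeException("no cause")));         // false: cause is null
        System.out.println(isIOException(new RuntimeException(new IOException())));  // true: wrapped IOException
    }
}
```

The two forms are behaviorally identical; the shorter one is idiomatic Java.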
[GitHub] [spark] dongjoon-hyun commented on pull request #39644: [SPARK-41415][3.3] SASL Request Retries
dongjoon-hyun commented on PR #39644: URL: https://github.com/apache/spark/pull/39644#issuecomment-1399166157 Yes, only the `master` branch is closed automatically.
[GitHub] [spark] mridulm commented on pull request #39644: [SPARK-41415][3.3] SASL Request Retries
mridulm commented on PR #39644: URL: https://github.com/apache/spark/pull/39644#issuecomment-1399166043 Merged to branch-3.3. This does not get automatically closed @dongjoon-hyun ?
[GitHub] [spark] dongjoon-hyun commented on pull request #39644: [SPARK-41415][3.3] SASL Request Retries
dongjoon-hyun commented on PR #39644: URL: https://github.com/apache/spark/pull/39644#issuecomment-1399165752 Thank you!
[GitHub] [spark] mridulm commented on pull request #39644: [SPARK-41415][3.3] SASL Request Retries
mridulm commented on PR #39644: URL: https://github.com/apache/spark/pull/39644#issuecomment-1399165650 Thanks for the ping @dongjoon-hyun :) I will merge this to 3.3
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39638: [SPARK-42082][SPARK-41598][PYTHON][CONNECT] Introduce `PySparkValueError` and `PySparkTypeError`
HyukjinKwon commented on code in PR #39638: URL: https://github.com/apache/spark/pull/39638#discussion_r1083233072

## python/pyspark/sql/tests/test_functions.py:

@@ -763,25 +798,55 @@ def test_higher_order_function_failures(self):
         from pyspark.sql.functions import col, transform

         # Should fail with varargs
-        with self.assertRaises(ValueError):
+        with self.assertRaises(PySparkException) as pe:

Review Comment: +1
[GitHub] [spark] LuciferYang opened a new pull request, #39682: [SPARK-42139][CORE][SQL] Handle null string values in SQLExecutionUIData/SQLPlanMetric
LuciferYang opened a new pull request, #39682: URL: https://github.com/apache/spark/pull/39682

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] huaxingao commented on pull request #39676: [SPARK-42134][SQL] Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes
huaxingao commented on PR #39676: URL: https://github.com/apache/spark/pull/39676#issuecomment-1399156214 Merged to 3.3/master. Thanks for fixing this @peter-toth
[GitHub] [spark] huaxingao closed pull request #39676: [SPARK-42134][SQL] Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes
huaxingao closed pull request #39676: [SPARK-42134][SQL] Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes URL: https://github.com/apache/spark/pull/39676
[GitHub] [spark] joveyuan-db opened a new pull request, #39681: [SPARK-18011] Fix SparkR NA date serialization
joveyuan-db opened a new pull request, #39681: URL: https://github.com/apache/spark/pull/39681

### What changes were proposed in this pull request?
This PR ensures that SparkR serializes `NA` dates as `"NA"` (string) to avoid an undefined length when deserializing in the JVM.

### Why are the changes needed?
Currently, SparkR assumes that a `NegativeArraySizeException` when deserializing dates and timestamps represents an `NA`. However, this handling can be made more robust by ensuring that serialization on the R side always provides a valid string length (note that `nchar(as.character(as.Date(NA)))` is `NA`).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #39638: [SPARK-42082][SPARK-41598][PYTHON][CONNECT] Introduce `PySparkValueError` and `PySparkTypeError`
zhengruifeng commented on code in PR #39638: URL: https://github.com/apache/spark/pull/39638#discussion_r1083224562

## python/pyspark/sql/tests/test_functions.py:

@@ -763,25 +798,55 @@ def test_higher_order_function_failures(self):
         from pyspark.sql.functions import col, transform

         # Should fail with varargs
-        with self.assertRaises(ValueError):
+        with self.assertRaises(PySparkException) as pe:

Review Comment: should we use the subclasses PySparkValueError/PySparkTypeError in those tests?
[GitHub] [spark] LuciferYang commented on pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
LuciferYang commented on PR #39680: URL: https://github.com/apache/spark/pull/39680#issuecomment-1399145487

@gengliangwang @dongjoon-hyun I added some new tasks and will start work later. @panbingkun @techaddict feel free to take any of them and leave a comment that you are working on it:
- SPARK-42139: Handle null string values in SQLExecutionUIData/SQLPlanMetric
- SPARK-42140: Handle null string values in ApplicationEnvironmentInfo/RuntimeInfo/PairStrings/ExecutorResourceRequest/TaskResourceRequest
- SPARK-42141: Handle null string values in ApplicationInfo/ApplicationAttemptInfo
- SPARK-42142: Handle null string values in CachedQuantile/ExecutorSummary/PoolData
- SPARK-42143: Handle null string values in RDDStorageInfo/RDDDataDistribution/RDDPartitionInfo
- SPARK-42144: Handle null string values in StageData/StreamBlockData/StreamingQueryData
- SPARK-42145: Handle null string values in SparkPlanGraphNode/SparkPlanGraphClusterWrapper

Some may be further consolidated into one task. And I re-opened [SPARK-41677](https://issues.apache.org/jira/browse/SPARK-41677).
[GitHub] [spark] zhengruifeng commented on pull request #39622: [SPARK-42099][SPARK-41845][CONNECT][PYTHON] Fix `count(*)` and `count(col(*))`
zhengruifeng commented on PR #39622: URL: https://github.com/apache/spark/pull/39622#issuecomment-1399144510 merged into master, thank you @cloud-fan @HyukjinKwon @dongjoon-hyun !
[GitHub] [spark] zhengruifeng closed pull request #39622: [SPARK-42099][SPARK-41845][CONNECT][PYTHON] Fix `count(*)` and `count(col(*))`
zhengruifeng closed pull request #39622: [SPARK-42099][SPARK-41845][CONNECT][PYTHON] Fix `count(*)` and `count(col(*))` URL: https://github.com/apache/spark/pull/39622
[GitHub] [spark] dongjoon-hyun closed pull request #39679: [SPARK-42137][CORE] Enable `spark.kryo.unsafe` by default
dongjoon-hyun closed pull request #39679: [SPARK-42137][CORE] Enable `spark.kryo.unsafe` by default URL: https://github.com/apache/spark/pull/39679
[GitHub] [spark] dongjoon-hyun commented on pull request #39679: [SPARK-42137][CORE] Enable `spark.kryo.unsafe` by default
dongjoon-hyun commented on PR #39679: URL: https://github.com/apache/spark/pull/39679#issuecomment-1399134868 Merged to master for Apache Spark 3.4.0. One PySpark pipeline seems slow, but it's irrelevant to this PR and was verified here before.
[GitHub] [spark] dongjoon-hyun commented on pull request #39679: [SPARK-42137][CORE] Enable `spark.kryo.unsafe` by default
dongjoon-hyun commented on PR #39679: URL: https://github.com/apache/spark/pull/39679#issuecomment-1399118708 Thank you so much, @mridulm !
[GitHub] [spark] tedyu closed pull request #39654: [MINOR][SHUFFLE] Include IOException in warning log of finalizeShuffleMerge
tedyu closed pull request #39654: [MINOR][SHUFFLE] Include IOException in warning log of finalizeShuffleMerge URL: https://github.com/apache/spark/pull/39654
[GitHub] [spark] mridulm commented on a diff in pull request #39654: [MINOR][SHUFFLE] Include IOException in warning log of finalizeShuffleMerge
mridulm commented on code in PR #39654: URL: https://github.com/apache/spark/pull/39654#discussion_r1083189284

## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:

@@ -815,7 +815,7 @@ public MergeStatuses finalizeShuffleMerge(FinalizeShuffleMerge msg) {
       } catch (IOException ioe) {
         logger.warn("{} attempt {} shuffle {} shuffleMerge {}: exception while " +
           "finalizing shuffle partition {}", msg.appId, msg.appAttemptId, msg.shuffleId,
-          msg.shuffleMergeId, partition.reduceId);
+          msg.shuffleMergeId, partition.reduceId, ioe);

Review Comment: Including the stack trace in this warn message is not helpful; it is not actionable by users.
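For context on what the proposed change does: Spark logs through SLF4J, where a Throwable passed after the `{}` placeholder arguments is rendered with its full stack trace rather than substituted into the message. As a dependency-free illustration of the same message-plus-stack-trace convention, here is a hedged sketch using the JDK's `java.util.logging` (not Spark's logging stack; the class name is made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

public class WarnWithThrowable {
    // Logs a warning with `thrown` as the exception argument and returns
    // the formatted output, so we can inspect what actually gets written.
    static String logAndCapture(Throwable thrown) {
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        StreamHandler handler = new StreamHandler(new PrintStream(buf), new SimpleFormatter());
        logger.addHandler(handler);
        // Passing the Throwable makes the formatter append its stack trace;
        // without it, only the message text would appear.
        logger.log(Level.WARNING, "exception while finalizing shuffle partition", thrown);
        handler.flush();
        logger.removeHandler(handler);
        return buf.toString();
    }

    public static void main(String[] args) {
        String out = logAndCapture(new IOException("disk full"));
        System.out.println(out.contains("java.io.IOException: disk full")); // stack trace is included
    }
}
```

This is the trade-off the review comment is weighing: the extra argument buys a stack trace in the logs, at the cost of noise that operators may not be able to act on.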
[GitHub] [spark] dtenedor commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator
dtenedor commented on PR #39678: URL: https://github.com/apache/spark/pull/39678#issuecomment-1399101487

Hi @RyanBerti, after a few initial conversations about this proposal, we wanted to express some questions and opinions here for your consideration. In general, we wholeheartedly appreciate your time, energy, and attention in working on this collection of features in the interest of expanding the functionality and usefulness of Apache Spark (as well as increasing consistency with other engines).

* We feel strongly that the new function names should not begin with "approx_count_distinct", in order to differentiate the new sketch-based approach from the existing approximate distinct count implementation. In particular, the sketch-based approach exposes intermediate buffers suitable for saving into storage and then retrieving and merging later, whereas the existing "approx_count_distinct" aggregate function consumes input values and then returns the result immediately (without exposing any intermediate sketch to the user). This is essentially a function naming detail, and we can consider a prefix like "HLL_" for the new functions in order to accomplish this objective. This change should not consume too much time to bring into effect.
* After examining the existing [HyperLogLogPlusPlusHelper utility](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala), and comparing it to other formats for generating and merging sketches for similar purposes, we would like to express a strong preference for using the Apache DataSketches library instead (https://datasketches.apache.org/). This is a dedicated Apache project with more than a decade of investment behind it, supporting very high performance for multiple types of sketches (such as KLL sketches for computing approximate quantiles, for example) and with client library support in C++, Java, and Python. We can use the Java implementation to integrate with Apache Spark using Scala/Java integration. This also aligns with the principles of Apache Spark as an open source project.

Best, Daniel

@czarifis @mkaravel @gatorsmile @entong
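To illustrate the mergeability property the comment emphasizes (intermediate sketch buffers can be stored, retrieved, and combined later), here is a toy HyperLogLog in Java. This is an illustrative sketch only, not the Apache DataSketches implementation or Spark's `HyperLogLogPlusPlusHelper`; the class name and constants are made up. The key point: merging two sketches is an element-wise max of registers, and by construction that yields exactly the sketch of the combined stream.

```java
import java.util.Arrays;

public class TinyHll {
    static final int P = 10;                // 2^P = 1024 registers
    final byte[] regs = new byte[1 << P];

    // splitmix64 finalizer as a stand-in 64-bit hash
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void update(long item) {
        long h = hash(item);
        int idx = (int) (h >>> (64 - P));   // top P bits pick a register
        // rank = position of first 1-bit in the remaining bits (sentinel bounds it)
        int rank = Long.numberOfLeadingZeros((h << P) | (1L << (P - 1))) + 1;
        if (rank > regs[idx]) regs[idx] = (byte) rank;
    }

    // Union of two sketches: element-wise max of registers.
    void merge(TinyHll other) {
        for (int i = 0; i < regs.length; i++) {
            regs[i] = (byte) Math.max(regs[i], other.regs[i]);
        }
    }

    // Raw HLL estimate (no small-range or bias correction).
    double estimate() {
        double sum = 0;
        for (byte r : regs) sum += Math.pow(2.0, -r);
        int m = regs.length;
        return (0.7213 / (1 + 1.079 / m)) * m * m / sum;
    }

    public static void main(String[] args) {
        TinyHll a = new TinyHll(), b = new TinyHll(), c = new TinyHll();
        for (long i = 0; i < 6000; i++)     { a.update(i); c.update(i); }
        for (long i = 4000; i < 10000; i++) { b.update(i); c.update(i); }
        a.merge(b);                                        // a now summarizes both streams
        System.out.println(Arrays.equals(a.regs, c.regs)); // merged == sketch of the union: true
        System.out.printf("estimate for 10000 distinct items: %.0f%n", a.estimate());
    }
}
```

The raw estimate above omits HyperLogLog++'s small-range and bias corrections, which is part of why a battle-tested library like DataSketches is preferable in practice.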
[GitHub] [spark] dongjoon-hyun closed pull request #39675: [MINOR][DOCS] Remove Python 3.9 and Apache Arrow warning comment
dongjoon-hyun closed pull request #39675: [MINOR][DOCS] Remove Python 3.9 and Apache Arrow warning comment URL: https://github.com/apache/spark/pull/39675
[GitHub] [spark] panbingkun commented on pull request #39675: [MINOR][DOCS] Update the doc of arrow & kubernetes
panbingkun commented on PR #39675: URL: https://github.com/apache/spark/pull/39675#issuecomment-1399080548 > it looks like controversial because K8s v1.25 is not available yet and the proposed one is still beta until v1.24. Ok, let me narrow down to Arrow first.
[GitHub] [spark] gengliangwang commented on pull request #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
gengliangwang commented on PR #39680: URL: https://github.com/apache/spark/pull/39680#issuecomment-1399073949 cc @LuciferYang @panbingkun @techaddict
[GitHub] [spark] gengliangwang opened a new pull request, #39680: [SPARK-42138][UI] Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
gengliangwang opened a new pull request, #39680: URL: https://github.com/apache/spark/pull/39680

### What changes were proposed in this pull request?
Similar to https://github.com/apache/spark/pull/39666, this PR handles null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper

### Why are the changes needed?
Properly handles null string values in the protobuf serializer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UTs
[GitHub] [spark] xinrong-meng commented on a diff in pull request #39585: [SPARK-42124][PYTHON][CONNECT] Scalar Inline Python UDF in Spark Connect
xinrong-meng commented on code in PR #39585: URL: https://github.com/apache/spark/pull/39585#discussion_r1083112645

## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:

@@ -217,6 +218,28 @@ message Expression {
     bool is_user_defined_function = 4;
   }

+  message ScalarInlineUserDefinedFunction {

Review Comment: Sounds good. How about the new ScalarInlineUserDefinedFunction as shown below? We will then serialize the existing PythonUDF message into the `language_specific` field. Please let me know if that's what we want. @hvanhovell @grundprinzip

```
message ScalarInlineUserDefinedFunction {
  string function_name = 1;
  bool deterministic = 2;
  repeated Expression arguments = 3;
  FunctionLanguage language = 4;
  bytes language_specific = 5;
}
```
[GitHub] [spark] dongjoon-hyun commented on pull request #38376: [SPARK-40817][K8S] `spark.files` should preserve remote files
dongjoon-hyun commented on PR #38376: URL: https://github.com/apache/spark/pull/38376#issuecomment-1399039783 All PRs (master/3.3/3.2) are merged. I added you to the Apache Spark contributor group and assigned SPARK-40817 to you. Welcome to the Apache Spark community, @antonipp !
[GitHub] [spark] dongjoon-hyun closed pull request #39670: [SPARK-40817][K8S][3.2] `spark.files` should preserve remote files
dongjoon-hyun closed pull request #39670: [SPARK-40817][K8S][3.2] `spark.files` should preserve remote files URL: https://github.com/apache/spark/pull/39670
[GitHub] [spark] dongjoon-hyun closed pull request #39669: [SPARK-40817][K8S][3.3] `spark.files` should preserve remote files
dongjoon-hyun closed pull request #39669: [SPARK-40817][K8S][3.3] `spark.files` should preserve remote files URL: https://github.com/apache/spark/pull/39669
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39679: [SPARK-42137][CORE] Enable `spark.kryo.unsafe` by default
dongjoon-hyun opened a new pull request, #39679: URL: https://github.com/apache/spark/pull/39679

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] rangadi commented on pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
rangadi commented on PR #39550: URL: https://github.com/apache/spark/pull/39550#issuecomment-1399007794 @dongjoon-hyun @HyukjinKwon could you approve this and merge? It is mainly a documentation change, plus passing 'options' from Python to Scala.
[GitHub] [spark] rangadi commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
rangadi commented on code in PR #39550: URL: https://github.com/apache/spark/pull/39550#discussion_r1083089318

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala:

@@ -26,7 +26,8 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
 private[protobuf] case class CatalystDataToProtobuf(
     child: Expression,
     messageName: String,
-    descFilePath: Option[String] = None)
+    descFilePath: Option[String] = None,
+    options: Map[String, String] = Map.empty)

Review Comment: Agree. Options like 'JSON format' will be in open source (not being worked on yet). I think schema-registry support is the only one missing. It is a bit tricky to get in.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
SandishKumarHN commented on code in PR #39550: URL: https://github.com/apache/spark/pull/39550#discussion_r1083087017

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala:

@@ -26,7 +26,8 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
 private[protobuf] case class CatalystDataToProtobuf(
     child: Expression,
     messageName: String,
-    descFilePath: Option[String] = None)
+    descFilePath: Option[String] = None,
+    options: Map[String, String] = Map.empty)

Review Comment: These features would be useful to have in spark-protobuf.
[GitHub] [spark] gengliangwang closed pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary
gengliangwang closed pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary URL: https://github.com/apache/spark/pull/39666
[GitHub] [spark] gengliangwang commented on pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary
gengliangwang commented on PR #39666: URL: https://github.com/apache/spark/pull/39666#issuecomment-1398990949 @LuciferYang @dongjoon-hyun Thanks for the review. Merging to master
[GitHub] [spark] rangadi commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
rangadi commented on code in PR #39550: URL: https://github.com/apache/spark/pull/39550#discussion_r1083073269

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala:

@@ -26,7 +26,8 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
 private[protobuf] case class CatalystDataToProtobuf(
     child: Expression,
     messageName: String,
-    descFilePath: Option[String] = None)
+    descFilePath: Option[String] = None,
+    options: Map[String, String] = Map.empty)

Review Comment: We will likely add more options in the future. It is currently used in the Databricks distribution for a couple of things:
- Enable support for shaded Java classes (Databricks disables it by default due to the complexity of shading)
- Schema-registry related options

I can imagine future options like supporting JSON serialization (rather than binary bytes).
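The forward-compatibility pattern under discussion — a new `options` map defaulted to empty so existing call sites keep compiling while future flags can be threaded through — can be sketched with a stand-in case class (illustrative only; the real `CatalystDataToProtobuf` is a Catalyst expression with more fields):

```scala
// Illustrative stand-in for the pattern in the quoted diff, not Spark's class.
final case class ProtobufWriterConfig(
    messageName: String,
    descFilePath: Option[String] = None,
    options: Map[String, String] = Map.empty)

object ProtobufWriterConfig {
  def demo(): Unit = {
    // Old call sites compile unchanged because the new parameter has a default.
    val legacy = ProtobufWriterConfig("org.spark.examples.protobuf.ExampleEvent")
    // New call sites can opt into future behaviors via free-form options
    // (the "mode" key here is hypothetical).
    val tuned = legacy.copy(options = Map("mode" -> "json"))
    assert(legacy.options.isEmpty)
    assert(tuned.options("mode") == "json")
  }
}
```

Because the map is free-form, adding a new recognized key later is a source- and binary-compatible change, which is why the parameter is useful even before any key is consumed.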
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
SandishKumarHN commented on code in PR #39550: URL: https://github.com/apache/spark/pull/39550#discussion_r1083066809

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala:

@@ -26,7 +26,8 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
 private[protobuf] case class CatalystDataToProtobuf(
     child: Expression,
     messageName: String,
-    descFilePath: Option[String] = None)
+    descFilePath: Option[String] = None,
+    options: Map[String, String] = Map.empty)

Review Comment: @rangadi I don't see `options` being used anywhere in the process of converting CatalystData to Protobuf. Do we need it?
[GitHub] [spark] dtenedor commented on a diff in pull request #39657: [SPARK-42123][SQL] Include column default values in DESCRIBE output
dtenedor commented on code in PR #39657: URL: https://github.com/apache/spark/pull/39657#discussion_r1083065750

## sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala:

@@ -645,6 +645,16 @@ case class DescribeTableCommand(
     } else if (isExtended) {
       describeFormattedTableInfo(metadata, result)
     }
+
+    // If any columns have default values, append them to the result.
+    if (metadata.schema.fields.exists(_.metadata.contains(
+        ResolveDefaultColumns.CURRENT_DEFAULT_COLUMN_METADATA_KEY))) {
+      append(result, "", "", "")
+      append(result, "# Column Default Information", "", "")
+      metadata.schema.foreach { column =>
+        column.getCurrentDefaultValue().map(append(result, column.name, "", _))

Review Comment: good suggestion, this is done!

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:

@@ -225,4 +246,30 @@ class DescribeTableSuite extends DescribeTableSuiteBase with CommandSuiteBase {
       Row("Partition Provider", "Catalog", "")))
   }
 }
+
+  test("DESCRIBE TABLE EXTENDED of a table with a default column value") {

Review Comment: yes, I have added this testing as well.
[GitHub] [spark] dongjoon-hyun commented on pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary
dongjoon-hyun commented on PR #39666: URL: https://github.com/apache/spark/pull/39666#issuecomment-1398943624 Oh, sure.
[GitHub] [spark] srielau commented on a diff in pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
srielau commented on code in PR #38419: URL: https://github.com/apache/spark/pull/38419#discussion_r1083050679

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala:

@@ -937,4 +937,135 @@ class MathExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
    checkEvaluation(WidthBucket(5.35, 0.024, Double.NegativeInfinity, 5L), null)
    checkEvaluation(WidthBucket(5.35, 0.024, Double.PositiveInfinity, 5L), null)
  }

  test("truncate number") {
    def evaluateForEachItem(from: Int, to: Int, increment: Int)(f: (Int, Int) => Unit): Unit = {
      Range(from, to, increment).zipWithIndex.foreach { case (scale, i) =>
        f(scale, i)
      }
    }

    // Decimal
    val decimalInput = BigDecimal("123456789123456789.123456789")
    val truncDecimalResults = Seq(
      "123456789123456789.123456789", "123456789123456789.123456789",
      "123456789123456789.12345678", "123456789123456789.1234567",
      "123456789123456789.123456", "123456789123456789.12345",
      "123456789123456789.1234", "123456789123456789.123",
      "123456789123456789.12", "123456789123456789.1",
      "123456789123456789", "123456789123456780", "123456789123456700",
      "123456789123456000", "12345678912345", "12345678912340",
      "12345678912300", "12345678912000", "1234567891", "1234567890",
      "1234567800", "1234567000", "123456", "123450", "123400", "123000",
      "12", "10", "0", "0")

    evaluateForEachItem(10, -20, -1)((scale, index) => {
      checkEvaluation(TruncNumber(decimalInput, scale), BigDecimal(truncDecimalResults(index)))
      checkEvaluation(TruncNumber(-decimalInput, scale), -BigDecimal(truncDecimalResults(index)))
    })

    // Double
    val negativeDouble = -12345678.1234
    val neg_DoubleResults: Seq[Double] =
      Seq(-12345678.0, -12345670.0, -12345600.0, -12345000.0, -1234.0, -1230.0,
        -1200.0, -1000.0, 0.0, 0.0)
    val positiveDouble = 12345678.1234
    val pos_DoubleResults: Seq[Double] =
      Seq(12345678.0, 12345670.0, 12345600.0, 12345000.0, 1234.0, 1230.0,
        1200.0, 1000.0, 0.0, 0.0)

    evaluateForEachItem(0, -10, -1)((scale, index) => {
      checkEvaluation(TruncNumber(positiveDouble, scale), pos_DoubleResults(index))
      checkEvaluation(TruncNumber(negativeDouble, scale), neg_DoubleResults(index))
    })

    // Float
    val negativeFloat = -12345678.123f
    val neg_floatResults: Seq[Float] =
      Seq(-12345678f, -12345670f, -12345600f, -12345000f, -1234f, -1230f, -1200f,
        -1000f, 0.0f, 0.0f)
    val positiveFloat = 12345678.123f
    val pos_FloatResults: Seq[Float] =
      Seq(12345678f, 12345670f, 12345600f, 12345000f, 1234f, 1230f, 1200f, 1000f,
        0.0f, 0.0f)

    evaluateForEachItem(0, -10, -1)((scale, index) => {
      checkEvaluation(TruncNumber(positiveFloat, scale), pos_FloatResults(index))
      checkEvaluation(TruncNumber(negativeFloat, scale), neg_floatResults(index))
    })

    // Long
    val longInput = 123456789L
    val longResults: Seq[Long] =
      Seq(123456789L, 123456789L, 123456789L, 123456789L, 123456780L, 123456700L, 123456000L,
        12345L, 12340L, 12300L, 12000L, 1L, 0, 0)
    evaluateForEachItem(3, -11, -1)((scale, index) => {
      checkEvaluation(TruncNumber(longInput, scale), longResults(index))
      checkEvaluation(TruncNumber(-longInput, scale), -longResults(index))
    })

    // Int
    val intInput = 123456789
    val intResults: Seq[Int] =
      Seq(123456789, 123456789, 123456789, 123456789, 123456780, 123456700, 123456000, 12345,
        12340, 12300, 12000, 1, 0, 0)
    evaluateForEachItem(3, -11, -1)((scale, index) => {
      checkEvaluation(TruncNumber(intInput, scale), intResults(index))
      checkEvaluation(TruncNumber(-intInput, scale), -intResults(index))
    })

    // Short
    val shortInput: Short = 32767
    val shortResults: Seq[Short] =
      Seq(32767, 32767, 32767, 32767, 32760, 32700, 32000, 3, 0, 0)
    evaluateForEachItem(3, -6, -1)((scale, index) => {
      checkEvaluation(TruncNumber(shortInput, scale), shortResults(index))
      checkEvaluation(TruncNumber(-shortInput, scale), -shortResults(index))
    })

    // Byte
    val byteInput: Byte = 127
    val byteResults: Seq[Byte] =
      Seq(127, 127, 127, 127, 120, 100, 0, 0)
    evaluateForEachItem(3, -4, -1)((scale, index) => {
      checkEvaluation(TruncNumber(byteInput, scale), byteResults(index))
      checkEvaluation(TruncNumber(-byteInput, scale), -byteResults(index))
    })
  }
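The expected values in the quoted test data follow truncate-toward-zero at a given scale: a positive scale keeps that many fractional digits, while a negative scale zeroes out integer digits. A minimal illustrative helper — an assumption based on the test data, not Spark's actual `TruncNumber` implementation — using plain `BigDecimal` semantics:

```scala
// Truncate toward zero at `scale` digits; a negative scale truncates
// integer digits (e.g. scale = -3 zeroes out the last three digits).
def trunc(value: BigDecimal, scale: Int): BigDecimal =
  value.setScale(scale, BigDecimal.RoundingMode.DOWN)

assert(trunc(BigDecimal("12345678.1234"), 2) == BigDecimal("12345678.12"))
assert(trunc(BigDecimal("12345678.1234"), -3) == BigDecimal("12345000"))
// Truncation rounds toward zero, so negatives mirror positives exactly:
assert(trunc(BigDecimal("-12345678.1234"), -3) == BigDecimal("-12345000"))
```

`RoundingMode.DOWN` is what distinguishes truncation from `FLOOR` for negative inputs: `FLOOR` would move `-12345678.1234` away from zero to `-12346000`.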
[GitHub] [spark] vinodkc commented on a diff in pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
vinodkc commented on code in PR #38419: URL: https://github.com/apache/spark/pull/38419#discussion_r1083045053

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala:

@@ -937,4 +937,135 @@ class MathExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
    checkEvaluation(WidthBucket(5.35, 0.024, Double.NegativeInfinity, 5L), null)
    checkEvaluation(WidthBucket(5.35, 0.024, Double.PositiveInfinity, 5L), null)
  }

  test("truncate number") {
    def evaluateForEachItem(from: Int, to: Int, increment: Int)(f: (Int, Int) => Unit): Unit = {
      Range(from, to, increment).zipWithIndex.foreach { case (scale, i) =>
        f(scale, i)
      }
    }

    // Decimal
    val decimalInput = BigDecimal("123456789123456789.123456789")
    val truncDecimalResults = Seq(
      "123456789123456789.123456789", "123456789123456789.123456789",
      "123456789123456789.12345678", "123456789123456789.1234567",
      "123456789123456789.123456", "123456789123456789.12345",
      "123456789123456789.1234", "123456789123456789.123",
      "123456789123456789.12", "123456789123456789.1",
      "123456789123456789", "123456789123456780", "123456789123456700",
      "123456789123456000", "12345678912345", "12345678912340",
      "12345678912300", "12345678912000", "1234567891", "1234567890",
      "1234567800", "1234567000", "123456", "123450", "123400", "123000",
      "12", "10", "0", "0")

    evaluateForEachItem(10, -20, -1)((scale, index) => {
      checkEvaluation(TruncNumber(decimalInput, scale), BigDecimal(truncDecimalResults(index)))
      checkEvaluation(TruncNumber(-decimalInput, scale), -BigDecimal(truncDecimalResults(index)))
    })

    // Double
    val negativeDouble = -12345678.1234
    val neg_DoubleResults: Seq[Double] =
      Seq(-12345678.0, -12345670.0, -12345600.0, -12345000.0, -1234.0, -1230.0,
        -1200.0, -1000.0, 0.0, 0.0)
    val positiveDouble = 12345678.1234
    val pos_DoubleResults: Seq[Double] =
      Seq(12345678.0, 12345670.0, 12345600.0, 12345000.0, 1234.0, 1230.0,
        1200.0, 1000.0, 0.0, 0.0)

    evaluateForEachItem(0, -10, -1)((scale, index) => {
      checkEvaluation(TruncNumber(positiveDouble, scale), pos_DoubleResults(index))
      checkEvaluation(TruncNumber(negativeDouble, scale), neg_DoubleResults(index))
    })

    // Float
    val negativeFloat = -12345678.123f
    val neg_floatResults: Seq[Float] =
      Seq(-12345678f, -12345670f, -12345600f, -12345000f, -1234f, -1230f, -1200f,
        -1000f, 0.0f, 0.0f)
    val positiveFloat = 12345678.123f
    val pos_FloatResults: Seq[Float] =
      Seq(12345678f, 12345670f, 12345600f, 12345000f, 1234f, 1230f, 1200f, 1000f,
        0.0f, 0.0f)

    evaluateForEachItem(0, -10, -1)((scale, index) => {
      checkEvaluation(TruncNumber(positiveFloat, scale), pos_FloatResults(index))
      checkEvaluation(TruncNumber(negativeFloat, scale), neg_floatResults(index))
    })

    // Long
    val longInput = 123456789L
    val longResults: Seq[Long] =
      Seq(123456789L, 123456789L, 123456789L, 123456789L, 123456780L, 123456700L, 123456000L,
        12345L, 12340L, 12300L, 12000L, 1L, 0, 0)
    evaluateForEachItem(3, -11, -1)((scale, index) => {
      checkEvaluation(TruncNumber(longInput, scale), longResults(index))
      checkEvaluation(TruncNumber(-longInput, scale), -longResults(index))
    })

    // Int
    val intInput = 123456789
    val intResults: Seq[Int] =
      Seq(123456789, 123456789, 123456789, 123456789, 123456780, 123456700, 123456000, 12345,
        12340, 12300, 12000, 1, 0, 0)
    evaluateForEachItem(3, -11, -1)((scale, index) => {
      checkEvaluation(TruncNumber(intInput, scale), intResults(index))
      checkEvaluation(TruncNumber(-intInput, scale), -intResults(index))
    })

    // Short
    val shortInput: Short = 32767
    val shortResults: Seq[Short] =
      Seq(32767, 32767, 32767, 32767, 32760, 32700, 32000, 3, 0, 0)
    evaluateForEachItem(3, -6, -1)((scale, index) => {
      checkEvaluation(TruncNumber(shortInput, scale), shortResults(index))
      checkEvaluation(TruncNumber(-shortInput, scale), -shortResults(index))
    })

    // Byte
    val byteInput: Byte = 127
    val byteResults: Seq[Byte] =
      Seq(127, 127, 127, 127, 120, 100, 0, 0)
    evaluateForEachItem(3, -4, -1)((scale, index) => {
      checkEvaluation(TruncNumber(byteInput, scale), byteResults(index))
      checkEvaluation(TruncNumber(-byteInput, scale), -byteResults(index))
    })
  }
[GitHub] [spark] gengliangwang commented on pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary
gengliangwang commented on PR #39666: URL: https://github.com/apache/spark/pull/39666#issuecomment-1398910697 @dongjoon-hyun may I have your approval for this one, so that we can work in parallel to finish the refactoring before Spark 3.4.0 rc0?
[GitHub] [spark] gengliangwang commented on a diff in pull request #39657: [SPARK-42123][SQL] Include column default values in DESCRIBE output
gengliangwang commented on code in PR #39657: URL: https://github.com/apache/spark/pull/39657#discussion_r1082955637

## sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala:

@@ -645,6 +645,16 @@ case class DescribeTableCommand(
     } else if (isExtended) {
       describeFormattedTableInfo(metadata, result)
     }
+
+    // If any columns have default values, append them to the result.
+    if (metadata.schema.fields.exists(_.metadata.contains(
+        ResolveDefaultColumns.CURRENT_DEFAULT_COLUMN_METADATA_KEY))) {
+      append(result, "", "", "")
+      append(result, "# Column Default Information", "", "")
+      metadata.schema.foreach { column =>
+        column.getCurrentDefaultValue().map(append(result, column.name, "", _))

Review Comment: Shall we fill the data type column to make it easier to read? https://user-images.githubusercontent.com/1097932/213785998-09d5d34e-8a7b-4ba5-b8c2-6cab52fdc02a.png
[GitHub] [spark] gengliangwang commented on a diff in pull request #39657: [SPARK-42123][SQL] Include column default values in DESCRIBE output
gengliangwang commented on code in PR #39657: URL: https://github.com/apache/spark/pull/39657#discussion_r1082948320

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:

@@ -225,4 +246,30 @@ class DescribeTableSuite extends DescribeTableSuiteBase with CommandSuiteBase {
       Row("Partition Provider", "Catalog", "")))
   }
 }
+
+  test("DESCRIBE TABLE EXTENDED of a table with a default column value") {

Review Comment: nit: shall we update `describe.sql` so that we can have a straightforward end-to-end test?
[GitHub] [spark] srielau commented on a diff in pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
srielau commented on code in PR #38419: URL: https://github.com/apache/spark/pull/38419#discussion_r1082942017

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala:

@@ -937,4 +937,135 @@ class MathExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
    checkEvaluation(WidthBucket(5.35, 0.024, Double.NegativeInfinity, 5L), null)
    checkEvaluation(WidthBucket(5.35, 0.024, Double.PositiveInfinity, 5L), null)
  }

  test("truncate number") {
    def evaluateForEachItem(from: Int, to: Int, increment: Int)(f: (Int, Int) => Unit): Unit = {
      Range(from, to, increment).zipWithIndex.foreach { case (scale, i) =>
        f(scale, i)
      }
    }

    // Decimal
    val decimalInput = BigDecimal("123456789123456789.123456789")
    val truncDecimalResults = Seq(
      "123456789123456789.123456789", "123456789123456789.123456789",
      "123456789123456789.12345678", "123456789123456789.1234567",
      "123456789123456789.123456", "123456789123456789.12345",
      "123456789123456789.1234", "123456789123456789.123",
      "123456789123456789.12", "123456789123456789.1",
      "123456789123456789", "123456789123456780", "123456789123456700",
      "123456789123456000", "12345678912345", "12345678912340",
      "12345678912300", "12345678912000", "1234567891", "1234567890",
      "1234567800", "1234567000", "123456", "123450", "123400", "123000",
      "12", "10", "0", "0")

    evaluateForEachItem(10, -20, -1)((scale, index) => {
      checkEvaluation(TruncNumber(decimalInput, scale), BigDecimal(truncDecimalResults(index)))
      checkEvaluation(TruncNumber(-decimalInput, scale), -BigDecimal(truncDecimalResults(index)))
    })

    // Double
    val negativeDouble = -12345678.1234
    val neg_DoubleResults: Seq[Double] =
      Seq(-12345678.0, -12345670.0, -12345600.0, -12345000.0, -1234.0, -1230.0,
        -1200.0, -1000.0, 0.0, 0.0)
    val positiveDouble = 12345678.1234
    val pos_DoubleResults: Seq[Double] =
      Seq(12345678.0, 12345670.0, 12345600.0, 12345000.0, 1234.0, 1230.0,
        1200.0, 1000.0, 0.0, 0.0)

    evaluateForEachItem(0, -10, -1)((scale, index) => {
      checkEvaluation(TruncNumber(positiveDouble, scale), pos_DoubleResults(index))
      checkEvaluation(TruncNumber(negativeDouble, scale), neg_DoubleResults(index))
    })

    // Float
    val negativeFloat = -12345678.123f
    val neg_floatResults: Seq[Float] =
      Seq(-12345678f, -12345670f, -12345600f, -12345000f, -1234f, -1230f, -1200f,
        -1000f, 0.0f, 0.0f)
    val positiveFloat = 12345678.123f
    val pos_FloatResults: Seq[Float] =
      Seq(12345678f, 12345670f, 12345600f, 12345000f, 1234f, 1230f, 1200f, 1000f,
        0.0f, 0.0f)

    evaluateForEachItem(0, -10, -1)((scale, index) => {
      checkEvaluation(TruncNumber(positiveFloat, scale), pos_FloatResults(index))
      checkEvaluation(TruncNumber(negativeFloat, scale), neg_floatResults(index))
    })

    // Long
    val longInput = 123456789L
    val longResults: Seq[Long] =
      Seq(123456789L, 123456789L, 123456789L, 123456789L, 123456780L, 123456700L, 123456000L,
        12345L, 12340L, 12300L, 12000L, 1L, 0, 0)
    evaluateForEachItem(3, -11, -1)((scale, index) => {
      checkEvaluation(TruncNumber(longInput, scale), longResults(index))
      checkEvaluation(TruncNumber(-longInput, scale), -longResults(index))
    })

    // Int
    val intInput = 123456789
    val intResults: Seq[Int] =
      Seq(123456789, 123456789, 123456789, 123456789, 123456780, 123456700, 123456000, 12345,
        12340, 12300, 12000, 1, 0, 0)
    evaluateForEachItem(3, -11, -1)((scale, index) => {
      checkEvaluation(TruncNumber(intInput, scale), intResults(index))
      checkEvaluation(TruncNumber(-intInput, scale), -intResults(index))
    })

    // Short
    val shortInput: Short = 32767
    val shortResults: Seq[Short] =
      Seq(32767, 32767, 32767, 32767, 32760, 32700, 32000, 3, 0, 0)
    evaluateForEachItem(3, -6, -1)((scale, index) => {
      checkEvaluation(TruncNumber(shortInput, scale), shortResults(index))
      checkEvaluation(TruncNumber(-shortInput, scale), -shortResults(index))
    })

    // Byte
    val byteInput: Byte = 127
    val byteResults: Seq[Byte] =
      Seq(127, 127, 127, 127, 120, 100, 0, 0)
    evaluateForEachItem(3, -4, -1)((scale, index) => {
      checkEvaluation(TruncNumber(byteInput, scale), byteResults(index))
      checkEvaluation(TruncNumber(-byteInput, scale), -byteResults(index))
    })
  }
[GitHub] [spark] gengliangwang commented on a diff in pull request #39666: [SPARK-42130][UI] Handle null string values in AccumulableInfo and ProcessSummary
gengliangwang commented on code in PR #39666: URL: https://github.com/apache/spark/pull/39666#discussion_r1082936874

## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto:

@@ -22,7 +22,12 @@ package org.apache.spark.status.protobuf;
  * Developer guides:
  * - Coding style: https://developers.google.com/protocol-buffers/docs/style
  * - Use int64 for job/stage IDs, in case of future extension in Spark core.
- * - Use `weakIntern` on string values in create new objects during deserialization.
+ * - For string fields:
+ *   - use `optional string` for protobuf definition
+ *   - on serialization, check if the string is null to avoid NPE
+ *   - on deserialization, set string fields as null if it is not set. Also, use `weakIntern` on
+ *     string values in create new objects during deserialization.
+ *   - add tests with null string inputs

Review Comment: AFAIK no. Unless we modify the protoc for generating java. I think having tests for setting all string fields as nulls should be enough for now.
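The convention being documented — `optional string` in the proto schema, a null check on write, and a `has…` check plus interning on read — can be sketched with a stand-in for a protoc-generated message. The `Info` type and `weakIntern` helper below are hypothetical placeholders; the real code uses protoc-generated builders and Spark's interning utility:

```scala
// Stand-in mimicking a protoc-generated message with `optional string name`:
// hasName/getName mirror the generated has-/get- accessors.
final case class Info(private val name: Option[String]) {
  def hasName: Boolean = name.isDefined
  def getName: String = name.getOrElse("")
}

// Placeholder for Spark's string-interning utility (identity here).
def weakIntern(s: String): String = s

// Serialization: only set the field when the value is non-null (avoids NPE).
def serialize(name: String): Info = Info(Option(name))

// Deserialization: restore null when the field was never set; intern otherwise.
def deserialize(info: Info): String =
  if (info.hasName) weakIntern(info.getName) else null

assert(deserialize(serialize(null)) == null)       // null round-trips as null
assert(deserialize(serialize("app-123")) == "app-123")
```

The `optional` keyword matters because it gives the generated Java class explicit field presence (`hasName`), which is what lets deserialization distinguish "never set" from an empty string.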
[GitHub] [spark] ggwiebe commented on pull request #39519: [SPARK-41995][SQL] Accept non-foldable expressions in schema_of_json
ggwiebe commented on PR #39519: URL: https://github.com/apache/spark/pull/39519#issuecomment-1398786764 This is great! I am helping a customer who receives many different types of messages enveloped in a structured stream. Today I cannot give a schema for the message, because each row of the column can hold a different message type, with a different schema stored in the message value field. The ability to dynamically provide different values without an a priori schema will be super useful. Thanks!
[GitHub] [spark] rangadi commented on a diff in pull request #39550: [SPARK-42056][SQL][PROTOBUF] Add missing options for Protobuf functions
rangadi commented on code in PR #39550: URL: https://github.com/apache/spark/pull/39550#discussion_r1082925694

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/functions.scala:

@@ -80,14 +81,37 @@ object functions {
    *
    * @param data
    *   the binary column.
-   * @param shadedMessageClassName
+   * @param messageClassName
    *   The Protobuf class name. E.g. org.spark.examples.protobuf.ExampleEvent.
    *   The jar with these classes needs to be shaded as described above.
    * @since 3.4.0
    */
   @Experimental
-  def from_protobuf(data: Column, shadedMessageClassName: String): Column = {
-    new Column(ProtobufDataToCatalyst(data.expr, shadedMessageClassName))
+  def from_protobuf(data: Column, messageClassName: String): Column = {
+    new Column(ProtobufDataToCatalyst(data.expr, messageClassName))
+  }
+
+  /**
+   * Converts a binary column of Protobuf format into its corresponding catalyst value. The
+   * specified Protobuf class must match the data, otherwise the behavior is
+   * undefined: it may fail or return arbitrary result. The jar containing Java class should be
+   * shaded. Specifically, `com.google.protobuf.*` should be shaded to

Review Comment: @SandishKumarHN made a big update to both Scala and Python functions. PTAL.
[GitHub] [spark] RyanBerti commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator
RyanBerti commented on PR #39678: URL: https://github.com/apache/spark/pull/39678#issuecomment-1398762642 For reference, @dtenedor worked with me on a pre-review of these changes; relevant discussions are available in this PR in my fork: https://github.com/RyanBerti/spark/pull/1
[GitHub] [spark] rithwik-db commented on a diff in pull request #39369: [SPARK-41775][PYTHON][ML] Adding support for PyTorch functions
rithwik-db commented on code in PR #39369: URL: https://github.com/apache/spark/pull/39369#discussion_r1082896210

## python/pyspark/ml/torch/tests/test_distributor.py:

@@ -349,6 +434,13 @@ def test_get_num_tasks_distributed(self) -> None:
         self.spark.sparkContext._conf.set("spark.task.resource.gpu.amount", "1")

+    # def test_end_to_end_run_distributedly(self) -> None:

Review Comment: Sorry, that shouldn't have been commented out. I will quickly update it, adding that test back in.
[GitHub] [spark] lu-wang-dl commented on a diff in pull request #39369: [SPARK-41775][PYTHON][ML] Adding support for PyTorch functions
lu-wang-dl commented on code in PR #39369: URL: https://github.com/apache/spark/pull/39369#discussion_r1082894756 ## python/pyspark/ml/torch/tests/test_distributor.py: ## @@ -349,6 +434,13 @@ def test_get_num_tasks_distributed(self) -> None: self.spark.sparkContext._conf.set("spark.task.resource.gpu.amount", "1") +# def test_end_to_end_run_distributedly(self) -> None: Review Comment: Clean up this?
[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398718941 FYI, GitHub Action is currently on the old version but `jdk8u362` will be automatically applied in one or two weeks. I'll keep monitoring the version change. (screenshot: https://user-images.githubusercontent.com/9700541/213764829-44ae3b35-54e1-4420-8741-9f10e720eec0.png)
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Better env var and a few bug fixes
dongjoon-hyun commented on code in PR #39677: URL: https://github.com/apache/spark/pull/39677#discussion_r1082842120 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -121,14 +117,14 @@ object SparkConnectServerUtils { s"Fail to locate the spark connect server target folder: '${parentDir.getCanonicalPath}'. " + s"SPARK_HOME='${new File(sparkHome).getCanonicalPath}'. " + "Make sure the spark connect server jar has been built " + -"and the system property `SPARK_HOME` is set correctly.") +"and the env variable `SPARK_HOME` is set correctly.") val jars = recursiveListFiles(parentDir).filter { f => // SBT jar (f.getParentFile.getName.startsWith("scala-") && -f.getName.startsWith("spark-connect-assembly") && f.getName.endsWith("SNAPSHOT.jar")) || +f.getName.startsWith("spark-connect-assembly") && f.getName.endsWith(".jar")) || Review Comment: Thank you for the fix.
[GitHub] [spark] dongjoon-hyun commented on pull request #39677: [SPARK-42043][CONNECT][TEST][FOLLOWUP] Better env var and a few bug fixes
dongjoon-hyun commented on PR #39677: URL: https://github.com/apache/spark/pull/39677#issuecomment-1398677709 Hi, @zhenlineo . When we use the same JIRA id, we need to add `[FOLLOWUP]` in the PR title. I added it this time.
[GitHub] [spark] dongjoon-hyun closed pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362
dongjoon-hyun closed pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362 URL: https://github.com/apache/spark/pull/39671
[GitHub] [spark] dongjoon-hyun commented on pull request #39671: [SPARK-40303][DOCS] Deprecate old Java 8 versions prior to 8u362
dongjoon-hyun commented on PR #39671: URL: https://github.com/apache/spark/pull/39671#issuecomment-1398671949 Now, both Zulu and Adoptium (Temurin) are available. Thank you all. Merged to master. (screenshot: https://user-images.githubusercontent.com/9700541/213759635-e5163f33-ab6e-4f02-a981-d2044cd171b2.png)
[GitHub] [spark] srowen commented on pull request #39190: [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API
srowen commented on PR #39190: URL: https://github.com/apache/spark/pull/39190#issuecomment-1398662080 FWIW, this part was last changed in https://issues.apache.org/jira/browse/SPARK-24415 to fix a different bug (CC @ankuriitg ). It might be worth re-running the simple example there to see if this retains the 'fix', but evidently the tests added in that old change still pass here. While I'm always wary of touching this core code and I myself don't know it well, this seems fairly convincing.
[GitHub] [spark] sunchao commented on pull request #39633: [SPARK-42038][SQL] SPJ: Support partially clustered distribution
sunchao commented on PR #39633: URL: https://github.com/apache/spark/pull/39633#issuecomment-1398661965 @cloud-fan the idea is similar to skew join but for v2 sources, let me try to split the code into a separate rule following your idea.
[GitHub] [spark] zhenlineo commented on pull request #39677: [SPARK-42043][CONNECT][TEST] Better env var and a few bug fixes
zhenlineo commented on PR #39677: URL: https://github.com/apache/spark/pull/39677#issuecomment-1398630857 cc @HyukjinKwon @LuciferYang Thanks for the review. I addressed the immediate fixes in this PR. For other improvements, I've created tickets and we can add them as follow-ups.
[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082775554 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.connect.client.util + +import java.io.{BufferedOutputStream, File} + +import scala.io.Source + +import org.scalatest.BeforeAndAfterAll +import sys.process._ + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.connect.client.SparkConnectClient +import org.apache.spark.sql.connect.common.config.ConnectCommon + +/** + * An util class to start a local spark connect server in a different process for local E2E tests. + * It is designed to start the server once but shared by all tests.
It is equivalent to use the + * following command to start the connect server via command line: + * + * {{{ + * bin/spark-shell \ + * --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \ + * --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin + * }}} + * + * Set system property `SPARK_HOME` if the test is not executed from the Spark project top folder. + * Set system property `DEBUG_SC_JVM_CLIENT=true` to print the server process output in the + * console to debug server start stop problems. + */ +object SparkConnectServerUtils { + // System properties used for testing and debugging + private val SPARK_HOME = "SPARK_HOME" + private val ENV_DEBUG_SC_JVM_CLIENT = "DEBUG_SC_JVM_CLIENT" + + private val sparkHome = System.getProperty(SPARK_HOME, fileSparkHome()) + private val isDebug = System.getProperty(ENV_DEBUG_SC_JVM_CLIENT, "false").toBoolean + + // Log server start stop debug info into console + // scalastyle:off println + private[connect] def debug(msg: String): Unit = if (isDebug) println(msg) + // scalastyle:on println + private[connect] def debug(error: Throwable): Unit = if (isDebug) error.printStackTrace() + + // Server port + private[connect] val port = ConnectCommon.CONNECT_GRPC_BINDING_PORT + util.Random.nextInt(1000) + + @volatile private var stopped = false + + private lazy val sparkConnect: Process = { +debug("Starting the Spark Connect Server...") +val jar = findSparkConnectJar +val builder = Process( + new File(sparkHome, "bin/spark-shell").getCanonicalPath, + Seq( +"--jars", +jar, +"--driver-class-path", +jar, +"--conf", +"spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin", +"--conf", +s"spark.connect.grpc.binding.port=$port")) + +val io = new ProcessIO( + // Hold the input channel to the spark console to keep the console open + in => new BufferedOutputStream(in), + // Only redirect output if debug to avoid channel interruption error on process termination.
+ out => if (isDebug) Source.fromInputStream(out).getLines.foreach(debug), + err => if (isDebug) Source.fromInputStream(err).getLines.foreach(debug)) +val process = builder.run(io) + +// Adding JVM shutdown hook +sys.addShutdownHook(kill()) +process + } + + def start(): Unit = { +assert(!stopped) +sparkConnect Review Comment: I tried to stop the server properly by sending `:quit` to the shell process, but it only works in about 20% of the runs. As it is too unreliable, I had to use kill. Do you know a better way to stop the shell properly? Thanks a lot.
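The graceful-shutdown idea discussed in this comment — writing `:quit` to the shell's stdin and letting the process exit on its own rather than killing it — can be sketched outside Spark. This is a minimal, hypothetical illustration, not the PR's code: `cat` stands in for the `spark-shell` subprocess, since closing stdin makes either one exit.

```python
import subprocess

# Minimal sketch: stop a REPL-like subprocess by writing a quit command
# to its stdin and closing the stream, rather than killing the process.
# `cat` is a stand-in for spark-shell; with spark-shell the line written
# here would be the ":quit" command the reviewer mentions.
proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                        stdout=subprocess.DEVNULL)
proc.stdin.write(b":quit\n")  # quit command (cat just echoes it as data)
proc.stdin.close()            # EOF on stdin lets the child exit cleanly
exit_code = proc.wait(timeout=10)
```

When the child ignores EOF (as the comment suggests spark-shell often does), falling back to `proc.terminate()` after a timeout reproduces the kill that the test harness resorts to.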
[GitHub] [spark] zhenlineo commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests
zhenlineo commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1082771233 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.connect.client.util + +import java.io.{BufferedOutputStream, File} + +import scala.io.Source + +import org.scalatest.BeforeAndAfterAll +import sys.process._ + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.connect.client.SparkConnectClient +import org.apache.spark.sql.connect.common.config.ConnectCommon + +/** + * An util class to start a local spark connect server in a different process for local E2E tests. + * It is designed to start the server once but shared by all tests.
It is equivalent to use the + * following command to start the connect server via command line: + * + * {{{ + * bin/spark-shell \ + * --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \ + * --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin + * }}} + * + * Set system property `SPARK_HOME` if the test is not executed from the Spark project top folder. + * Set system property `DEBUG_SC_JVM_CLIENT=true` to print the server process output in the + * console to debug server start stop problems. + */ +object SparkConnectServerUtils { + // System properties used for testing and debugging + private val SPARK_HOME = "SPARK_HOME" + private val ENV_DEBUG_SC_JVM_CLIENT = "DEBUG_SC_JVM_CLIENT" + + private val sparkHome = System.getProperty(SPARK_HOME, fileSparkHome()) + private val isDebug = System.getProperty(ENV_DEBUG_SC_JVM_CLIENT, "false").toBoolean + + // Log server start stop debug info into console + // scalastyle:off println + private[connect] def debug(msg: String): Unit = if (isDebug) println(msg) Review Comment: In normal runs, we shall not print anything out. The prints are merely added for me to quickly debug any server start errors when setting up the tests. The Scala client needs a new logging system, which we will add in a follow-up PR. Let me know if the recommendation is to use log4j for all test outputs; then I will migrate these tests together. https://issues.apache.org/jira/browse/SPARK-42135
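The flag-gated debug printing discussed above — only emit output when `DEBUG_SC_JVM_CLIENT` is set to true — is a small pattern worth showing on its own. This is a hypothetical Python rendering of the quoted Scala gate; the env variable name is borrowed from the diff, the rest is an assumption.

```python
import io
import os
from contextlib import redirect_stdout

# Mirror of the isDebug gate in the quoted Scala test harness: print
# diagnostics only when DEBUG_SC_JVM_CLIENT is set to "true".
IS_DEBUG = os.environ.get("DEBUG_SC_JVM_CLIENT", "false").lower() == "true"

def debug(msg: str) -> None:
    if IS_DEBUG:
        print(msg)

# With the flag unset (the normal run), debug() emits nothing:
buf = io.StringIO()
with redirect_stdout(buf):
    debug("Starting the Spark Connect Server...")
```

Gating on an environment variable rather than a log level keeps the harness silent by default while remaining easy to switch on from the command line, which matches the debugging workflow described in the comment.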