[GitHub] [spark] github-actions[bot] commented on pull request #36695: [SPARK-38474][CORE] Use error class in org.apache.spark.security
github-actions[bot] commented on PR #36695: URL: https://github.com/apache/spark/pull/36695#issuecomment-1319387650 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #38680: [SPARK-40657][PROTOBUF][FOLLOWUP][MINOR] Add clarifying comment in ProtobufUtils
HeartSaVioR commented on PR #38680: URL: https://github.com/apache/spark/pull/38680#issuecomment-1319400995 https://github.com/rangadi/spark/actions/runs/3490416069/jobs/5847475694 The second build run is passing for most jobs; the k8s integration job is still pending, but I don't believe this PR will break it. Thanks! Merging to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
HyukjinKwon commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025959335

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:

```
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))
```

Review Comment: Ah .. this is tricky.. Let's just probably mark this test as `ignore` for now then.. with adding some comments like some custom memory configurations would have to be applied ...
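The `assume(!sys.env.contains("GITHUB_ACTIONS"))` gating discussed above has a close analogue in Python's unittest; this is a minimal sketch, not the actual suite (the test body is a placeholder for the real >2GB allocation):

```python
import os
import unittest

# GITHUB_ACTIONS is set automatically inside GitHub Actions runners; the
# ScalaTest suite above keys off the same variable via assume().
ON_GITHUB_ACTIONS = "GITHUB_ACTIONS" in os.environ

class LargeResultTest(unittest.TestCase):
    @unittest.skipIf(ON_GITHUB_ACTIONS, "needs more memory than CI runners provide")
    def test_collect_large_partition(self):
        # Stand-in for allocating a >2GB result in the real suite.
        data = bytearray(1024)
        self.assertEqual(len(data), 1024)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(LargeResultTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A skipped test still counts as run and successful, so the suite result is green either way, which is exactly the CI behavior the thread wants.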
[GitHub] [spark] LuciferYang commented on pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingS
LuciferYang commented on PR #38704: URL: https://github.com/apache/spark/pull/38704#issuecomment-1319535893 cc @HyukjinKwon
[GitHub] [spark] LuciferYang opened a new pull request, #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollecting
LuciferYang opened a new pull request, #38704: URL: https://github.com/apache/spark/pull/38704

### What changes were proposed in this pull request?
This PR ignores `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingSuite` by default, because it cannot run successfully with Spark's default Java options.

### Why are the changes needed?
Avoids local test failures.

### Does this PR introduce _any_ user-facing change?
No, test only.

### How was this patch tested?
- Pass GA
- Manual test: in my test environment, after changing `-Xmx4g` to `-Xmx10g`, both Maven and SBT can run the test successfully in my
[GitHub] [spark] cloud-fan commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
cloud-fan commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319573470 shall we change `FileSourceMetadataAttribute`? I think the metadata column (at least for file source) is always not nullable.
[GitHub] [spark] LuciferYang opened a new pull request, #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
LuciferYang opened a new pull request, #38705: URL: https://github.com/apache/spark/pull/38705 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] Yaohua628 commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
Yaohua628 commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319593235 > shall we change `FileSourceMetadataAttribute`? I initially thought we could relax this field for some future cases. But yeah, you are right, it seems like it is always not null for file sources. But do you think it will cause some compatibility issues? If this `nullable` has been persisted somewhere?
[GitHub] [spark] github-actions[bot] commented on pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
github-actions[bot] commented on PR #37129: URL: https://github.com/apache/spark/pull/37129#issuecomment-1319387624 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files
github-actions[bot] commented on PR #37359: URL: https://github.com/apache/spark/pull/37359#issuecomment-1319387601 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] HyukjinKwon closed pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
HyukjinKwon closed pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module URL: https://github.com/apache/spark/pull/38690
[GitHub] [spark] HyukjinKwon commented on pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
HyukjinKwon commented on PR #38690: URL: https://github.com/apache/spark/pull/38690#issuecomment-1319471042 Merged to master.
[GitHub] [spark] panbingkun commented on pull request #37725: [DO-NOT-MERGE] Exceptions without error classes in SQL golden files
panbingkun commented on PR #37725: URL: https://github.com/apache/spark/pull/37725#issuecomment-1319476144 OK
[GitHub] [spark] Yikun commented on pull request #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1319483376 First testing with mlflow 1.30.0; if the result passes, I will apply the fully refreshed infra (mlflow 2.0.1) in the next commits.
[GitHub] [spark] WeichenXu123 opened a new pull request, #38699: [SPARK-41188][CORE][ML] Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
WeichenXu123 opened a new pull request, #38699: URL: https://github.com/apache/spark/pull/38699 Signed-off-by: Weichen Xu

### What changes were proposed in this pull request?
Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for Spark executor JVM processes.

### Why are the changes needed?
This limits the number of threads an OpenBLAS routine uses to the number of cores assigned to the executor, because some Spark ML algorithms call OpenBLAS via netlib-java.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually.
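The defaulting rule the PR describes can be sketched in a few lines; `with_default_omp_threads` is a hypothetical helper, not Spark's actual code, but it captures the intended behavior (set `OMP_NUM_THREADS` only when the user has not):

```python
def with_default_omp_threads(executor_env, task_cpus):
    """Hypothetical helper mirroring the PR's rule: default OMP_NUM_THREADS in
    the executor environment to spark.task.cpus unless the user already set it."""
    env = dict(executor_env)
    env.setdefault("OMP_NUM_THREADS", str(task_cpus))
    return env

# Default applied when unset; an explicit user value is left untouched.
print(with_default_omp_threads({}, 2))                        # {'OMP_NUM_THREADS': '2'}
print(with_default_omp_threads({"OMP_NUM_THREADS": "8"}, 2))  # {'OMP_NUM_THREADS': '8'}
```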
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
HyukjinKwon commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1026007189

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:

```
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))
```

Review Comment: Thanks man
[GitHub] [spark] zhengruifeng opened a new pull request, #38706: [TEST ONLY] Come back to collect.foreach(send)
zhengruifeng opened a new pull request, #38706: URL: https://github.com/apache/spark/pull/38706 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] wangyum commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
wangyum commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1319657140 @EnricoMi @cloud-fan Could we fix the `DeduplicateRelations`? It did not generate different expression IDs for all conflicting attributes:
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
 Join LeftSemi                              Join LeftSemi
 :- Project [(id#4 + 1) AS id#6]            :- Project [(id#4 + 1) AS id#6]
 :  +- Deduplicate [id#4]                   :  +- Deduplicate [id#4]
 :     +- Project [value#1 AS id#4]         :     +- Project [value#1 AS id#4]
 :        +- LocalRelation [value#1]        :        +- LocalRelation [value#1]
 +- Deduplicate [id#4]                      +- Deduplicate [id#4]
!   +- Project [value#1 AS id#4]               +- Project [value#8 AS id#4]
!      +- LocalRelation [value#1]              +- LocalRelation [value#8]
```
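What the rule should guarantee can be modeled with a toy sketch (plain integers standing in for expression IDs; nothing here is Spark's actual implementation): every occurrence of a conflicting ID in a later branch gets the same fresh replacement, which is what the plan diff above fails to do for `id#4`.

```python
from itertools import count

def deduplicate(branches):
    """Toy model of DeduplicateRelations' intent: if an attribute ID already
    appears in an earlier branch, rewrite EVERY occurrence of it in the later
    branch with one fresh ID. Fresh IDs start at 100 for readability only."""
    fresh = count(100)
    seen, result = set(), []
    for attrs in branches:
        mapping, renamed = {}, []
        for attr_id in attrs:
            if attr_id in seen:
                if attr_id not in mapping:
                    mapping[attr_id] = next(fresh)
                renamed.append(mapping[attr_id])
            else:
                renamed.append(attr_id)
        seen.update(renamed)
        result.append(renamed)
    return result

print(deduplicate([[1, 4], [1, 4]]))  # [[1, 4], [100, 101]]
```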
[GitHub] [spark] HyukjinKwon commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
HyukjinKwon commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319432329 I think mlflow got upgraded together for some reason.
[GitHub] [spark] HeartSaVioR commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319473845 cc. @cloud-fan
[GitHub] [spark] LuciferYang commented on pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
LuciferYang commented on PR #38690: URL: https://github.com/apache/spark/pull/38690#issuecomment-1319473917 Thanks @HyukjinKwon
[GitHub] [spark] itholic commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on code in PR #38644: URL: https://github.com/apache/spark/pull/38644#discussion_r1025941541

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala:

```
@@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with QueryErrorsBase {
       Decimal("12345678901234567890123456789012345678"))
     checkExceptionInExpression[ArithmeticException](
       cast("123456789012345678901234567890123456789", DecimalType(38, 0)),
-      "Out of decimal type range")
+      "NUMERIC_OUT_OF_SUPPORTED_RANGE")
```

Review Comment: Let me just test with `checkExceptionInExpression` for now, since it failed in CI when using `checkError`: `checkError` doesn't yet support the various test modes such as "non-codegen mode", "codegen mode", and "unsafe mode". I think we might need a similar test util for expression tests, something like `checkErrorInExpression`.
[GitHub] [spark] Yikun commented on pull request #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1319485275 cc @HyukjinKwon @xinrong-meng
[GitHub] [spark] itholic commented on pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on PR #38644: URL: https://github.com/apache/spark/pull/38644#issuecomment-1319485107 cc @MaxGekk @srielau
[GitHub] [spark] LuciferYang commented on pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1
LuciferYang commented on PR #38675: URL: https://github.com/apache/spark/pull/38675#issuecomment-1319538214 Passes on Scala 2.12 and 2.13.
[GitHub] [spark] sadikovi commented on a diff in pull request #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
sadikovi commented on code in PR #38700: URL: https://github.com/apache/spark/pull/38700#discussion_r1025989704

## python/pyspark/serializers.py:

```
@@ -357,7 +358,7 @@ def dumps(self, obj):
         return obj


-if sys.version_info < (3, 8):
+if sys.version_info < (3, 8) or os.environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1":
```

Review Comment: nit: I was thinking about a name like this one: `PYSPARK_ENABLE_NAMEDTUPLE_PATCH`, but I am open to alternatives.

## python/pyspark/serializers.py:

```
@@ -54,6 +54,7 @@
 """

 import sys
+import os
```

Review Comment: nit: should os come before sys?
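The toggle under review reduces to a small predicate; here is a sketch under the assumption that the variable keeps the name from the diff (the review suggests `PYSPARK_ENABLE_NAMEDTUPLE_PATCH` as an alternative, so the name may change):

```python
import sys

def namedtuple_patch_enabled(environ):
    """True when the namedtuple serialization hack should be applied: always on
    Python < 3.8, or when the opt-in environment variable is set to "1".
    The variable name is taken from the diff above and is still under review."""
    return sys.version_info < (3, 8) or environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1"

print(namedtuple_patch_enabled({"PYSPARK_MONKEYPATCH_NAMEDTUPLE": "1"}))  # True
```

Passing the environment mapping as an argument (instead of reading `os.environ` directly) makes the predicate easy to test without mutating process state.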
[GitHub] [spark] toujours33 opened a new pull request, #38709: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 opened a new pull request, #38709: URL: https://github.com/apache/spark/pull/38709

### What changes were proposed in this pull request?
ExecutorAllocationManager only keeps a count of speculative tasks: `stageAttemptToNumSpeculativeTasks` is incremented when a speculative task is submitted and decremented only when a speculative task ends. If the original task finishes before its speculative copy starts, the speculative task will never be scheduled, which leaks entries in `stageAttemptToNumSpeculativeTasks` and misleads the calculation of target executors. This PR fixes the issue by adding the task index to the `SparkListenerSpeculativeTaskSubmitted` event and recording speculative tasks by task index, so that when the task finishes, its pending speculative copy is also decremented.

### Why are the changes needed?
To fix idle executors caused by pending speculative tasks for tasks that have already finished.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a comprehensive unit test. Pass the GA
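The bookkeeping change can be illustrated with a toy model (all names here are hypothetical, not the actual ExecutorAllocationManager API): tracking speculative tasks by index lets the original task's completion retire a speculative copy that was never scheduled, where a bare counter would leak.

```python
class SpeculationTracker:
    """Toy model of the fix: record pending speculative tasks by task index
    (not just a count), so a finishing task can also retire its unscheduled
    speculative copy."""

    def __init__(self):
        self.pending_speculative = set()  # task indices awaiting a speculative run

    def on_speculative_task_submitted(self, task_index):
        self.pending_speculative.add(task_index)

    def on_task_end(self, task_index):
        # Covers both cases: the speculative task itself ending, and the
        # original task finishing before its speculative copy was scheduled.
        self.pending_speculative.discard(task_index)

    def target_extra_executors(self, tasks_per_executor=1):
        return len(self.pending_speculative) // tasks_per_executor

tracker = SpeculationTracker()
tracker.on_speculative_task_submitted(7)
tracker.on_task_end(7)  # original finished first; no leaked demand remains
```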
[GitHub] [spark] HyukjinKwon closed pull request #38681: [SPARK-41165][CONNECT] Avoid hangs in the arrow collect code path
HyukjinKwon closed pull request #38681: [SPARK-41165][CONNECT] Avoid hangs in the arrow collect code path URL: https://github.com/apache/spark/pull/38681
[GitHub] [spark] HyukjinKwon closed pull request #38694: [SPARK-41184][CONNECT] Disable flakey Fill.NA tests
HyukjinKwon closed pull request #38694: [SPARK-41184][CONNECT] Disable flakey Fill.NA tests URL: https://github.com/apache/spark/pull/38694
[GitHub] [spark] HyukjinKwon commented on pull request #38681: [SPARK-41165][CONNECT] Avoid hangs in the arrow collect code path
HyukjinKwon commented on PR #38681: URL: https://github.com/apache/spark/pull/38681#issuecomment-1319402787 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #38694: [SPARK-41184][CONNECT] Disable flakey Fill.NA tests
HyukjinKwon commented on PR #38694: URL: https://github.com/apache/spark/pull/38694#issuecomment-1319402627 Merged to master.
[GitHub] [spark] bersprockets opened a new pull request, #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`
bersprockets opened a new pull request, #38697: URL: https://github.com/apache/spark/pull/38697 Backport of #38635

### What changes were proposed in this pull request?
When a user specifies a null format in `to_number`/`try_to_number`, return `null`, with a data type of `DecimalType.USER_DEFAULT`, rather than throwing a `NullPointerException`. Also, since the code for `ToNumber` and `TryToNumber` is virtually identical, put all common code in new abstract class `ToNumberBase` to avoid fixing the bug in two places.

### Why are the changes needed?
`to_number`/`try_to_number` currently throws a `NullPointerException` when the format is `null`:
```
spark-sql> SELECT to_number('454', null);
org.apache.spark.SparkException: The Spark SQL phase analysis failed with an internal error. Please, fill a bug report in, and provide the full stack trace.
  at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:500)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
  ...
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat$lzycompute(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter$lzycompute(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.checkInputDataTypes(numberFormatExpressions.scala:81)
```
Also:
```
spark-sql> SELECT try_to_number('454', null);
org.apache.spark.SparkException: The Spark SQL phase analysis failed with an internal error. Please, fill a bug report in, and provide the full stack trace.
  at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:500)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
  ...
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat$lzycompute(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter$lzycompute(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.checkInputDataTypes(numberFormatExpressions.scala:81)
  at org.apache.spark.sql.catalyst.expressions.TryToNumber.checkInputDataTypes(numberFormatExpressions.scala:146)
```
Compare to `to_binary` and `try_to_binary`:
```
spark-sql> SELECT to_binary('abc', null);
NULL
Time taken: 3.111 seconds, Fetched 1 row(s)
spark-sql> SELECT try_to_binary('abc', null);
NULL
Time taken: 0.06 seconds, Fetched 1 row(s)
```
Also compare to `to_number` in PostgreSQL 11.18:
```
SELECT to_number('454', null) is null as a;
a
true
```

### Does this PR introduce _any_ user-facing change?
`to_number`/`try_to_number` with a null format will now return `null` with a data type of `DecimalType.USER_DEFAULT`.

### How was this patch tested?
New unit test.
[GitHub] [spark] Yikun opened a new pull request, #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun opened a new pull request, #38698: URL: https://github.com/apache/spark/pull/38698

### What changes were proposed in this pull request?
This patch replaces `list_run_infos` with `search_runs` in the mlflow doctest. Since mlflow 1.29.0 (https://github.com/mlflow/mlflow/commit/fb9abf98595c1b32117aac90b0b2170ce120cd9f), `list_run_infos` is deprecated and replaced by `search_runs`.

### Why are the changes needed?
We hit this error after the infra upgrade: https://github.com/apache/spark/pull/38611#issuecomment-1319430302

### Does this PR introduce _any_ user-facing change?
No, test only.

### How was this patch tested?
Test passed with mlflow 1.30.0 and 2.0.1.
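The mlflow change follows a common deprecation pattern: the old method is kept for a while as a thin wrapper that warns and forwards to the new one. A minimal sketch of that pattern, using a hypothetical `Client` class rather than mlflow's real implementation:

```python
import warnings


class Client:
    """Toy client showing how a list_run_infos-style deprecation works."""

    def search_runs(self, experiment_ids):
        # The new, supported entry point takes a list of experiment ids.
        return [f"run-info-for-{exp_id}" for exp_id in experiment_ids]

    def list_run_infos(self, experiment_id):
        # The deprecated method warns, then delegates to search_runs.
        warnings.warn(
            "list_run_infos is deprecated; use search_runs instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.search_runs([experiment_id])
```

Callers that migrate, as the doctest does here, simply switch to `search_runs` and stop triggering the warning.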
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025982465

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: https://github.com/apache/spark/pull/38704 ignores this case by default
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
HyukjinKwon commented on code in PR #38700: URL: https://github.com/apache/spark/pull/38700#discussion_r1026007445

## python/pyspark/serializers.py:
@@ -357,7 +358,7 @@ def dumps(self, obj):
         return obj

-if sys.version_info < (3, 8):
+if sys.version_info < (3, 8) or os.environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1":

Review Comment:
```suggestion
if sys.version_info < (3, 8) or os.environ.get("PYSPARK_ENABLE_NAMEDTUPLE_PATCH") == "1":
```

## python/pyspark/serializers.py:
@@ -471,7 +472,7 @@ def loads(self, obj, encoding="bytes"):
         return cloudpickle.loads(obj, encoding=encoding)

-if sys.version_info < (3, 8):
+if sys.version_info < (3, 8) or os.environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1":

Review Comment:
```suggestion
if sys.version_info < (3, 8) or os.environ.get("PYSPARK_ENABLE_NAMEDTUPLE_PATCH") == "1":
```
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
HyukjinKwon commented on code in PR #38700: URL: https://github.com/apache/spark/pull/38700#discussion_r1026007386

## python/pyspark/serializers.py:
@@ -54,6 +54,7 @@
 """

 import sys
+import os

Review Comment: eh, it's actually fine in Python import (per PEP 8)
[GitHub] [spark] HeartSaVioR closed pull request #38680: [SPARK-40657][PROTOBUF][FOLLOWUP][MINOR] Add clarifying comment in ProtobufUtils
HeartSaVioR closed pull request #38680: [SPARK-40657][PROTOBUF][FOLLOWUP][MINOR] Add clarifying comment in ProtobufUtils URL: https://github.com/apache/spark/pull/38680
[GitHub] [spark] LuciferYang commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
LuciferYang commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1319466822

Thanks @HyukjinKwon @grundprinzip @amaliujia ~
[GitHub] [spark] HeartSaVioR commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319473783

Maybe it is simpler to apply `KnownNullable` / `KnownNotNull` against `CreateStruct` to enforce the desired nullability? Please refer to the change in #35543.
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025954405

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: @liuzqt Could you tell me how to run the local test successfully with the Spark default Java options?
[GitHub] [spark] panbingkun opened a new pull request, #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042
panbingkun opened a new pull request, #38707: URL: https://github.com/apache/spark/pull/38707

### What changes were proposed in this pull request?
In the PR, I propose to assign a name to the error class _LEGACY_ERROR_TEMP_1042.

### Why are the changes needed?
Proper names of error classes should improve user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] LuciferYang commented on pull request #37725: [DO-NOT-MERGE] Exceptions without error classes in SQL golden files
LuciferYang commented on PR #37725: URL: https://github.com/apache/spark/pull/37725#issuecomment-1319616814

- SPARK-41173: https://github.com/apache/spark/pull/38705
[GitHub] [spark] Yikun commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
Yikun commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319457396

Yes, the new version of mlflow had some breaking changes; you could first install memory-profiler at the end of the dockerfile (like connect). I will find some time today to fix the doctest for the new mlflow, in a separate PR.
[GitHub] [spark] cloud-fan commented on pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan commented on PR #38691: URL: https://github.com/apache/spark/pull/38691#issuecomment-1319496802

thanks for review, merging to master!
[GitHub] [spark] wineternity opened a new pull request, #38702: SPARK-41187 [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
wineternity opened a new pull request, #38702: URL: https://github.com/apache/spark/pull/38702

### What changes were proposed in this pull request?
Ignore the SparkListenerTaskEnd with reason "Resubmitted" to avoid a memory leak.

### Why are the changes needed?
For a long-running Spark thriftserver, LiveExecutor objects will accumulate in the deadExecutors HashMap and cause the message event queue to be processed slowly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT added. Tested in a thriftserver env.

### The way to reproduce
I tried to reproduce it in spark-shell, but it is a bit involved:
1. Start spark-shell, set spark.dynamicAllocation.maxExecutors=2 for convenience: `bin/spark-shell --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8006"`
2. Run a job with shuffle: `sc.parallelize(1 to 1000, 10).map { x => Thread.sleep(1000) ; (x % 3, x) }.reduceByKey((a, b) => a + b).collect()`
3. After some ShuffleMapTasks finish, kill one or two executors to let tasks be resubmitted
4. Check by heap dump or debugger
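The leak mechanism can be modeled in a few lines of Python: a "Resubmitted" task-end event duplicates an end event that was already counted when the executor was lost, so processing it would corrupt the executor's active-task count and the cleanup condition for dead executors would never fire. This is a simplified model, not Spark's actual `AppStatusListener`:

```python
class StatusListener:
    """Simplified model of per-executor task bookkeeping."""

    def __init__(self):
        self.active_tasks = {}       # executor_id -> running task count
        self.dead_executors = set()  # removed executors awaiting cleanup

    def on_task_start(self, executor_id):
        self.active_tasks[executor_id] = self.active_tasks.get(executor_id, 0) + 1

    def on_task_end(self, executor_id, reason):
        if reason == "Resubmitted":
            # The fix: this task already produced a real end event when its
            # executor was lost; counting it again would corrupt the count
            # and keep the dead executor's entry alive forever.
            return
        self.active_tasks[executor_id] -= 1
        self._maybe_clean(executor_id)

    def on_executor_removed(self, executor_id):
        self.dead_executors.add(executor_id)
        self._maybe_clean(executor_id)

    def _maybe_clean(self, executor_id):
        # A dead executor is only dropped once all its tasks have ended.
        if executor_id in self.dead_executors and self.active_tasks.get(executor_id) == 0:
            self.dead_executors.discard(executor_id)
            del self.active_tasks[executor_id]
```

With the early return in place, a late "Resubmitted" event is a no-op and the dead executor's state is still released.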
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025960041

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: OK
[GitHub] [spark] itholic commented on pull request #38702: SPARK-41187 [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
itholic commented on PR #38702: URL: https://github.com/apache/spark/pull/38702#issuecomment-1319566712

Can we change the JIRA format in the title to something like "[SPARK-41187][CORE] ..."? Checking the [Spark contribution guide](https://spark.apache.org/contributing.html) would also be helpful!
[GitHub] [spark] LuciferYang opened a new pull request, #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` for testing to `protobuf` module
LuciferYang opened a new pull request, #38708: URL: https://github.com/apache/spark/pull/38708

### What changes were proposed in this pull request?
This pr adds a `log4j2.properties` file for testing to the `protobuf` module, as the other modules have.

### Why are the changes needed?
The test log should be printed to `target/unit-tests.log` as in other modules.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass Github Actions
- Manual test run
```
build/mvn clean install -DskipTests -pl connector/protobuf -am
build/mvn test -pl connector/protobuf
```
**Before** The log is printed to the console.
**After** The log is printed to `target/unit-tests.log`.
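A minimal test-scope `log4j2.properties` of the kind this PR adds might look like the following. This is a sketch based on the convention of routing test logs to `target/unit-tests.log`; the exact appender settings in Spark's modules may differ:

```properties
# Route all test logging to a file appender writing target/unit-tests.log
rootLogger.level = info
rootLogger.appenderRef.file.ref = File

appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```

Without such a file on the test classpath, Log4j 2 falls back to its default console configuration, which is why the logs previously went to the console.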
[GitHub] [spark] Yaohua628 commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
Yaohua628 commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319636434

> If it has been persisted before (like a table), then it's totally fine to write non-nullable data to a nullable column. The optimizer may also optimize a column from nullable to non-nullable, so this will happen from time to time.

Got it, that makes sense! Updated
[GitHub] [spark] zhengruifeng opened a new pull request, #38695: [TEST ONLY][DO NOT MERGE]. Test the schema of `collect`
zhengruifeng opened a new pull request, #38695: URL: https://github.com/apache/spark/pull/38695

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] HyukjinKwon commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
HyukjinKwon commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319430302

The test failure this time seems different:
```
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 168, in pyspark.pandas.mlflow.load_model
Failed example:
    run_info = client.list_run_infos(exp_id)[-1]
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        run_info = client.list_run_infos(exp_id)[-1]
    AttributeError: 'MlflowClient' object has no attribute 'list_run_infos'
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 169, in pyspark.pandas.mlflow.load_model
Failed example:
    model = load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid))
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        model = load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid))
    NameError: name 'run_info' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 171, in pyspark.pandas.mlflow.load_model
Failed example:
    prediction_df["prediction"] = model.predict(prediction_df)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        prediction_df["prediction"] = model.predict(prediction_df)
    NameError: name 'model' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model
Failed example:
    prediction_df
Expected:
    x1   x2   prediction
    0    2.0  4.0    1.31
Got:
    x1   x2
    0    2.0  4.0
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 178, in pyspark.pandas.mlflow.load_model
Failed example:
    model.predict(prediction_df[["x1", "x2"]].to_pandas())
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        model.predict(prediction_df[["x1", "x2"]].to_pandas())
    NameError: name 'model' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 189, in pyspark.pandas.mlflow.load_model
Failed example:
    y = model.predict(features)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        y = model.predict(features)
    NameError: name 'model' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 198, in pyspark.pandas.mlflow.load_model
Failed example:
    features['y'] = y
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        features['y'] = y
    NameError: name 'y' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 200, in pyspark.pandas.mlflow.load_model
Failed example:
    everything
Expected:
    x1   x2   z    y
    0    2.0  3.0  -1  1.376932
Got:
    x1   x2   z
    0    2.0  3.0  -1
**
8 of 26 in pyspark.pandas.mlflow.load_model
```
[GitHub] [spark] HyukjinKwon commented on pull request #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
HyukjinKwon commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1319487583

cc @harupy @WeichenXu123 FYI
[GitHub] [spark] HyukjinKwon opened a new pull request, #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
HyukjinKwon opened a new pull request, #38700: URL: https://github.com/apache/spark/pull/38700

### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/34688 that adds a switch to turn on and off the namedtuple hack.

### Why are the changes needed?
There are still behaviour differences between regular pickle and Cloudpickle, e.g., bug fixes from the upstream. It's safer to have a switch to turn it on and off for the time being.

### Does this PR introduce _any_ user-facing change?
This remains an internal environment variable, so ideally no. In fact, the main change itself was an internal change too.

### How was this patch tested?
Manually tested.
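The gating condition this PR introduces can be factored into a tiny predicate, sketched below in plain Python. The helper name is made up for illustration, and the environment variable name matches the one suggested in review on this thread, though the name that finally lands in Spark may differ:

```python
import os
import sys


def should_patch_namedtuple(version_info=None, env=None):
    """Decide whether to apply the namedtuple serialization hack.

    Patch on Python < 3.8, or when the escape-hatch variable is set,
    so users can re-enable the old behaviour on newer Pythons.
    """
    version_info = sys.version_info if version_info is None else version_info
    env = os.environ if env is None else env
    return version_info < (3, 8) or env.get("PYSPARK_ENABLE_NAMEDTUPLE_PATCH") == "1"
```

Making the check a pure function of its inputs also makes the switch easy to unit-test without touching the real process environment.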
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025955547

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: @HyukjinKwon @mridulm @jiangxb1987 @sadikovi @Ngone51 I failed to run this case locally with the Spark default Java options. Could you run it successfully? Or should we ignore this case by default?
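The `assume(!sys.env.contains("GITHUB_ACTIONS"))` guard being discussed has a direct analogue in Python's unittest. A sketch, where the test body is a placeholder rather than the real 2 GB collect:

```python
import os
import unittest


def in_github_actions(env=None):
    # GitHub Actions sets this variable in every runner.
    return "GITHUB_ACTIONS" in (os.environ if env is None else env)


class LargeResultSuite(unittest.TestCase):
    def test_collect_over_2gb(self):
        if in_github_actions():
            # Mirrors ScalaTest's assume(): mark as skipped, not failed.
            self.skipTest("needs a large heap; verify in a local build")
        # ... the real large-collect assertions would go here ...
```

Like `assume`, `skipTest` reports the case as skipped rather than passed or failed, so CI results stay honest about what was actually exercised.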
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025953966

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: hmm... I tested this suite with **Java 8/11/17** on Linux / macOS on Apple Silicon with the following commands:

- Maven:
```
build/mvn clean install -DskipTests -pl sql/core -am
build/mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
and
```
dev/change-scala-version.sh 2.13
build/mvn clean install -DskipTests -pl sql/core -am -Pscala-2.13
build/mvn clean test -pl sql/core -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
- SBT:
```
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite"
```
```
dev/change-scala-version.sh 2.13
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite" -Pscala-2.13
```

All tests failed with `java.lang.OutOfMemoryError: Java heap space` as follows:
```
10:19:56.910 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1$adapted(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$$$Lambda$2321/1995130077.apply(Unknown Source)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
	at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
	at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1(Utils.scala:271)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1$adapted(Utils.scala:271)
	at org.apache.spark.util.Utils$$$Lambda$2324/69671223.apply(Unknown Source)
	at org.apache.spark.util.Utils$.writeByteBufferImpl(Utils.scala:249)
	at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:271)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2$adapted(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer$$Lambda$2323/1073743200.apply(Unknown Source)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.util.io.ChunkedByteBuffer.writeExternal(ChunkedByteBuffer.scala:103)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:599)
```
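The `ChunkedByteBuffer` frames in the stack trace illustrate the core idea of the PR: serialize into a list of bounded chunks instead of one contiguous array, since a single JVM `byte[]` cannot exceed roughly 2 GB. A toy Python model of that write path, with the chunk size shrunk for demonstration and no claim to match Spark's actual implementation:

```python
class ChunkedBuffer:
    """Accumulate bytes in fixed-size chunks rather than one big array."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        for byte in data:
            if len(self.chunks[-1]) == self.chunk_size:
                # Current chunk is full: allocate a fresh one instead of
                # growing a single ever-larger array.
                self.chunks.append(bytearray())
            self.chunks[-1].append(byte)

    def to_bytes(self):
        # Only for small demos; the point of chunking is to avoid ever
        # materializing the data as one contiguous buffer.
        return b"".join(bytes(c) for c in self.chunks)
```

Each chunk stays under the fixed bound, so total capacity is limited by available memory rather than by the maximum size of one allocation.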
[GitHub] [spark] dongjoon-hyun commented on pull request #38262: [SPARK-40801][BUILD] Upgrade `Apache commons-text` to 1.10
dongjoon-hyun commented on PR #38262: URL: https://github.com/apache/spark/pull/38262#issuecomment-1319552102

The feature release branches like branch-3.3 will, generally, be maintained with bug fix releases for a period of 18 months. We usually have 3 bug fix releases. Since 3.3.1 was released on Oct 25, 3.3.2 will be February or March 2023. Apache Spark 3.4.0 preparation is going to happen in the same timeframe, so the v3.3.2 schedule might be adjusted accordingly.

https://spark.apache.org/versioning-policy.html

![Screenshot 2022-11-17 at 9 03 13 PM](https://user-images.githubusercontent.com/9700541/202622566-33d8a868-3f98-46a3-9c70-f1b328ff29b8.png)
[GitHub] [spark] HyukjinKwon commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
HyukjinKwon commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1319422162 Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
HyukjinKwon closed pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect. URL: https://github.com/apache/spark/pull/38609
[GitHub] [spark] panbingkun opened a new pull request, #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078
panbingkun opened a new pull request, #38696: URL: https://github.com/apache/spark/pull/38696

### What changes were proposed in this pull request?
In the PR, I propose to assign a name to the error class _LEGACY_ERROR_TEMP_1078.

### Why are the changes needed?
Proper names of error classes should improve the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
> $ build/sbt "catalyst/testOnly *SessionCatalogSuite"
[GitHub] [spark] LuciferYang commented on pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
LuciferYang commented on PR #38690: URL: https://github.com/apache/spark/pull/38690#issuecomment-1319467631 cc @HyukjinKwon
[GitHub] [spark] itholic commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on code in PR #38644: URL: https://github.com/apache/spark/pull/38644#discussion_r1025941541

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala:

@@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with QueryErrorsBase {
       Decimal("12345678901234567890123456789012345678"))
     checkExceptionInExpression[ArithmeticException](
       cast("123456789012345678901234567890123456789", DecimalType(38, 0)),
-      "Out of decimal type range")
+      "NUMERIC_OUT_OF_SUPPORTED_RANGE")

Review Comment: Let me just test with `checkExceptionInExpression` for now, since it failed in CI when using `checkError`: `checkError` doesn't yet support the various test modes such as "non-codegen mode", "codegen mode", and "unsafe mode" (so it failed in CI: https://github.com/apache/spark/pull/38644#discussion_r1021424319). I think we might need a similar test util for expression tests, something like `checkErrorInExpression`.
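itholic's point — that `checkError` lacks the multi-mode coverage of `checkExceptionInExpression` — boils down to running the same error-class assertion once per evaluation mode. A minimal Python sketch of that idea (the mode names come from the comment above; the helper and the exception type are hypothetical, not Spark's actual test utilities):

```python
MODES = ["non-codegen", "codegen", "unsafe"]

class SparkError(Exception):
    """Toy stand-in for an error-class-carrying exception (hypothetical)."""
    def __init__(self, error_class):
        super().__init__(error_class)
        self.error_class = error_class

def check_error_in_expression(evaluate, expected_class):
    # Run the same assertion once per evaluation mode, mirroring how
    # checkExceptionInExpression exercises every mode.
    for mode in MODES:
        try:
            evaluate(mode)
        except SparkError as e:
            assert e.error_class == expected_class, f"wrong class in {mode} mode"
        else:
            raise AssertionError(f"no error raised in {mode} mode")

# Toy expression that overflows the same way in every mode.
def cast_overflow(mode):
    raise SparkError("NUMERIC_OUT_OF_SUPPORTED_RANGE")

check_error_in_expression(cast_overflow, "NUMERIC_OUT_OF_SUPPORTED_RANGE")
```

The real utility would additionally have to wire each mode into the expression evaluator; the sketch only shows the loop-over-modes shape.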
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025953966

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:

@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: hmm... I tested this suite with Java 8/11/17 on Linux/macOS on Apple Silicon with the following commands:

- Maven:
```
build/mvn clean install -DskipTests -pl sql/core -am
build/mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
and
```
dev/change-scala-version.sh 2.13
build/mvn clean install -DskipTests -pl sql/core -am -Pscala-2.13
build/mvn clean test -pl sql/core -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
- SBT:
```
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite"
```
```
dev/change-scala-version.sh 2.13
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite" -Pscala-2.13
```

All tests failed with `java.lang.OutOfMemoryError: Java heap space` as follows:

```
10:19:56.910 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1$adapted(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$$$Lambda$2321/1995130077.apply(Unknown Source)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
	at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
	at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1(Utils.scala:271)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1$adapted(Utils.scala:271)
	at org.apache.spark.util.Utils$$$Lambda$2324/69671223.apply(Unknown Source)
	at org.apache.spark.util.Utils$.writeByteBufferImpl(Utils.scala:249)
	at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:271)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2$adapted(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer$$Lambda$2323/1073743200.apply(Unknown Source)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.util.io.ChunkedByteBuffer.writeExternal(ChunkedByteBuffer.scala:103)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:599)
```
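The stack trace above goes through `ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded`, which is the mechanism that lets a serialized task result exceed the JVM's single-array cap (a Java byte array index is an `int`, so one array tops out near 2 GB): output is spread across fixed-size chunks instead of one giant buffer. A rough Python sketch of that chunking idea (illustrative only, not Spark's implementation):

```python
class ChunkedBufferOutputStream:
    """Minimal sketch of chunked buffering: data is appended to a list of
    fixed-size bytearray chunks, so no single allocation has to hold the
    whole serialized result."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = []
        self.pos = 0  # write position inside the last chunk

    def _allocate_new_chunk_if_needed(self):
        # Mirrors the allocateNewChunkIfNeeded step in the stack trace above.
        if not self.chunks or self.pos == self.chunk_size:
            self.chunks.append(bytearray(self.chunk_size))
            self.pos = 0

    def write(self, data: bytes):
        offset = 0
        while offset < len(data):
            self._allocate_new_chunk_if_needed()
            n = min(self.chunk_size - self.pos, len(data) - offset)
            self.chunks[-1][self.pos:self.pos + n] = data[offset:offset + n]
            self.pos += n
            offset += n

    def size(self):
        return (len(self.chunks) - 1) * self.chunk_size + self.pos if self.chunks else 0

out = ChunkedBufferOutputStream(chunk_size=4)
out.write(b"0123456789")
assert len(out.chunks) == 3 and out.size() == 10
```

The OOM LuciferYang hits is then not about the chunking itself but about the total heap needed to hold all chunks at once, which is why the suite demands a large local heap.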
[GitHub] [spark] zhengruifeng opened a new pull request, #38701: [TEST ONLY][DO NOT MERGE] Test collect after avoiding hang with arrow-collect
zhengruifeng opened a new pull request, #38701: URL: https://github.com/apache/spark/pull/38701

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] mcdull-zhang opened a new pull request, #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist
mcdull-zhang opened a new pull request, #38703: URL: https://github.com/apache/spark/pull/38703

### What changes were proposed in this pull request?
For example, consider the following statements:
```sql
cache table t1 as select a from testData3 group by a;
cache table t2 as select a,b from testData2 where a in (select a from t1);
select key,value,b from testData t3 join t2 on t3.key=t2.a;
```
Before this PR, the cached t2 is not used in the third statement:
```tex
Project [key#13, value#14, b#24]
+- SortMergeJoin [key#13], [a#23], Inner
   :- BroadcastHashJoin [key#13], [a#359], LeftSemi, BuildRight, false
   :  :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  :  +- Scan[obj#12]
   :  +- Scan In-memory table t1 [a#359]
   :     +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
   :        +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
   :           +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
   :              +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
   :                 +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
   :                    +- Scan[obj#32]
   +- BroadcastHashJoin [a#23], [a#359], LeftSemi, BuildRight, false
      :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
      :  +- Scan[obj#22]
      +- Scan In-memory table t1 [a#359]
         +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
               +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                  +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                     +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                        +- Scan[obj#32]
```
After this PR:
```tex
Project [key#13, value#14, b#358]
+- BroadcastHashJoin [key#13], [a#357], Inner, BuildRight, false
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- Scan[obj#12]
   +- Scan In-memory table t2 [a#357, b#358]
      +- InMemoryRelation [a#357, b#358], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) BroadcastHashJoin [a#23], [a#261], LeftSemi, BuildRight, false
            :- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
            :  +- Scan[obj#22]
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=155]
               +- Scan In-memory table t1 [a#261]
                  +- InMemoryRelation [a#261], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                        +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                           +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                                 +- Scan[obj#32]
```

### Why are the changes needed?
Performance improvement.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a test.
[GitHub] [spark] Yaohua628 commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
Yaohua628 commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319529815

> Maybe simpler to apply KnownNullable / KnownNotNull against CreateStruct to enforce desired nullability? Please refer the change in https://github.com/apache/spark/pull/35543.

Wow, good point, thanks!
[GitHub] [spark] HyukjinKwon closed pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes
HyukjinKwon closed pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes URL: https://github.com/apache/spark/pull/38638
[GitHub] [spark] HyukjinKwon commented on pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes
HyukjinKwon commented on PR #38638: URL: https://github.com/apache/spark/pull/38638#issuecomment-1319539398 Merged to master.
[GitHub] [spark] MaxGekk commented on a diff in pull request #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
MaxGekk commented on code in PR #38576: URL: https://github.com/apache/spark/pull/38576#discussion_r1026031578

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

@@ -1059,10 +1060,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
   def failOnInvalidOuterReference(p: LogicalPlan): Unit = {
     p.expressions.foreach(checkMixedReferencesInsideAggregateExpr)
     if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
+      val exprs = new ListBuffer[String]()
+      for (expr <- p.expressions) {

Review Comment: This can be simpler using `filter` instead of `exists` and `for`.

@@ -1059,10 +1060,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
   def failOnInvalidOuterReference(p: LogicalPlan): Unit = {
     p.expressions.foreach(checkMixedReferencesInsideAggregateExpr)
     if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
+      val exprs = new ListBuffer[String]()
+      for (expr <- p.expressions) {
+        if (containsOuter(expr)) {
+          exprs += toPrettySQL(expr)

Review Comment: Use `toSQLExpr`.
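MaxGekk's first suggestion replaces an existence check plus a mutable-buffer loop with a single filter-and-map pass. The same refactoring shape, sketched in Python (the predicate and formatter are hypothetical stand-ins for `containsOuter` and `toSQLExpr`):

```python
def contains_outer(expr):
    # stand-in predicate for containsOuter (hypothetical)
    return expr.startswith("outer(")

def to_sql_expr(expr):
    # stand-in formatter for toSQLExpr (hypothetical)
    return f'"{expr}"'

expressions = ["outer(t1.a)", "t2.b", "outer(t1.c)"]

# Before: `exists(...)` guard, then a for loop appending into a ListBuffer.
# After: one filter/map pass; the non-empty result doubles as the guard.
outer_exprs = [to_sql_expr(e) for e in expressions if contains_outer(e)]
if outer_exprs:
    message_parameters = {"sqlExprs": ",".join(outer_exprs)}

assert message_parameters == {"sqlExprs": '"outer(t1.a)","outer(t1.c)"'}
```

Besides being shorter, the single pass avoids walking `p.expressions` twice (once for `exists`, once for the loop).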
[GitHub] [spark] xinrong-meng commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
xinrong-meng commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319423102 Do you happen to know if it's normal to fail so many times? @Yikun
[GitHub] [spark] cloud-fan closed pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan closed pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma URL: https://github.com/apache/spark/pull/38691
[GitHub] [spark] dongjoon-hyun commented on pull request #38262: [SPARK-40801][BUILD] Upgrade `Apache commons-text` to 1.10
dongjoon-hyun commented on PR #38262: URL: https://github.com/apache/spark/pull/38262#issuecomment-1319545728 Apache Spark has a pre-defined release cadence, @vitas and @bjornjorgensen. - https://spark.apache.org/versioning-policy.html ![Screenshot 2022-11-17 at 8 56 29 PM](https://user-images.githubusercontent.com/9700541/202620165-e2645b89-dc9f-4529-86f0-eba6d74f42f8.png)
[GitHub] [spark] cloud-fan commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
cloud-fan commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319609369 If it has been persisted before (like a table), then it's totally fine to write non-nullable data to a nullable column. The optimizer may also optimize a column from nullable to non-nullable, so this will happen from time to time.
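The rule cloud-fan describes can be stated simply: a write is legal whenever the target column is at least as permissive as the data's nullability, so non-nullable data flowing into a nullable column is always safe. A tiny sketch of that compatibility check (hypothetical helper, not Spark's actual code):

```python
def write_compatible(data_nullable: bool, column_nullable: bool) -> bool:
    # The only illegal case: the data may contain nulls but the
    # target column forbids them. Everything else widens safely.
    return column_nullable or not data_nullable

assert write_compatible(False, True)      # non-nullable data -> nullable column: ok
assert write_compatible(False, False)     # exact match: ok
assert write_compatible(True, True)       # exact match: ok
assert not write_compatible(True, False)  # nullable data -> non-nullable column: no
```

This is also why an optimizer pass that tightens a column from nullable to non-nullable never breaks an existing nullable sink.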
[GitHub] [spark] panbingkun opened a new pull request, #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092
panbingkun opened a new pull request, #38710: URL: https://github.com/apache/spark/pull/38710

### What changes were proposed in this pull request?
In the PR, I propose to assign a name to the error class _LEGACY_ERROR_TEMP_1092.

### Why are the changes needed?
Proper names of error classes should improve user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] dengziming commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
dengziming commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1024865486

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

@@ -213,7 +213,7 @@ message Deduplicate {
 message LocalRelation {
   repeated Expression.QualifiedAttribute attributes = 1;

Review Comment: I find we lack a `fromBatchWithSchemaIterator` method corresponding to `toBatchWithSchemaIterator`, so I will implement one.
[GitHub] [spark] ulysses-you commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
ulysses-you commented on PR #38687: URL: https://github.com/apache/spark/pull/38687#issuecomment-1318482662 cc @cloud-fan @huaxingao
[GitHub] [spark] itholic commented on a diff in pull request #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
itholic commented on code in PR #38576: URL: https://github.com/apache/spark/pull/38576#discussion_r1025069390

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

@@ -1059,8 +1059,8 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
     if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
       p.failAnalysis(
         errorClass =
-          "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE",
-        messageParameters = Map("treeNode" -> planToString(p)))
+          "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.CORRELATED_REFERENCE",
+        messageParameters = Map("sqlExprs" -> p.expressions.map(_.sql).mkString(",")))

Review Comment: Oh, I got it. Let me address it. Thanks!!
[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
EnricoMi commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1318255174 @wangyum @cloud-fan I would appreciate your suggestions on how to test this bug in `LeftSemiAntiJoinPushDownSuite` (see https://github.com/apache/spark/pull/38676#issuecomment-1317220559).
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes
HyukjinKwon commented on code in PR #38638: URL: https://github.com/apache/spark/pull/38638#discussion_r1025039998

## python/pyspark/sql/connect/dataframe.py:

@@ -667,12 +668,70 @@ def schema(self) -> StructType:
         else:
             return self._schema

-    def explain(self) -> str:
+    def explain(
+        self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None

Review Comment: Actually NVM. Let's just go ahead as is for now.
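For context on what the new `explain(extended, mode)` signature has to do, the two arguments ultimately collapse into a single explain-mode string. A rough sketch of that normalization (approximate and hypothetical; it only mirrors classic PySpark's behavior loosely, and the real logic lives in `DataFrame.explain`):

```python
def resolve_explain_mode(extended=None, mode=None):
    # Hedged sketch: normalize (extended, mode) into one mode string.
    if extended is not None and mode is not None:
        raise ValueError("extended and mode should not be set together")
    if mode is not None:
        return mode                    # explain(mode="formatted") style call
    if isinstance(extended, str):
        return extended                # explain("codegen") style call
    return "extended" if extended else "simple"

assert resolve_explain_mode() == "simple"
assert resolve_explain_mode(extended=True) == "extended"
assert resolve_explain_mode(mode="formatted") == "formatted"
```

Keeping the resolution in one helper lets the Connect client send a single mode value over the wire regardless of which calling convention the user picked.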
[GitHub] [spark] HyukjinKwon closed pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide
HyukjinKwon closed pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide URL: https://github.com/apache/spark/pull/38678
[GitHub] [spark] HyukjinKwon commented on pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide
HyukjinKwon commented on PR #38678: URL: https://github.com/apache/spark/pull/38678#issuecomment-1318461338 Merged to master.
[GitHub] [spark] LuciferYang commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
LuciferYang commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1318498330 Both the PR title and description have been updated, @grundprinzip. Thanks!
[GitHub] [spark] dengziming commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
dengziming commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1024864116

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

@@ -213,7 +213,7 @@ message Deduplicate {
 message LocalRelation {
   repeated Expression.QualifiedAttribute attributes = 1;
-  // TODO: support local data.
+  repeated bytes data = 2;

Review Comment: Thank you. I use `repeated bytes` in case the batch size is larger than maxRecordsPerBatch. I think it is enough to use `bytes` here since `LocalRelation` is mostly used in debugging cases.
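dengziming's design — one `bytes` entry per batch, split at `maxRecordsPerBatch` — can be sketched without Arrow at all. In this toy version `pickle` stands in for the actual batch serialization, and both directions are shown, the decode side playing the role of the `fromBatchWithSchemaIterator` counterpart mentioned above (all names here are illustrative, not Spark's):

```python
import pickle

MAX_RECORDS_PER_BATCH = 3  # stand-in for the batch-size limit

def to_batches(rows):
    # Slice the local data into batches and serialize each one
    # separately, producing the "repeated bytes" payload the
    # proto field would carry.
    return [
        pickle.dumps(rows[i:i + MAX_RECORDS_PER_BATCH])
        for i in range(0, len(rows), MAX_RECORDS_PER_BATCH)
    ]

def from_batches(payloads):
    # Decode each bytes entry and concatenate the rows back together.
    rows = []
    for p in payloads:
        rows.extend(pickle.loads(p))
    return rows

data = [(i, f"row-{i}") for i in range(7)]
payloads = to_batches(data)
assert len(payloads) == 3          # 3 + 3 + 1 rows
assert from_batches(payloads) == data
```

The round trip makes the `repeated` choice concrete: a local relation larger than one batch simply occupies several entries of the field, and the reader concatenates them in order.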
[GitHub] [spark] grundprinzip commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
grundprinzip commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1318500167 Thank you!
[GitHub] [spark] panbingkun opened a new pull request, #38688: [WIP][SPARK-41166][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites
panbingkun opened a new pull request, #38688: URL: https://github.com/apache/spark/pull/38688

### What changes were proposed in this pull request?
This PR aims to check the errorSubClass of DataTypeMismatch in *ExpressionSuites.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] grundprinzip commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Make user can build and test `connect` module by specifying the user-defined `protoc` and `protoc-gen-grpc
grundprinzip commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1318486206

Since I cannot directly edit the PR description or title, I would kindly ask you to make the following changes:

Title: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.

> ### What changes were proposed in this pull request?
> This PR adds a new profile named `-Puser-defined-protoc` so that users can build and test the `connect` module by specifying custom `protoc` and `protoc-gen-grpc-java` executables.
>
> ### Why are the changes needed?
> As described in [SPARK-40593](https://issues.apache.org/jira/browse/SPARK-40593), the latest versions of `protoc` and `protoc-gen-grpc-java` have minimum version requirements for basic libraries such as `glibc` and `glibcxx`. Because of that, it is not possible to compile the `connect` module out of the box on CentOS 6 or CentOS 7. Instead, the following error message is shown:
>
> ```
> [ERROR] PROTOC FAILED: /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> ```
>
> ### Does this PR introduce _any_ user-facing change?
> No; using the official pre-release `protoc` and `protoc-gen-grpc-java` binary files remains the default.
>
> ### How was this patch tested?
> * Pass GitHub Actions
> * Manual test on CentOS6u3 and CentOS7u4
>
> ```shell
> export CONNECT_PROTOC_EXEC_PATH=/path-to-protoc-exe
> export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
> ./build/mvn clean install -pl connector/connect -Puser-defined-protoc -am -DskipTests
> ./build/mvn clean test -pl connector/connect -Puser-defined-protoc
> ```
>
> and
>
> ```shell
> export CONNECT_PROTOC_EXEC_PATH=/path-to-protoc-exe
> export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
> ./build/sbt clean "connect/compile" -Puser-defined-protoc
> ./build/sbt "connect/test" -Puser-defined-protoc
> ```
[GitHub] [spark] pan3793 commented on pull request #38651: [SPARK-41136][K8S] Shorten graceful shutdown time of ExecutorPodsSnapshotsStoreImpl to prevent blocking shutdown process
pan3793 commented on PR #38651: URL: https://github.com/apache/spark/pull/38651#issuecomment-1318542102

@dongjoon-hyun @LuciferYang would you please take a look again?
[GitHub] [spark] beliefer opened a new pull request, #38689: [WIP][SPARK-41171][SQL] Push down filter through window when partitionSpec is empty
beliefer opened a new pull request, #38689: URL: https://github.com/apache/spark/pull/38689

### What changes were proposed in this pull request?
Sometimes a SQL query has a filter whose condition compares a rank-like window function with a number. For example:
`SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM Tab1 WHERE rn <= 5`
We can create a `Limit(5)` and push it down as the child of the `Window`:
`SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM (SELECT * FROM Tab1 ORDER BY a LIMIT 5) t`

### Why are the changes needed?
Improve performance.

### Does this PR introduce _any_ user-facing change?
'No'. Just updates the inner implementation.

### How was this patch tested?
New tests.
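The equivalence this rewrite relies on can be illustrated with plain Scala collections; this is only a sketch of the intuition, not the actual Catalyst rule:

```scala
// A rank-like window with an empty partition spec numbers rows in one global
// sort order, so `WHERE rn <= 5` keeps exactly the first five rows of that
// order. Numbering then filtering equals limiting then numbering.
val rows = Seq(7, 3, 9, 1, 5, 8, 2)

val filterAfterWindow = rows.sorted.zipWithIndex
  .map { case (v, i) => (v, i + 1) }   // ROW_NUMBER() OVER (ORDER BY a)
  .filter { case (_, rn) => rn <= 5 }  // the original filter on rn

val limitPushedDown = rows.sorted.take(5) // the new Limit(5) below the Window
  .zipWithIndex.map { case (v, i) => (v, i + 1) }
// Both produce (1,1), (2,2), (3,3), (5,4), (7,5).
```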
[GitHub] [spark] zhengruifeng opened a new pull request, #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng opened a new pull request, #38686: URL: https://github.com/apache/spark/pull/38686

### What changes were proposed in this pull request?
Implement `DataFrame.drop` with a proto message.

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added UT.
[GitHub] [spark] ulysses-you opened a new pull request, #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
ulysses-you opened a new pull request, #38687: URL: https://github.com/apache/spark/pull/38687

### What changes were proposed in this pull request?
Add TimeTravelSpec to the key of the relation cache in AnalysisContext.

### Why are the changes needed?
Correct the relation resolution for the same table but different TimeTravelSpec.

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
Added test.
[GitHub] [spark] mridulm commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
mridulm commented on PR #38560: URL: https://github.com/apache/spark/pull/38560#issuecomment-1318238822

I will try to get to this later this week; do let me know if you are still working on it, though.
[GitHub] [spark] mridulm commented on pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished
mridulm commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1318239964

I will try to get to this later this week; do let me know if you are still working on it or have pending comments to address. Thanks!
[GitHub] [spark] cloud-fan commented on a diff in pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order
cloud-fan commented on code in PR #38684: URL: https://github.com/apache/spark/pull/38684#discussion_r1024884862

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:

```
@@ -371,7 +371,8 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
     }
     val finalFilters = normalizedFilters.map(projectionFunc)
-    val withFilter = finalFilters.foldRight[LogicalPlan](scanRelation)((cond, plan) => {
+    // bottom-most filters are put in the left of the list.
+    val withFilter = finalFilters.foldLeft[LogicalPlan](scanRelation)((cond, plan) => {
```

Review Comment:
```suggestion
    val withFilter = finalFilters.foldLeft[LogicalPlan](scanRelation)((plan, cond) => {
```
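The suggested parameter swap matters because `foldLeft`'s combining function takes the accumulator first and the next element second — the reverse of `foldRight`. A minimal sketch with hypothetical stand-in plan nodes (not Spark's Catalyst classes):

```scala
// Simplified stand-ins for logical plan nodes, for illustration only.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: String, child: Plan) extends Plan

// Bottom-most filters come first in the list, as the review thread notes.
val finalFilters = Seq("a > 1", "b < 2")

// foldLeft passes (accumulatedPlan, nextCond), so the plan built so far must
// be the first lambda parameter:
val withFilter =
  finalFilters.foldLeft[Plan](Scan("t"))((plan, cond) => Filter(cond, plan))
// => Filter("b < 2", Filter("a > 1", Scan("t"))): "a > 1" ends up bottom-most.
```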
[GitHub] [spark] cloud-fan opened a new pull request, #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan opened a new pull request, #38691: URL: https://github.com/apache/spark/pull/38691

### What changes were proposed in this pull request?
This PR fixes a long-standing parser bug in Spark: `JOIN` should take precedence over the comma when combining relations. For example, `FROM t1, t2 JOIN t3` should result in `t1 X (t2 X t3)`. However, it is `(t1 X t2) X t3` today. You can easily verify this behavior in other databases by running the query `SELECT * FROM t1, t2 JOIN t3 ON t1.c and t3.c`. It should fail, as `t1.c` is not available when joining t2 and t3. I tested MySQL, Oracle and PostgreSQL; all of them fail, but Spark can run this query due to the wrong join order. However, the fix has a large impact: it changes the join order, which can lead to worse performance or unexpected analysis errors. To be safe, this PR hides the fix under a new ANSI sub-config.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No, as the fix is turned off by default.

### How was this patch tested?
New tests.
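The two groupings can be made explicit with a tiny relation tree; the classes below are illustrative stand-ins, not Spark's parser output:

```scala
sealed trait Rel
case class Table(name: String) extends Rel
case class Cross(left: Rel, right: Rel) extends Rel // cartesian product / join

val (t1, t2, t3) = (Table("t1"), Table("t2"), Table("t3"))

// What Spark's parser produces today for `FROM t1, t2 JOIN t3`:
val today = Cross(Cross(t1, t2), t3) // (t1 X t2) X t3
// What it should produce, with JOIN binding tighter than the comma:
val fixed = Cross(t1, Cross(t2, t3)) // t1 X (t2 X t3)
// In `fixed`, t1's columns are out of scope inside the inner join, which is
// why an ON condition referencing t1.c fails in other databases.
```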
[GitHub] [spark] LuciferYang opened a new pull request, #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
LuciferYang opened a new pull request, #38690: URL: https://github.com/apache/spark/pull/38690

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
LuciferYang commented on code in PR #38312: URL: https://github.com/apache/spark/pull/38312#discussion_r1025353875

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
```

Review Comment: Can we reuse https://github.com/apache/spark/blob/f01a8db4bcf09c4975029e85722053ff82f8a355/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala#L462-L467
[GitHub] [spark] cloud-fan commented on pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order
cloud-fan commented on PR #38684: URL: https://github.com/apache/spark/pull/38684#issuecomment-1318837045

Thanks for the review, merging to master!
[GitHub] [spark] cloud-fan commented on a diff in pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide
cloud-fan commented on code in PR #38678: URL: https://github.com/apache/spark/pull/38678#discussion_r1025196133

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

```
@@ -106,24 +113,39 @@ message Project {
   //
   // For example, `SELECT ABS(-1)` is valid plan without an input plan.
   Relation input = 1;
+
+  // (Required) A Project requires at least one expression.
   repeated Expression expressions = 3;
 }
 
 // Relation that applies a boolean expression `condition` on each row of `input` to produce
 // the output result.
 message Filter {
+  // (Required) Input relation for a Filter.
   Relation input = 1;
+
+  // (Required) A Filter must have a condition expression.
   Expression condition = 2;
 }
 
 // Relation of type [[Join]].
 //
 // `left` and `right` must be present.
 message Join {
+  // (Required) Left input relation for a Join.
   Relation left = 1;
+
+  // (Required) Right input relation for a Join.
   Relation right = 2;
+
+  // (Optional) The join condition. Could be unset when `using_columns` is utilized.
+  //
+  // This field does not co-exist with using_columns.
   Expression join_condition = 3;
```

Review Comment: why is this not optional?
[GitHub] [spark] peter-toth commented on pull request #37630: [SPARK-40193][SQL] Merge subquery plans with different filters
peter-toth commented on PR #37630: URL: https://github.com/apache/spark/pull/37630#issuecomment-1318682439

> @peter-toth Could you fix the conflicts again?

Sure, done.
[GitHub] [spark] cloud-fan commented on pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan commented on PR #38691: URL: https://github.com/apache/spark/pull/38691#issuecomment-1318810977

cc @viirya @dongjoon-hyun @gengliangwang @MaxGekk
[GitHub] [spark] cloud-fan closed pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order
cloud-fan closed pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order URL: https://github.com/apache/spark/pull/38684