[GitHub] [spark] amaliujia commented on pull request #38345: [SPARK-40879][CONNECT] Support Join UsingColumns in proto
amaliujia commented on PR #38345: URL: https://github.com/apache/spark/pull/38345#issuecomment-1287635120

R: @cloud-fan

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #38345: [SPARK-40879][CONNECT] Support Join UsingColumns in proto
amaliujia opened a new pull request, #38345: URL: https://github.com/apache/spark/pull/38345

### What changes were proposed in this pull request?
While refactoring the Connect proto tests from the Catalyst DSL to the DataFrame API, I identified that Join in Connect does not support `UsingColumns`. This is a gap between the Connect proto and the DataFrame API. It also blocks the refactoring work: without `UsingColumns`, there is no compatible DataFrame join API that we can convert the existing tests to. This PR adds support for Join's `UsingColumns`.

### Why are the changes needed?
1. Improve API coverage.
2. Unblock the test refactoring.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT
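For context, "using columns" joins (the Scala DataFrame API's `df1.join(df2, usingColumns, joinType)` overload) join on column names shared by both sides and keep each join column only once in the output. A minimal pure-Python sketch of that semantics (not Spark code; the function name and data are illustrative):

```python
# Sketch of "using columns" join semantics: both inputs share the join
# column(s), and the output carries each join column exactly once,
# unlike a join on a general boolean condition.
def join_using(left, right, using):
    out = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in using):
                row = dict(l)  # join columns appear once, from the left side
                row.update({k: v for k, v in r.items() if k not in using})
                out.append(row)
    return out

df1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
df2 = [{"id": 1, "b": "p"}]
print(join_using(df1, df2, ["id"]))  # [{'id': 1, 'a': 'x', 'b': 'p'}]
```

The proto change needs to carry exactly this extra piece of information: the list of column names, rather than a join condition expression.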
[GitHub] [spark] amaliujia commented on pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6'
amaliujia commented on PR #38338: URL: https://github.com/apache/spark/pull/38338#issuecomment-1287632262

Thanks for working on this!
[GitHub] [spark] zhengruifeng commented on pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6'
zhengruifeng commented on PR #38338: URL: https://github.com/apache/spark/pull/38338#issuecomment-1287607808

Merged into master
[GitHub] [spark] zhengruifeng closed pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6'
zhengruifeng closed pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6' URL: https://github.com/apache/spark/pull/38338
[GitHub] [spark] AmplabJenkins commented on pull request #38344: [SPARK-40777][SQL][SPARK-PROTOBUF] : Protobuf import support and move error-classes.
AmplabJenkins commented on PR #38344: URL: https://github.com/apache/spark/pull/38344#issuecomment-1287607239

Can one of the admins verify this patch?
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][SPARK-PROTOBUF] : Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1002318758

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
```diff
@@ -62,14 +63,16 @@ object SchemaConverters {
       case STRING => Some(StringType)
       case BYTE_STRING => Some(BinaryType)
       case ENUM => Some(StringType)
-      case MESSAGE if fd.getMessageType.getName == "Duration" =>
+      case MESSAGE if fd.getMessageType.getName == "Duration" &&
```

Review Comment: @rangadi My assumption is that users should be able to use the Timestamp and Duration message types with different fields.

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
```diff
@@ -196,27 +195,32 @@ private[sql] object ProtobufUtils extends Logging {
       fileDescriptorSet = DescriptorProtos.FileDescriptorSet.parseFrom(dscFile)
     } catch {
       case ex: InvalidProtocolBufferException =>
-        // TODO move all the exceptions to core/src/main/resources/error/error-classes.json
-        throw new RuntimeException("Error parsing descriptor byte[] into Descriptor object", ex)
+        throw QueryCompilationErrors.descrioptorParseError(ex)
       case ex: IOException =>
-        throw new RuntimeException(
-          "Error reading Protobuf descriptor file at path: " + descFilePath,
-          ex)
+        throw QueryCompilationErrors.cannotFindDescriptorFileError(descFilePath, ex)
     }
-    val descriptorProto: DescriptorProtos.FileDescriptorProto = fileDescriptorSet.getFile(0)
+    val descriptorProto: DescriptorProtos.FileDescriptorProto =
```

Review Comment: @rangadi This handles the protobuf import; please let me know if you know of a better approach.
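The `SchemaConverters` hunk above special-cases well-known protobuf message types by name. A plain-Python sketch of that dispatch (illustrative only, not the Spark implementation; the Catalyst type names follow spark-protobuf's documented mapping of `Duration` to a day-time interval and `Timestamp` to a timestamp):

```python
# Scalar protobuf field types map directly to Catalyst types; MESSAGE
# fields are dispatched on the message name, falling back to a struct
# for ordinary nested messages.
SCALAR_MAPPING = {
    "STRING": "StringType",
    "BYTE_STRING": "BinaryType",
    "ENUM": "StringType",
}

def catalyst_type(java_type, message_name=None):
    if java_type == "MESSAGE":
        if message_name == "Duration":
            return "DayTimeIntervalType"
        if message_name == "Timestamp":
            return "TimestampType"
        return "StructType"  # generic nested message
    return SCALAR_MAPPING.get(java_type)

print(catalyst_type("MESSAGE", "Duration"))  # DayTimeIntervalType
```

The review question is exactly about the added `&&` guard: whether the name match alone is enough, or whether the message's fields must also match the well-known type's shape.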
[GitHub] [spark] SandishKumarHN opened a new pull request, #38344: [SPARK-40777][SQL][SPARK-PROTOBUF] : Protobuf import support and move error-classes.
SandishKumarHN opened a new pull request, #38344: URL: https://github.com/apache/spark/pull/38344

This is the follow-up PR to https://github.com/apache/spark/pull/37972 and https://github.com/apache/spark/pull/38212

### What changes were proposed in this pull request?
1. Move spark-protobuf error classes to the Spark error-classes framework (core/src/main/resources/error/error-classes.json).
2. Support protobuf imports.
3. Validate protobuf Timestamp and Duration types.

### Why are the changes needed?
N/A

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Existing tests should cover the validation of this PR.

CC: @rangadi @mposdev21
[GitHub] [spark] Yikun commented on pull request #38343: [DONT MERGE] Test upgrade v6
Yikun commented on PR #38343: URL: https://github.com/apache/spark/pull/38343#issuecomment-1287577729

[screenshot: https://user-images.githubusercontent.com/1736354/197311334-c200479d-bc62-4749-a0d0-11acc0a3c5cc.png]

The notify workflow works well; see the results set correctly: https://github.com/apache/spark/pull/38343/checks?check_run_id=9041518603

Let's wait for the update-status result.
[GitHub] [spark] Yikun commented on pull request #38342: [SPARK-40870][INFRA] Upgrade docker actions to cleanup warning
Yikun commented on PR #38342: URL: https://github.com/apache/spark/pull/38342#issuecomment-1287576267

Just a rebase; let's wait for CI to pass.
[GitHub] [spark] Yikun opened a new pull request, #38342: [SPARK-40870][INFRA] Upgrade docker actions to cleanup warning
Yikun opened a new pull request, #38342: URL: https://github.com/apache/spark/pull/38342

### What changes were proposed in this pull request?
Upgrade docker actions to clean up warnings.

### Why are the changes needed?
- docker/setup-qemu-action from v1 to v2:
  - https://github.com/docker/setup-qemu-action/releases/tag/v2.0.0: cleanup Node 12 warning
  - https://github.com/docker/setup-qemu-action/releases/tag/v2.1.0: cleanup set-output/save-state warning
- docker/setup-buildx-action from v1 to v2:
  - https://github.com/docker/setup-buildx-action/releases/tag/v2.0.0: cleanup Node 12 warning
  - https://github.com/docker/setup-buildx-action/releases/tag/v2.1.0: cleanup set-output/save-state warning
- docker/build-push-action from v2 to v3:
  - https://github.com/docker/build-push-action/releases/tag/v3.0.0: cleanup Node 12 warning
  - https://github.com/docker/build-push-action/releases/tag/v3.2.0: cleanup set-output/save-state warning
- docker/login-action from v1 to v2:
  - https://github.com/docker/login-action/releases/tag/v2.0.0: cleanup Node 12 warning
  - https://github.com/docker/login-action/releases/tag/v2.1.0: cleanup set-output/save-state warning

### Does this PR introduce _any_ user-facing change?
No, dev only.

### How was this patch tested?
CI passed
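In workflow terms, these bumps amount to changing the `uses:` references. A hedged sketch (the job, step layout, secret names, and image tag here are illustrative, not copied from the PR diff; only the action versions reflect the upgrade):

```yaml
# Illustrative GitHub Actions job after the upgrade.
jobs:
  build-image:
    runs-on: ubuntu-latest
    steps:
      - uses: docker/setup-qemu-action@v2    # was @v1 (Node 12)
      - uses: docker/setup-buildx-action@v2  # was @v1
      - uses: docker/login-action@v2         # was @v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v3    # was @v2
        with:
          push: true
          tags: example/spark:latest
```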
[GitHub] [spark] HyukjinKwon closed pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled
HyukjinKwon closed pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled URL: https://github.com/apache/spark/pull/38334
[GitHub] [spark] HyukjinKwon commented on pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled
HyukjinKwon commented on PR #38334: URL: https://github.com/apache/spark/pull/38334#issuecomment-1287575193

I am merging this to master, but is the issue only in the master branch?
[GitHub] [spark] HyukjinKwon closed pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow
HyukjinKwon closed pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow URL: https://github.com/apache/spark/pull/38341
[GitHub] [spark] HyukjinKwon commented on pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow
HyukjinKwon commented on PR #38341: URL: https://github.com/apache/spark/pull/38341#issuecomment-1287574881

Merged to master.
[GitHub] [spark] Yikun opened a new pull request, #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow
Yikun opened a new pull request, #38341: URL: https://github.com/apache/spark/pull/38341

### What changes were proposed in this pull request?
Upgrade actions/github-script from v3 to v6 and fix the notify workflow.

### Why are the changes needed?
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/
- Since github-script v5, `github.*` changed to `github.rest.*`, but `request` and `paginate` are unchanged. See also https://github.com/actions/github-script#breaking-changes-in-v5
- Since github-script v6, node12 is upgraded to node16.

### Does this PR introduce _any_ user-facing change?
No, dev only.

### How was this patch tested?
Due to `pull_request_target`, the current PR does not take effect here, so we can only test locally: https://github.com/Yikun/spark/pull/181

Notify works well: [screenshot: https://user-images.githubusercontent.com/1736354/197310102-6c709716-8a99-422d-8d38-3f770b6925f0.png]

Update status is set to failed as expected: [screenshot: https://user-images.githubusercontent.com/1736354/197310119-30332769-0553-4ffa-816c-97a5ec0b3c27.png]

And `See test results` is set right: https://github.com/Yikun/spark/pull/181/checks?check_run_id=9029035780
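A hedged sketch of what the v3 to v6 migration looks like in a workflow step (the step and comment body are illustrative; the `github.rest.*` change is the one documented in the github-script breaking-changes note linked above):

```yaml
- uses: actions/github-script@v6   # was @v3; v6 runs on Node 16
  with:
    script: |
      // Since v5, Octokit REST methods live under github.rest.*
      // (github.request and github.paginate are unchanged).
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body: 'See test results',
      });
```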
[GitHub] [spark] Yikun commented on pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow
Yikun commented on PR #38341: URL: https://github.com/apache/spark/pull/38341#issuecomment-1287572112

cc @HyukjinKwon
[GitHub] [spark] zhengruifeng opened a new pull request, #38340: [SPARK-40877][SQL] Reimplement `crosstab` with dataframe operations
zhengruifeng opened a new pull request, #38340: URL: https://github.com/apache/spark/pull/38340

### What changes were proposed in this pull request?
Reimplement `crosstab` with DataFrame operations.

### Why are the changes needed?
1. The existing implementation collects all distinct `col1, col2` pairs to the driver, while `pivot` only collects the distinct `col2` values.
2. Do not truncate the SQL plan with RDD operations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs and manual checks.
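To make the motivation concrete, `crosstab` builds a contingency table: it counts `(col1, col2)` co-occurrences and turns the distinct `col2` values into columns, which is exactly the shape `pivot` produces. A plain-Python sketch of that computation (illustrative only, not the Spark patch):

```python
# Contingency table: count (col1, col2) pairs, pivot distinct col2
# values into columns. Only the distinct col2 values determine the
# output schema -- the per-pair counts can stay distributed.
from collections import Counter

def crosstab(rows):
    counts = Counter(rows)                        # (col1, col2) -> count
    col2_values = sorted({c2 for _, c2 in rows})  # pivoted columns
    table = {}
    for (c1, c2), n in counts.items():
        table.setdefault(c1, {v: 0 for v in col2_values})[c2] = n
    return table

rows = [("a", 1), ("a", 1), ("a", 2), ("b", 2)]
print(crosstab(rows))  # {'a': {1: 2, 2: 1}, 'b': {1: 0, 2: 1}}
```

This is why a `pivot`-based implementation only needs the distinct `col2` values on the driver, rather than every distinct pair.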
[GitHub] [spark] zhengruifeng commented on pull request #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme
zhengruifeng commented on PR #38339: URL: https://github.com/apache/spark/pull/38339#issuecomment-1287564103

Merged into master
[GitHub] [spark] zhengruifeng closed pull request #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme
zhengruifeng closed pull request #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme URL: https://github.com/apache/spark/pull/38339
[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
amaliujia commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002289255

## python/pyspark/sql/connect/README.md:
```diff
@@ -49,3 +49,8 @@ To use the release version of Spark Connect:
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation
+2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
+3. Run `./dev/generate_protos.sh`
```

Review Comment: oops, I copied this from my local command, which is not run from under the Spark repo root directory. @zhengruifeng maybe you can update it in the PR where you pin the dependency versions?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002289219

Review Comment (on python/pyspark/sql/connect/README.md, `3. Run ./dev/generate_protos.sh`): I fixed it in https://github.com/apache/spark/pull/38339
[GitHub] [spark] zhengruifeng opened a new pull request, #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme
zhengruifeng opened a new pull request, #38339: URL: https://github.com/apache/spark/pull/38339

### What changes were proposed in this pull request?
1. Fix the code-gen command.
2. Add the check command.

### Why are the changes needed?
Fix the wrong command.

### Does this PR introduce _any_ user-facing change?
Yes, users will use the correct command and know how to check.

### How was this patch tested?
Doc only.
[GitHub] [spark] zhengruifeng opened a new pull request, #38338: [TEST] pin 'grpcio==1.48.1' 'protobuf==4.21.6'
zhengruifeng opened a new pull request, #38338: URL: https://github.com/apache/spark/pull/38338

### What changes were proposed in this pull request?
Pin 'grpcio==1.48.1' and 'protobuf==4.21.6'.

### Why are the changes needed?
To avoid code-generation conflicts due to package upgrades.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002285288

Review Comment (on python/pyspark/sql/connect/README.md, `3. Run ./dev/generate_protos.sh`): @amaliujia
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002284563

Review Comment (on python/pyspark/sql/connect/README.md, `3. Run ./dev/generate_protos.sh`): oops, should be `./connector/connect/dev/generate_protos.sh`
[GitHub] [spark] Yaohua628 commented on pull request #38337: [SPARK-39404][SS][3.3] Minor fix for querying _metadata in streaming
Yaohua628 commented on PR #38337: URL: https://github.com/apache/spark/pull/38337#issuecomment-1287559471

cc: @HeartSaVioR @felipepessoto
[GitHub] [spark] Yaohua628 opened a new pull request, #38337: [SPARK-39404][SS][3.3] Minor fix for querying _metadata in streaming
Yaohua628 opened a new pull request, #38337: URL: https://github.com/apache/spark/pull/38337

### What changes were proposed in this pull request?
Cherry-pick https://github.com/apache/spark/pull/36801

### Why are the changes needed?
Cherry-pick https://github.com/apache/spark/pull/36801

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
N/A
[GitHub] [spark] github-actions[bot] closed pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided
github-actions[bot] closed pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided URL: https://github.com/apache/spark/pull/36893
[GitHub] [spark] github-actions[bot] closed pull request #37153: [SPARK-26052] Add type comments to exposed Prometheus metrics
github-actions[bot] closed pull request #37153: [SPARK-26052] Add type comments to exposed Prometheus metrics URL: https://github.com/apache/spark/pull/37153
[GitHub] [spark] zhengruifeng commented on pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng commented on PR #38335: URL: https://github.com/apache/spark/pull/38335#issuecomment-1287552170

merged into master, thank you @amaliujia
[GitHub] [spark] zhengruifeng closed pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng closed pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client URL: https://github.com/apache/spark/pull/38335
[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
amaliujia commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002270515 ## python/pyspark/sql/connect/README.md: ## @@ -49,3 +49,7 @@ To use the release version of Spark Connect: ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic' ``` +## Generate proto generated files for the Python client +1. Install `buf`: https://docs.buf.build/installation Review Comment: makes sense. I updated this doc with versions.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002269337 ## python/pyspark/sql/connect/README.md: ## @@ -49,3 +49,7 @@ To use the release version of Spark Connect: ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic' ``` +## Generate proto generated files for the Python client +1. Install `buf`: https://docs.buf.build/installation Review Comment: `buf`'s version in CI is `1.8.0`, the latest one is `1.9.0`. I think we also need to mention the version
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
zhengruifeng commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002269147 ## python/pyspark/sql/connect/README.md: ## @@ -49,3 +49,7 @@ To use the release version of Spark Connect: ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic' ``` +## Generate proto generated files for the Python client +1. Install `buf`: https://docs.buf.build/installation Review Comment: I think we also need to mention the related python packages: grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0
[GitHub] [spark] zhengruifeng commented on pull request #38313: [SPARK-40849][SS] Async log purge
zhengruifeng commented on PR #38313: URL: https://github.com/apache/spark/pull/38313#issuecomment-1287547087 the failed python code gen check is unrelated to this PR, please rebase to make CI green
[GitHub] [spark] allisonwang-db opened a new pull request, #38336: [SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery
allisonwang-db opened a new pull request, #38336: URL: https://github.com/apache/spark/pull/38336 ### What changes were proposed in this pull request? This PR updates `splitSubquery` in `RewriteCorrelatedScalarSubquery` to support non-aggregated one-row subqueries. In CheckAnalysis, we allow three types of correlated scalar subquery patterns: 1. SubqueryAlias/Project + Aggregate 2. SubqueryAlias/Project + Filter + Aggregate 3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1) https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856 We should support the third case in `splitSubquery` to avoid `Unexpected operator` exceptions. ### Why are the changes needed? To fix an issue with correlated subquery rewrite. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests.
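The third pattern can be illustrated with a small example. The following is a hypothetical sketch of the non-aggregated, one-row case, using SQLite's `LIMIT 1` in place of Spark's `maxRows <= 1` guarantee (SQLite stands in for Spark SQL purely as an illustration):

```python
import sqlite3

# A correlated scalar subquery whose single-row guarantee comes from
# LIMIT 1 rather than an aggregate -- the shape of "case 3"
# (SubqueryAlias/Project + LogicalPlan with maxRows <= 1) listed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(a INTEGER);
    CREATE TABLE t2(a INTEGER, b INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (1, 10), (2, 20);
""")
rows = conn.execute("""
    SELECT a,
           (SELECT b FROM t2 WHERE t2.a = t1.a LIMIT 1) AS b
    FROM t1 ORDER BY a
""").fetchall()
print(rows)  # [(1, 10), (2, 20)]
```

Before the fix, this shape of query (one-row guarantee without an aggregate on top) is the one that could hit the `Unexpected operator` path during the rewrite.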
[GitHub] [spark] sunchao commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
sunchao commented on code in PR #38312: URL: https://github.com/apache/spark/pull/38312#discussion_r1002187497 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala: ## @@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter( } else { TimestampNTZType } + case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS => Review Comment: Makes sense. I'm OK with it.
[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
amaliujia commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002181146 ## dev/check-codegen-python.py: ## @@ -75,7 +75,9 @@ def check_connect_protos(): else: fail( "Generated files for pyspark-connect are out of sync! " -"Please run ./connector/connect/dev/generate_protos.sh" +"If you have touched files under connector/connect/src/main/protobuf, " +"please run ./connector/connect/dev/generate_protos.sh. " +"If you haven't touched any file above, please rebase your PR against main branch." Review Comment: This is to help pass CI/CD. CI/CD syncs to the latest main branch, if I understand it right, and then we hit the scenario above: https://user-images.githubusercontent.com/1938382/197287176-e92cf3f5-7dc8-4273-bc8d-e3e0272327b3.png If we don't consider CI/CD, local development has no issue.
[GitHub] [spark] rangadi commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
rangadi commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002179251 ## dev/check-codegen-python.py: ## @@ -75,7 +75,9 @@ def check_connect_protos(): else: fail( "Generated files for pyspark-connect are out of sync! " -"Please run ./connector/connect/dev/generate_protos.sh" +"If you have touched files under connector/connect/src/main/protobuf, " +"please run ./connector/connect/dev/generate_protos.sh. " +"If you haven't touched any file above, please rebase your PR against main branch." Review Comment: Is this for CI/CD or for local tests? Rebasing helps the local branch, yes, but not CI/CD.
[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
amaliujia commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002175956 ## dev/check-codegen-python.py: ## @@ -75,7 +75,9 @@ def check_connect_protos(): else: fail( "Generated files for pyspark-connect are out of sync! " -"Please run ./connector/connect/dev/generate_protos.sh" +"If you have touched files under connector/connect/src/main/protobuf, " +"please run ./connector/connect/dev/generate_protos.sh. " +"If you haven't touched any file above, please rebase your PR against main branch." Review Comment: There are two layers of things here: the proto and its generated files. This script generates files from the proto itself, and the proto might be updated in the main branch, so the generated files upstream now differ from the files in this local branch, thus the check fails. When merging with the main branch, the generated files might not encounter conflicts, but I guess that does not mean they match when one tries to re-generate them from the proto.
[GitHub] [spark] rangadi commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
rangadi commented on code in PR #38335: URL: https://github.com/apache/spark/pull/38335#discussion_r1002173180 ## dev/check-codegen-python.py: ## @@ -75,7 +75,9 @@ def check_connect_protos(): else: fail( "Generated files for pyspark-connect are out of sync! " -"Please run ./connector/connect/dev/generate_protos.sh" +"If you have touched files under connector/connect/src/main/protobuf, " +"please run ./connector/connect/dev/generate_protos.sh. " +"If you haven't touched any file above, please rebase your PR against main branch." Review Comment: How does rebase help? I think ci/cd merges with master.
[GitHub] [spark] amaliujia commented on pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
amaliujia commented on PR #38335: URL: https://github.com/apache/spark/pull/38335#issuecomment-1287367746 R: @zhengruifeng @HyukjinKwon cc: @HeartSaVioR @rangadi
[GitHub] [spark] amaliujia opened a new pull request, #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client
amaliujia opened a new pull request, #38335: URL: https://github.com/apache/spark/pull/38335 ### What changes were proposed in this pull request? 1. Improve the README on how to install the `buf` dependency and then run the proto generated file script. 2. Improve the message for the out-of-sync check script. ### Why are the changes needed? Improve the developer experience, both when touching Connect proto files and when not touching those files but having the master branch out of sync with the local branch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing checks.
[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`
amaliujia commented on code in PR #38318: URL: https://github.com/apache/spark/pull/38318#discussion_r1002094592 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) { transformRelation(rel.getInput)) } + private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = { +val child = transformRelation(rel.getInput) + +rel.getFunctionCase match { + case proto.DataFrameFunction.FunctionCase.SUMMARY => +StatFunctions + .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq) Review Comment: +1! I don't know how to add a new plan. It would be very useful to have a PR as an example.
[GitHub] [spark] amaliujia commented on pull request #38320: [SPARK-40857] [CONNECT] Enable configurable GPRC Interceptors.
amaliujia commented on PR #38320: URL: https://github.com/apache/spark/pull/38320#issuecomment-1287316245 LGTM
[GitHub] [spark] amaliujia commented on pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.
amaliujia commented on PR #38300: URL: https://github.com/apache/spark/pull/38300#issuecomment-1287280054 Thanks for working on this!
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r1002043949 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWriter.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.write; + +import java.io.IOException; + +import org.apache.spark.annotation.Experimental; + +/** + * A data writer returned by {@link DeltaWriterFactory#createWriter(int, long)} and is + * responsible for writing a delta of rows. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWriter<T> extends DataWriter<T> { Review Comment: Let me take a closer look on Monday.
[GitHub] [spark] jerrypeng commented on pull request #38313: [SPARK-40849][SS] Async log purge
jerrypeng commented on PR #38313: URL: https://github.com/apache/spark/pull/38313#issuecomment-1287254708 thanks @HeartSaVioR !
[GitHub] [spark] Yaohua628 commented on pull request #36801: [SPARK-39404][SS] Minor fix for querying `_metadata` in streaming
Yaohua628 commented on PR #36801: URL: https://github.com/apache/spark/pull/36801#issuecomment-1287241165 > @HeartSaVioR @Yaohua628, I don't see this commit in the 3.3.1-rc4 branch, while we have more recent commits, e.g.: [946a960](https://github.com/apache/spark/commit/946a96022a54f298fb6081102434fe575126266b) in RC4 > > I'm wondering if there is any particular reason not to include this fix. > > Thanks Ah, good catch! I guess we missed merging it to 3.3. I will have a backport PR shortly cc @HeartSaVioR
[GitHub] [spark] felipepessoto commented on pull request #36801: [SPARK-39404][SS] Minor fix for querying `_metadata` in streaming
felipepessoto commented on PR #36801: URL: https://github.com/apache/spark/pull/36801#issuecomment-1287233897 @HeartSaVioR @Yaohua628, I don't see this commit in the 3.3.1-rc4 branch, while we have more recent commits, e.g.: https://github.com/apache/spark/commit/946a96022a54f298fb6081102434fe575126266b in RC4. I'm wondering if there is any particular reason not to include this fix. Thanks
[GitHub] [spark] philwalk commented on pull request #38228: [SPARK-40739][SPARK-40738] Fixes for cygwin/msys2/mingw sbt build and bash scripts
philwalk commented on PR #38228: URL: https://github.com/apache/spark/pull/38228#issuecomment-1287213144 When I try to add a repository via "Manage Actions Access", I get this: `Can't grant permissions to unknown collaborator `
[GitHub] [spark] philwalk commented on pull request #38228: [SPARK-40739][SPARK-40738] Fixes for cygwin/msys2/mingw sbt build and bash scripts
philwalk commented on PR #38228: URL: https://github.com/apache/spark/pull/38228#issuecomment-1287202796 > @vitaliili-db Did you set any permission by yourself manually before? I just had a try (forked another apache project); the default permission is `Read and Write`. The `Workflow permissions` are already set to the default: "Read and Write permissions". I will do more experiments ...
[GitHub] [spark] tomtongue commented on a diff in pull request #38268: [SPARK-40804][SQL] Missing handling a catalog name in destination tables in RenameTableExec
tomtongue commented on code in PR #38268: URL: https://github.com/apache/spark/pull/38268#discussion_r997128018 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableRenameSuite.scala: ## @@ -34,4 +34,15 @@ class AlterTableRenameSuite extends command.AlterTableRenameSuiteBase with Comma } } } + + test("include catalog in the destination table") { +withNamespaceAndTable("ns", "dst_tbl", catalog) { dst => + val src = dst.replace("dst", "src") + sql(s"CREATE TABLE $src (c0 INT) $defaultUsing") + sql(s"INSERT INTO $src SELECT 0") + + sql(s"ALTER TABLE $src RENAME TO $catalog.ns.dst_tbl") Review Comment: Thanks for checking this. I added the different catalog case, updated the current test, and pushed them. @amaliujia could you check the latest one when you have time?
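The rename-and-read behavior that the test above exercises can be mimicked outside Spark. A minimal sketch using SQLite (which has no catalog or namespace qualifiers, so the catalog-handling part of the fix itself is not shown, only the bare rename semantics):

```python
import sqlite3

# Mimic the quoted test: create a source table, insert a row, rename it,
# then read it back under the destination name. Table names mirror the
# test above; SQLite lacks a catalog qualifier, so that part is elided.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_tbl (c0 INTEGER)")
conn.execute("INSERT INTO src_tbl SELECT 0")
conn.execute("ALTER TABLE src_tbl RENAME TO dst_tbl")
rows = conn.execute("SELECT c0 FROM dst_tbl").fetchall()
print(rows)  # [(0,)]
```

In Spark's v2 command path the destination can additionally carry a catalog name (`$catalog.ns.dst_tbl`), which is the case SPARK-40804 fixes in `RenameTableExec`.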
[GitHub] [spark] sunchao commented on a diff in pull request #38277: [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is enab
sunchao commented on code in PR #38277: URL: https://github.com/apache/spark/pull/38277#discussion_r1001963440

## sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java:

@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.fs.ContentSummary;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+/**
+ * Delegate for SymlinkTextInputFormat, created to address SPARK-40815.
+ * Fixes an issue where SymlinkTextInputFormat returns empty splits which could result in
+ * the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled.
+ *
+ * In this class, we update the split start and length to match the target file input thus fixing
+ * the issue.
+ */
+@SuppressWarnings("deprecation")
+public class DelegateSymlinkTextInputFormat extends SymlinkTextInputFormat {
+
+  public static class DelegateSymlinkTextInputSplit extends FileSplit {
+    private final SymlinkTextInputSplit split;
+
+    public DelegateSymlinkTextInputSplit() {
+      super((Path) null, 0, 0, (String[]) null);
+      split = new SymlinkTextInputSplit();
+    }
+
+    public DelegateSymlinkTextInputSplit(Path symlinkPath, SymlinkTextInputSplit split) throws IOException {
+      // It is fine to set start and length to the target file split because
+      // SymlinkTextInputFormat maintains 1-1 mapping between SymlinkTextInputSplit and FileSplit.
+      super(symlinkPath,
+        split.getTargetSplit().getStart(),
+        split.getTargetSplit().getLength(),
+        split.getTargetSplit().getLocations());
+      this.split = split;
+    }
+
+    /**
+     * Returns delegate input split.
+     */
+    private SymlinkTextInputSplit getSplit() {
+      return split;
+    }
+
+    @Override
+    public void write(DataOutput out) throws IOException {
+      super.write(out);

Review Comment: hmm won't this call `FileSplit.write` twice?
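sunchao's concern can be illustrated outside Hadoop: if a delegating split's `write` serializes the base-class fields and then also delegates to a wrapped split that serializes the same fields, the byte stream a reader expects is desynchronized. A minimal Python sketch of that failure mode (hypothetical, simplified classes, not the actual Hadoop `Writable` API):

```python
import io
import struct

class FileSplitLike:
    # Stand-in for Hadoop's FileSplit Writable (hypothetical, simplified).
    def __init__(self, start=0, length=0):
        self.start, self.length = start, length

    def write(self, out):
        # Serialize start and length as two big-endian 64-bit ints (16 bytes).
        out.write(struct.pack(">qq", self.start, self.length))

class DelegateSplit(FileSplitLike):
    def __init__(self, inner):
        super().__init__(inner.start, inner.length)
        self.inner = inner

    def write(self, out):
        super().write(out)      # writes start/length once...
        self.inner.write(out)   # ...and the wrapped split writes them again

buf = io.BytesIO()
DelegateSplit(FileSplitLike(10, 20)).write(buf)
# A reader expecting one split's worth of bytes (16) now finds 32,
# leaving stale bytes in the stream for whatever is deserialized next.
assert len(buf.getvalue()) == 32
```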
[GitHub] [spark] peter-toth opened a new pull request, #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled
peter-toth opened a new pull request, #38334: URL: https://github.com/apache/spark/pull/38334

### What changes were proposed in this pull request?

This PR fixes a bug in broadcast handling in `PythonRunner` when encryption is enabled. Due to this bug the following pyspark script:
```
bin/pyspark --conf spark.io.encryption.enabled=true
...
bar = {"a": "aa", "b": "bb"}
foo = spark.sparkContext.broadcast(bar)
spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
```
fails with:
```
22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command
    command = serializer._read_with_length(file)
  File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
    return self.loads(obj)
  File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads
    return cloudpickle.loads(obj, encoding=encoding)
EOFError: Ran out of input
```
The reason for this failure is that we have multiple Python UDFs referencing the same broadcast, and in the current code: https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L385-L420 the number of broadcasts (`cnt`) is correct (1), but the broadcast id is serialized 2 times from JVM to Python, ruining the next item that Python expects from the JVM side. Please note that the example above works in Spark 3.3 without this fix.
That is because https://github.com/apache/spark/pull/36121 in Spark 3.4 modified `ExpressionSet`, and so `udfs` in `ExtractPythonUDFs`: https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala#L239-L242 changed from a `Stream` to a `Vector`. When `broadcastVars` (and so `idsAndFiles`) is a `Stream`, the example accidentally works, as the broadcast id is written to `dataOut` once (`oldBids.add(id)` in `idsAndFiles.foreach` is called before the 2nd item is calculated in `broadcastVars.flatMap`). But that doesn't mean that https://github.com/apache/spark/pull/36121 introduced the regression, as `EncryptedPythonBroadcastServer` shouldn't serve the broadcast data 2 times (which it currently does, unnoticed), as that could fail other cases where more than 1 broadcast is used in UDFs.

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new UT.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
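The Stream-vs-Vector behavior described above can be mimicked in plain Python with a generator (lazy, like Scala's `Stream`) versus a materialized list (eager, like `Vector`). The names below are only analogies to `oldBids`, `broadcastVars.flatMap`, and `idsAndFiles.foreach`, not Spark's actual code:

```python
seen = set()      # analogous to oldBids
written = []      # analogous to the broadcast ids serialized to dataOut

def new_ids(broadcasts):
    # analogous to broadcastVars.flatMap: emit only ids not yet seen
    for b in broadcasts:
        if b not in seen:
            yield b

def serialize(ids):
    # analogous to idsAndFiles.foreach: write the id, then record it
    for i in ids:
        written.append(i)
        seen.add(i)

# Two UDFs referencing the same broadcast, id 7:
serialize(new_ids([7, 7]))           # lazy: checks interleave -> written once
assert written == [7]

seen.clear(); written.clear()
serialize(list(new_ids([7, 7])))     # eager: both ids pass the check first -> written twice
assert written == [7, 7]
```

The lazy variant only "works" by accident of evaluation order, which is why the PR fixes the duplicate serving itself rather than relying on laziness.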
[GitHub] [spark] bjornjorgensen commented on pull request #38262: [SPARK-40801][BUILD] Upgrade `Apache commons-text` to 1.10
bjornjorgensen commented on PR #38262: URL: https://github.com/apache/spark/pull/38262#issuecomment-1287127138 @dongjoon-hyun and @wangyum this is a [9.8 CRITICAL](https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?name=CVE-2022-42889&vector=AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H&version=3.1&source=NIST), so users [are starting to ask if we can get this to other branches](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-40861)
[GitHub] [spark] AmplabJenkins commented on pull request #38324: [Temp] Protobuf generate V2 and V3 protos and extend tests.
AmplabJenkins commented on PR #38324: URL: https://github.com/apache/spark/pull/38324#issuecomment-1287089218 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38330: [SPARK-40868][SQL] Avoid introducing too many partitions when bucketed scan disabled by sql planner
AmplabJenkins commented on PR #38330: URL: https://github.com/apache/spark/pull/38330#issuecomment-1287089093 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38327: [SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve Deduplicate in Python client
AmplabJenkins commented on PR #38327: URL: https://github.com/apache/spark/pull/38327#issuecomment-1287089151 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38331: [SPARK-40869][K8S] Resource name prefix should not start with a -
AmplabJenkins commented on PR #38331: URL: https://github.com/apache/spark/pull/38331#issuecomment-1287089036 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38332: [WIP][SPARK-40750][SQL] Migrate type check failures of math expressions onto error classes
AmplabJenkins commented on PR #38332: URL: https://github.com/apache/spark/pull/38332#issuecomment-1287088983 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
AmplabJenkins commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1287088925 Can one of the admins verify this patch?
[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
mridulm commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1287007388 +CC @otterc
[GitHub] [spark] codewithmehul commented on pull request #31147: [SPARK-32380][SQL]fixed spark3.0 access hive table while data in hbase problem
codewithmehul commented on PR #31147: URL: https://github.com/apache/spark/pull/31147#issuecomment-1287007207 Is this change available in an official release?
[GitHub] [spark] srowen commented on pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12
srowen commented on PR #38328: URL: https://github.com/apache/spark/pull/38328#issuecomment-1286999833 Merged to master
[GitHub] [spark] srowen closed pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12
srowen closed pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12 URL: https://github.com/apache/spark/pull/38328
[GitHub] [spark] srowen commented on pull request #38329: [SPARK-40865][BUILD] Upgrade jodatime to 2.12.0
srowen commented on PR #38329: URL: https://github.com/apache/spark/pull/38329#issuecomment-1286998754 Merged to master
[GitHub] [spark] srowen closed pull request #38329: [SPARK-40865][BUILD] Upgrade jodatime to 2.12.0
srowen closed pull request #38329: [SPARK-40865][BUILD] Upgrade jodatime to 2.12.0 URL: https://github.com/apache/spark/pull/38329
[GitHub] [spark] LucaCanali commented on a diff in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Python UDFs using SQL metrics
LucaCanali commented on code in PR #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r1001798239

## sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala:

@@ -94,6 +94,8 @@ case class AggregateInPandasExec(
   }

   override protected def doExecute(): RDD[InternalRow] = {
+    metrics // force lazy init at driver

Review Comment: Good point, I have now removed it, thanks.
[GitHub] [spark] steveloughran commented on a diff in pull request #37075: [SparkConnect] Initial Protobuf Definitions
steveloughran commented on code in PR #37075: URL: https://github.com/apache/spark/pull/37075#discussion_r1001761745

## pom.xml:

@@ -116,7 +116,7 @@
     2.17.2
     3.3.3
-    2.5.0
+    3.21.1

Review Comment: shaded, yes. If unshaded, then if you update protobuf.jar, all .class files compiled with the older version of protobuf are unlikely to link, let alone work.
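As a sketch of what shading means in this discussion: the build relocates the protobuf classes into a private package so user code compiled against a different protobuf version cannot clash with them at link time. An illustrative maven-shade-plugin fragment (the relocation target name here is an assumption for illustration, not Spark's actual build configuration):

```xml
<!-- Illustrative sketch: relocate com.google.protobuf into a private
     namespace so the bundled protobuf cannot conflict with the user's. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.protobuf</pattern>
        <shadedPattern>org.sparkproject.connect.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With such a relocation, updating the bundled protobuf version only affects the relocated copy, which is why the linkage concern above applies to the unshaded case.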
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38288: [SPARK-40821][SQL][CORE][PYTHON][SS] Introduce window_time function to extract event time from the window column
HeartSaVioR commented on code in PR #38288: URL: https://github.com/apache/spark/pull/38288#discussion_r1001755349

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:

@@ -4201,6 +4219,73 @@ object SessionWindowing extends Rule[LogicalPlan] {
   }
 }

+/**
+ * Resolves the window_time expression which extracts the correct window time from the
+ * window column generated as the output of the window aggregating operators. The
+ * window column is of type struct { start: TimestampType, end: TimestampType }.
+ * The correct window time for further aggregations is window.end - 1.
+ */
+object ResolveWindowTime extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
+    case p: LogicalPlan if p.children.size == 1 =>
+      val child = p.children.head
+      val windowTimeExpressions =
+        p.expressions.flatMap(_.collect { case w: WindowTime => w }).toSet
+
+      if (windowTimeExpressions.size == 1 &&

Review Comment: The test case should construct two different windows and call window_time per each window to "fail".
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38288: [SPARK-40821][SQL][CORE][PYTHON][SS] Introduce window_time function to extract event time from the window column
HeartSaVioR commented on code in PR #38288: URL: https://github.com/apache/spark/pull/38288#discussion_r1001749087

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala:

@@ -575,4 +575,61 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
     validateWindowColumnInSchema(schema2, "window")
   }
 }
+
+  test("window_time function on raw window column") {
+    val df = Seq(
+      ("2016-03-27 19:38:18"), ("2016-03-27 19:39:25")
+    ).toDF("time")
+
+    checkAnswer(
+      df.select(window($"time", "10 seconds").as("window"))
+        .select(
+          $"window.end".cast("string"),
+          window_time($"window").cast("string")
+        ),
+      Seq(
+        Row("2016-03-27 19:38:20", "2016-03-27 19:38:19.99"),
+        Row("2016-03-27 19:39:30", "2016-03-27 19:39:29.99")
+      )
+    )
+  }
+
+  test("2 window_time functions on raw window column") {
+    val df = Seq(
+      ("2016-03-27 19:38:18"), ("2016-03-27 19:39:25")
+    ).toDF("time")
+
+    checkAnswer(
+      df.select(window($"time", "10 seconds").as("window"))
+        .select(
+          $"window.end".cast("string"),
+          window_time($"window").cast("string").as("window1"),

Review Comment: OK this was pretty straightforward... Refer to the definition of WindowTime: `case class WindowTime(windowColumn: Expression)`. We apply `.toSet` to collect the instances with deduplication, so as long as the windowColumn is resolved as the same, they are considered the same.
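The `.toSet` deduplication described above relies on case-class structural equality of the wrapped column expression. A rough Python analogy using frozen dataclasses (illustrative names, not Spark's actual classes):

```python
from dataclasses import dataclass

# Case-class-style structural equality: two WindowTime nodes over the same
# resolved column compare equal and collapse to one element in a set.
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class WindowTime:
    window_column: Column

exprs = [WindowTime(Column("window")), WindowTime(Column("window"))]
assert len(set(exprs)) == 1   # same column -> deduplicated, rule resolves

exprs2 = [WindowTime(Column("w1")), WindowTime(Column("w2"))]
assert len(set(exprs2)) == 2  # two different windows -> more than one element
```

This is why two `window_time` calls over the same window column resolve fine, while a test that wants the size check to trip needs two genuinely different windows.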
[GitHub] [spark] gaoyajun02 opened a new pull request, #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
gaoyajun02 opened a new pull request, #38333: URL: https://github.com/apache/spark/pull/38333

### What changes were proposed in this pull request?

A large number of shuffle tests in our cluster show that bad nodes with chunk corruption can return zero-size shuffle chunks. In this case, we can fall back to the original shuffle blocks.

### Why are the changes needed?

When the reduce task fetches merged data and hits a zero-size buffer error, we let it fall back to the original shuffle blocks. After verification, these blocks can be read successfully without a shuffle retry.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT
[GitHub] [spark] HyukjinKwon closed pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.
HyukjinKwon closed pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection. URL: https://github.com/apache/spark/pull/38300
[GitHub] [spark] panbingkun opened a new pull request, #38332: [SPARK-40750][SQL] Migrate type check failures of math expressions onto error classes
panbingkun opened a new pull request, #38332: URL: https://github.com/apache/spark/pull/38332

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- Add new UT.
- Update existing UT.
- Pass GA.
[GitHub] [spark] HyukjinKwon commented on pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.
HyukjinKwon commented on PR #38300: URL: https://github.com/apache/spark/pull/38300#issuecomment-1286853167 All related tests passed. Merged to master.
[GitHub] [spark] zhengruifeng closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`
zhengruifeng closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary` URL: https://github.com/apache/spark/pull/38318
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`
zhengruifeng commented on code in PR #38318: URL: https://github.com/apache/spark/pull/38318#discussion_r1001680211

## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:

@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
     transformRelation(rel.getInput))
   }

+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment: yes, then it will have more optimization space. let us add a new plan for it. thanks
[GitHub] [spark] HyukjinKwon closed pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation
HyukjinKwon closed pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation URL: https://github.com/apache/spark/pull/38307
[GitHub] [spark] HyukjinKwon commented on pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation
HyukjinKwon commented on PR #38307: URL: https://github.com/apache/spark/pull/38307#issuecomment-1286821880 Merged to master.
[GitHub] [spark] EnricoMi commented on pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation
EnricoMi commented on PR #38307: URL: https://github.com/apache/spark/pull/38307#issuecomment-1286806013 Rebased with master, all green.
[GitHub] [spark] zhengruifeng commented on pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`
zhengruifeng commented on PR #38310: URL: https://github.com/apache/spark/pull/38310#issuecomment-1286804042 thank you guys
[GitHub] [spark] tobiasstadler opened a new pull request, #38331: [SPARK-40869][K8S] Resource name prefix should not start with a -
tobiasstadler opened a new pull request, #38331: URL: https://github.com/apache/spark/pull/38331

### What changes were proposed in this pull request?

Strip the leading `-` from the resource name prefix.

### Why are the changes needed?

A leading `-` is not allowed in the resource name prefix (especially `spark.kubernetes.executor.podNamePrefix`).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test
[GitHub] [spark] xingchaozh opened a new pull request, #38330: [SPARK-40868][SQL] Avoid introducing too many partitions when bucketed scan disabled by sql planner
xingchaozh opened a new pull request, #38330: URL: https://github.com/apache/spark/pull/38330

### What changes were proposed in this pull request?

For bucketed tables of huge size, many partitions may be generated if bucketed scan is disabled by the SQL planner. We can add a limit (defaulting to BUCKETING_MAX_BUCKETS) to reduce the partitions for non-bucketed scans.

### Why are the changes needed?

Avoid introducing too many tasks in a single stage.

Before
![image](https://user-images.githubusercontent.com/7522130/197170079-35ee13f2-1d6c-425b-bfdd-5c4ac3c8297c.png)

After
![image](https://user-images.githubusercontent.com/7522130/197170156-556f839b-8fde-4f29-862b-5a6d90bd0a8e.png)

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT
[GitHub] [spark] EnricoMi commented on pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on PR #38223: URL: https://github.com/apache/spark/pull/38223#issuecomment-1286734504 I have also limited the schema string in the error message to 1024 characters / missing and unexpected columns to 5.
[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1001606273

## python/pyspark/sql/pandas/serializers.py:

@@ -216,7 +216,7 @@ def _create_batch(self, series):
         series = [series]
     series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)

-        def create_array(s, t):
+        def create_array(s, t, n):

Review Comment: Done.
[GitHub] [spark] plokhotnyuk commented on pull request #37604: [DON'T MERGE] Try to replace all `json4s` with `Jackson`
plokhotnyuk commented on PR #37604: URL: https://github.com/apache/spark/pull/37604#issuecomment-1286732347

> [Jackson is much safer than json4s](https://github.com/json4s/json4s/search?q=Denial+of+Service=open=issues)

I was wrong; it seems that jackson-core is vulnerable too: https://github.com/FasterXML/jackson-module-scala/issues/609
[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1001564996

## python/pyspark/sql/pandas/serializers.py: ##

```diff
@@ -216,7 +216,7 @@ def _create_batch(self, series):
             series = [series]
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)

-        def create_array(s, t):
+        def create_array(s, t, n):
```

Review Comment: But `s[s.columns[i]].rename(field.name)` fixes that. Thanks for the idea!
[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1001556844

## python/pyspark/sql/pandas/serializers.py: ##

```diff
@@ -216,7 +216,7 @@ def _create_batch(self, series):
             series = [series]
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)

-        def create_array(s, t):
+        def create_array(s, t, n):
```

Review Comment: I know this parameter is ugly, as it is only used for the exception, but `s.name` is `1` instead of `mean` (test `test_apply_in_pandas_returning_incompatible_type`) when I apply this change. It looks like `s[field.name]` and `s[s.columns[i]]` return a `Series` that thinks its name is `1`. Sad.
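The naming behavior described here can be reproduced with a plain pandas DataFrame that has positional (integer) column labels, as a UDF result can; the `"mean"` field name below is illustrative. `rename` restores the intended schema field name:

```python
import pandas as pd

# A DataFrame with positional (integer) column labels, as a UDF might return.
df = pd.DataFrame([[1, 2.5], [2, 3.5]])

# Selecting by position yields a Series whose name is the integer label 1 ...
s = df[df.columns[1]]
assert s.name == 1

# ... so the Series must be renamed to the schema field's name explicitly.
s = s.rename("mean")
assert s.name == "mean"
```

This is why `s[s.columns[i]].rename(field.name)` works where the bare selection does not.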
[GitHub] [spark-docker] Yikun commented on pull request #17: [SPARK-40864] Remove pip/setuptools dynamic upgrade
Yikun commented on PR #17: URL: https://github.com/apache/spark-docker/pull/17#issuecomment-1286673991 @HyukjinKwon Thanks
[GitHub] [spark-docker] Yikun closed pull request #17: [SPARK-40864] Remove pip/setuptools dynamic upgrade
Yikun closed pull request #17: [SPARK-40864] Remove pip/setuptools dynamic upgrade URL: https://github.com/apache/spark-docker/pull/17
[GitHub] [spark] LuciferYang commented on pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12
LuciferYang commented on PR #38328: URL: https://github.com/apache/spark/pull/38328#issuecomment-1286665503

```
[info] ProtobufCatalystDataConversionSuite:
[info] - single StructType(StructField(int32_type,IntegerType,true)) with seed 167 *** FAILED *** (39 milliseconds)
[info]   Incorrect evaluation (codegen off): from_protobuf(to_protobuf([0], /home/runner/work/spark/spark/connector/protobuf/target/scala-2.12/test-classes/protobuf/catalyst_types.desc, IntegerMsg), /home/runner/work/spark/spark/connector/protobuf/target/scala-2.12/test-classes/protobuf/catalyst_types.desc, IntegerMsg), actual: [null], expected: [0] (ExpressionEvalHelper.scala:209)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
[info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
[info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199)
[info]   at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkEvaluationWithoutCodegen(ProtobufCatalystDataConversionSuite.scala:33)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87)
[info]   at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
[info]   at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkEvaluation(ProtobufCatalystDataConversionSuite.scala:33)
[info]   at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkResult(ProtobufCatalystDataConversionSuite.scala:43)
[info]   at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:122)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:66)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at
```
[GitHub] [spark-docker] dcoliversun opened a new pull request, #19: [SPARK-40855] Add CONTRIBUTING.md for apache/spark-docker
dcoliversun opened a new pull request, #19: URL: https://github.com/apache/spark-docker/pull/19 ### What changes were proposed in this pull request? This PR aims to add `CONTRIBUTING.md` for apache/spark-docker. ### Why are the changes needed? It is better to briefly explain how to contribute to the DOI (Docker Official Image). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ![image](https://user-images.githubusercontent.com/44011673/197155544-bfae0c70-ee01-44b0-851d-ed5c288129d9.png)
[GitHub] [spark] grundprinzip commented on a diff in pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.
grundprinzip commented on code in PR #38300: URL: https://github.com/apache/spark/pull/38300#discussion_r1001541171

## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ##

```diff
@@ -57,21 +60,67 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[Response]) exte
     processRows(request.getClientId, rows)
   }

-  private def processRows(clientId: String, rows: DataFrame) = {
+  private[connect] def processRows(clientId: String, rows: DataFrame): Unit = {
     val timeZoneId = SQLConf.get.sessionLocalTimeZone
-    val schema =
-      ByteString.copyFrom(ArrowUtils.toArrowSchema(rows.schema, timeZoneId).toByteArray)
-
-    val textSchema = rows.schema.fields.map(f => f.name).mkString("|")
-    val data = rows.collect().map(x => x.toSeq.mkString("|")).mkString("\n")
-    val bbb = proto.Response.CSVBatch.newBuilder
-      .setRowCount(-1)
-      .setData(textSchema ++ "\n" ++ data)
-      .build()
-    val response = proto.Response.newBuilder().setClientId(clientId).setCsvBatch(bbb).build()
-    // Send all the data
-    responseObserver.onNext(response)
+    // Only process up to 10MB of data.
+    val sb = new StringBuilder
+    var rowCount = 0
+    rows.toJSON
+      .toLocalIterator()
```

Review Comment: I moved this back to `collect()`
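The size-capped batching idea in this diff (accumulate JSON-encoded rows until a byte budget is reached, then emit a batch) can be sketched in Python. The 10MB budget comes from the diff's comment; the function name, the newline separator, and everything else here are illustrative, not the actual Scala implementation:

```python
def batch_json_rows(rows, max_bytes=10 * 1024 * 1024):
    """Group JSON-encoded rows into newline-joined batches of at most max_bytes."""
    batch, size = [], 0
    for row in rows:
        row_size = len(row.encode("utf-8")) + 1  # +1 for the newline separator
        if batch and size + row_size > max_bytes:
            # Budget exceeded: emit the current batch and start a new one.
            yield "\n".join(batch)
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        yield "\n".join(batch)

# Example: a 10-byte budget splits three 4-byte rows into two batches.
batches = list(batch_json_rows(['{"a"', '{"b"', '{"c"'], max_bytes=10))
```

Streaming batches like this bounds per-response memory, which is the motivation behind iterating rather than collecting everything at once.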
[GitHub] [spark] grundprinzip commented on a diff in pull request #38320: [SPARK-40857] [CONNECT] Enable configurable GPRC Interceptors.
grundprinzip commented on code in PR #38320: URL: https://github.com/apache/spark/pull/38320#discussion_r1001531871

## core/src/main/resources/error/error-classes.json: ##

```diff
@@ -698,6 +698,23 @@
     ],
     "sqlState" : "22023"
   },
+  "SPARK_CONNECT" : {
```

Review Comment: Done.
[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on code in PR #38312: URL: https://github.com/apache/spark/pull/38312#discussion_r1001531184

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala: ##

```diff
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
         } else {
           TimestampNTZType
         }
+      case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>
```

Review Comment: So this is not a valid workaround. @sunchao how do you feel about restoring the earlier, useful behaviour?
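The unit-based dispatch being discussed (map MILLIS/MICROS to timestamp types, and handle NANOS specially instead of failing) can be sketched as follows. The string type names are illustrative stand-ins, not Spark's converter code, and the LongType fallback for NANOS is the pre-regression behaviour this thread argues for restoring:

```python
def convert_timestamp_unit(unit: str, is_adjusted_to_utc: bool) -> str:
    # Map a Parquet TIMESTAMP logical-type unit to a (stand-in) Spark type name.
    if unit in ("MILLIS", "MICROS"):
        return "TimestampType" if is_adjusted_to_utc else "TimestampNTZType"
    if unit == "NANOS":
        # Fall back to reading the raw 64-bit value rather than raising
        # an unsupported-type error for nanosecond timestamps.
        return "LongType"
    raise ValueError(f"unsupported timestamp unit: {unit}")
```

Without the NANOS branch, files written with nanosecond timestamps become unreadable, which is the regression under discussion.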