[GitHub] [spark] amaliujia commented on pull request #38345: [SPARK-40879][CONNECT] Support Join UsingColumns in proto

2022-10-21 Thread GitBox


amaliujia commented on PR #38345:
URL: https://github.com/apache/spark/pull/38345#issuecomment-1287635120

   R: @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #38345: [SPARK-40879][CONNECT] Support Join UsingColumns in proto

2022-10-21 Thread GitBox


amaliujia opened a new pull request, #38345:
URL: https://github.com/apache/spark/pull/38345

   
   
   ### What changes were proposed in this pull request?
   
   I was working on refactoring the Connect proto tests from the Catalyst DSL to 
the DataFrame API, and identified that Join in Connect does not support 
`UsingColumns`. This is a gap between the Connect proto and the DataFrame API. It 
also blocks the refactoring work because without `UsingColumns`, there is no 
compatible DataFrame join API that we can convert the existing tests to. 
   
   This PR adds support for Join's `UsingColumns`.
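   For context, `UsingColumns` corresponds to the DataFrame join variant that takes a list of shared column names and emits each join column only once. A minimal pure-Python sketch of those semantics (hypothetical helper, not Spark code):

   ```python
   def join_using(left, right, using_columns):
       """Inner-join two lists of dict rows on the shared columns,
       USING-join style: each join column appears once in the output."""
       out = []
       for l in left:
           for r in right:
               if all(l[c] == r[c] for c in using_columns):
                   row = {c: l[c] for c in using_columns}  # join columns emitted once
                   row.update({k: v for k, v in l.items() if k not in using_columns})
                   row.update({k: v for k, v in r.items() if k not in using_columns})
                   out.append(row)
       return out

   left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
   right = [{"id": 1, "b": "p"}]
   assert join_using(left, right, ["id"]) == [{"id": 1, "a": "x", "b": "p"}]
   ```

   In the real DataFrame API this corresponds to `df1.join(df2, ["id"])`, which is the form the proto gap prevented expressing.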
   
   ### Why are the changes needed?
   
   1. Improve API coverage.
   2. Unblock testing refactoring. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   UT


For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6'

2022-10-21 Thread GitBox


amaliujia commented on PR #38338:
URL: https://github.com/apache/spark/pull/38338#issuecomment-1287632262

   Thanks for working on this!





[GitHub] [spark] zhengruifeng commented on pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6'

2022-10-21 Thread GitBox


zhengruifeng commented on PR #38338:
URL: https://github.com/apache/spark/pull/38338#issuecomment-1287607808

   Merged into master





[GitHub] [spark] zhengruifeng closed pull request #38338: [SPARK-40878][INFRA] pin 'grpcio==1.48.1' 'protobuf==4.21.6'

2022-10-21 Thread GitBox


zhengruifeng closed pull request #38338: [SPARK-40878][INFRA] pin 
'grpcio==1.48.1' 'protobuf==4.21.6'
URL: https://github.com/apache/spark/pull/38338





[GitHub] [spark] AmplabJenkins commented on pull request #38344: [SPARK-40777][SQL][SPARK-PROTOBUF] : Protobuf import support and move error-classes.

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38344:
URL: https://github.com/apache/spark/pull/38344#issuecomment-1287607239

   Can one of the admins verify this patch?





[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][SPARK-PROTOBUF] : Protobuf import support and move error-classes.

2022-10-21 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1002318758


##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##
@@ -62,14 +63,16 @@ object SchemaConverters {
   case STRING => Some(StringType)
   case BYTE_STRING => Some(BinaryType)
   case ENUM => Some(StringType)
-  case MESSAGE if fd.getMessageType.getName == "Duration" =>
+  case MESSAGE if fd.getMessageType.getName == "Duration" &&

Review Comment:
   @rangadi My assumption is that users should be able to use the Timestamp and 
Duration message types with different fields. 
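   The shape of that check can be sketched in plain Python (hypothetical names; the real code matches on the protobuf descriptor in Scala): a message should map to the well-known type only when both its name and its field layout match.

   ```python
   # Hedged sketch, not Spark code: a well-known-type check must look at the
   # field layout as well as the message name, since users may define their own
   # message also named "Duration" or "Timestamp" with different fields.
   WELL_KNOWN_FIELDS = {
       "Duration": {"seconds", "nanos"},
       "Timestamp": {"seconds", "nanos"},
   }

   def is_well_known(message_name, field_names):
       expected = WELL_KNOWN_FIELDS.get(message_name)
       return expected is not None and set(field_names) == expected

   assert is_well_known("Duration", ["seconds", "nanos"])
   assert not is_well_known("Duration", ["secs"])  # same name, wrong fields
   assert not is_well_known("MyMessage", ["seconds", "nanos"])
   ```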



##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
##
@@ -196,27 +195,32 @@ private[sql] object ProtobufUtils extends Logging {
   fileDescriptorSet = DescriptorProtos.FileDescriptorSet.parseFrom(dscFile)
 } catch {
   case ex: InvalidProtocolBufferException =>
-// TODO move all the exceptions to 
core/src/main/resources/error/error-classes.json
-throw new RuntimeException("Error parsing descriptor byte[] into 
Descriptor object", ex)
+throw QueryCompilationErrors.descrioptorParseError(ex)
   case ex: IOException =>
-throw new RuntimeException(
-  "Error reading Protobuf descriptor file at path: " +
-descFilePath,
-  ex)
+throw 
QueryCompilationErrors.cannotFindDescriptorFileError(descFilePath, ex)
 }
 
-val descriptorProto: DescriptorProtos.FileDescriptorProto = 
fileDescriptorSet.getFile(0)
+val descriptorProto: DescriptorProtos.FileDescriptorProto =

Review Comment:
   @rangadi This handles the protobuf import; please let me know if you know of 
a better approach.






[GitHub] [spark] SandishKumarHN opened a new pull request, #38344: [SPARK-40777][SQL][SPARK-PROTOBUF] : Protobuf import support and move error-classes.

2022-10-21 Thread GitBox


SandishKumarHN opened a new pull request, #38344:
URL: https://github.com/apache/spark/pull/38344

   This is the follow-up PR to https://github.com/apache/spark/pull/37972 and 
https://github.com/apache/spark/pull/38212
   
   ### What changes were proposed in this pull request?
   1. Move spark-protobuf error classes to the Spark error-classes 
framework (core/src/main/resources/error/error-classes.json).
   2. Support protobuf imports.
   3. Validate protobuf Timestamp and Duration types.
   
   
   ### Why are the changes needed?
   N/A
   
   ### Does this PR introduce _any_ user-facing change?
   None
   
   ### How was this patch tested?
   Existing tests should cover the validation of this PR.  
   
   CC: @rangadi @mposdev21 
   





[GitHub] [spark] Yikun commented on pull request #38343: [DONT MERGE] Test upgrade v6

2022-10-21 Thread GitBox


Yikun commented on PR #38343:
URL: https://github.com/apache/spark/pull/38343#issuecomment-1287577729

   https://user-images.githubusercontent.com/1736354/197311334-c200479d-bc62-4749-a0d0-11acc0a3c5cc.png
   
   The notify workflow works well; see the results set correctly: 
https://github.com/apache/spark/pull/38343/checks?check_run_id=9041518603
   
   Let's wait for the update-status result.





[GitHub] [spark] Yikun commented on pull request #38342: [SPARK-40870][INFRA] Upgrade docker actions to cleanup warning

2022-10-21 Thread GitBox


Yikun commented on PR #38342:
URL: https://github.com/apache/spark/pull/38342#issuecomment-1287576267

   Just a rebase; let's wait until CI passes.





[GitHub] [spark] Yikun opened a new pull request, #38342: [SPARK-40870][INFRA] Upgrade docker actions to cleanup warning

2022-10-21 Thread GitBox


Yikun opened a new pull request, #38342:
URL: https://github.com/apache/spark/pull/38342

   ### What changes were proposed in this pull request?
   Upgrade the Docker actions to clean up deprecation warnings.
   
   
   ### Why are the changes needed?
   - docker/setup-qemu-action from v1 to v2:
 https://github.com/docker/setup-qemu-action/releases/tag/v2.0.0: cleans up the Node 12 warning
 https://github.com/docker/setup-qemu-action/releases/tag/v2.1.0: cleans up the set-output/save-state warning
   - docker/setup-buildx-action from v1 to v2:
 https://github.com/docker/setup-buildx-action/releases/tag/v2.0.0: cleans up the Node 12 warning
 https://github.com/docker/setup-buildx-action/releases/tag/v2.1.0: cleans up the set-output/save-state warning
   - docker/build-push-action from v2 to v3:
 https://github.com/docker/build-push-action/releases/tag/v3.0.0: cleans up the Node 12 warning
 https://github.com/docker/build-push-action/releases/tag/v3.2.0: cleans up the set-output/save-state warning
   - docker/login-action from v1 to v2:
 https://github.com/docker/login-action/releases/tag/v2.0.0: cleans up the Node 12 warning
 https://github.com/docker/login-action/releases/tag/v2.1.0: cleans up the set-output/save-state warning
   
   ### Does this PR introduce _any_ user-facing change?
   No, dev only
   
   
   ### How was this patch tested?
   CI passed





[GitHub] [spark] HyukjinKwon closed pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

2022-10-21 Thread GitBox


HyukjinKwon closed pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in 
Python UDFs when encryption enabled
URL: https://github.com/apache/spark/pull/38334





[GitHub] [spark] HyukjinKwon commented on pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

2022-10-21 Thread GitBox


HyukjinKwon commented on PR #38334:
URL: https://github.com/apache/spark/pull/38334#issuecomment-1287575193

   I am merging this to master, but is the issue only in the master branch?





[GitHub] [spark] HyukjinKwon closed pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow

2022-10-21 Thread GitBox


HyukjinKwon closed pull request #38341: [SPARK-40871][INFRA] Upgrade 
actions/github-script to v6 and fix notify workflow
URL: https://github.com/apache/spark/pull/38341





[GitHub] [spark] HyukjinKwon commented on pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow

2022-10-21 Thread GitBox


HyukjinKwon commented on PR #38341:
URL: https://github.com/apache/spark/pull/38341#issuecomment-1287574881

   Merged to master.





[GitHub] [spark] Yikun opened a new pull request, #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow

2022-10-21 Thread GitBox


Yikun opened a new pull request, #38341:
URL: https://github.com/apache/spark/pull/38341

   ### What changes were proposed in this pull request?
   Upgrade actions/github-script from v3 to v6 and fix the notify workflow
   
   
   ### Why are the changes needed?
   Node.js 12 actions are deprecated. For more information see: 
https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/.
 
   
   - Since github-script v5, calls change from `github.*` to `github.rest.*`, while 
`request` and `paginate` are unchanged.
 See also https://github.com/actions/github-script#breaking-changes-in-v5
   - Since github-script v6, the runtime is upgraded from Node 12 to Node 16
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, dev only
   
   
   ### How was this patch tested?
   - Because this workflow uses `pull_request_target`, the changes in the current PR do not 
take effect here, so we can only test on a local fork: https://github.com/Yikun/spark/pull/181
   
   The notify workflow works well:
   https://user-images.githubusercontent.com/1736354/197310102-6c709716-8a99-422d-8d38-3f770b6925f0.png
   
   The update status is set to failed as expected:
   https://user-images.githubusercontent.com/1736354/197310119-30332769-0553-4ffa-816c-97a5ec0b3c27.png
   
   And the `See test results` link is set correctly:
   https://github.com/Yikun/spark/pull/181/checks?check_run_id=9029035780
   





[GitHub] [spark] Yikun commented on pull request #38341: [SPARK-40871][INFRA] Upgrade actions/github-script to v6 and fix notify workflow

2022-10-21 Thread GitBox


Yikun commented on PR #38341:
URL: https://github.com/apache/spark/pull/38341#issuecomment-1287572112

   cc @HyukjinKwon 





[GitHub] [spark] zhengruifeng opened a new pull request, #38340: [SPARK-40877][SQL] Reimplement `crosstab` with dataframe operations

2022-10-21 Thread GitBox


zhengruifeng opened a new pull request, #38340:
URL: https://github.com/apache/spark/pull/38340

   ### What changes were proposed in this pull request?
   Reimplement `crosstab` with dataframe operations
   
   
   ### Why are the changes needed?
   1. The existing implementation collects all distinct `(col1, col2)` pairs to the driver, 
while `pivot` only collects the distinct values of `col2`;
   2. Do not truncate the SQL plan with RDD operations.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing UTs and manual checks





[GitHub] [spark] zhengruifeng commented on pull request #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme

2022-10-21 Thread GitBox


zhengruifeng commented on PR #38339:
URL: https://github.com/apache/spark/pull/38339#issuecomment-1287564103

   Merged into master





[GitHub] [spark] zhengruifeng closed pull request #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme

2022-10-21 Thread GitBox


zhengruifeng closed pull request #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] 
Add check command in Readme
URL: https://github.com/apache/spark/pull/38339





[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002289255


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,8 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation
+2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
+3. Run `./dev/generate_protos.sh`

Review Comment:
   Oops, I copied this from my local setup, where the command is not run from the Spark repo 
root directory. 
   
   @zhengruifeng maybe you can update this in the PR where you pin the dependency 
versions?









[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002289219


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,8 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation
+2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
+3. Run `./dev/generate_protos.sh`

Review Comment:
   I fix it in https://github.com/apache/spark/pull/38339






[GitHub] [spark] zhengruifeng opened a new pull request, #38339: [SPARK-40796][CONNECT][DOC][FOLLOW-UP] Add check command in Readme

2022-10-21 Thread GitBox


zhengruifeng opened a new pull request, #38339:
URL: https://github.com/apache/spark/pull/38339

   ### What changes were proposed in this pull request?
   1. Fix the code-generation command;
   2. Add the check command
   
   
   ### Why are the changes needed?
   Fix the wrong command in the README
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users will use the correct command and know how to check the generated files
   
   ### How was this patch tested?
   Doc-only change
   





[GitHub] [spark] zhengruifeng opened a new pull request, #38338: [TEST] pin 'grpcio==1.48.1' 'protobuf==4.21.6'

2022-10-21 Thread GitBox


zhengruifeng opened a new pull request, #38338:
URL: https://github.com/apache/spark/pull/38338

   ### What changes were proposed in this pull request?
   pin 'grpcio==1.48.1' 'protobuf==4.21.6'
   
   ### Why are the changes needed?
   to avoid code-generation conflicts caused by package upgrades
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   CI





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002285288


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,8 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation
+2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
+3. Run `./dev/generate_protos.sh`

Review Comment:
   @amaliujia 






[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002284563


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,8 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation
+2. Run `pip install grpcio==1.48.1 protobuf==4.21.6 mypy-protobuf==3.3.0`
+3. Run `./dev/generate_protos.sh`

Review Comment:
   oops, should be `./connector/connect/dev/generate_protos.sh`






[GitHub] [spark] Yaohua628 commented on pull request #38337: [SPARK-39404][SS][3.3] Minor fix for querying _metadata in streaming

2022-10-21 Thread GitBox


Yaohua628 commented on PR #38337:
URL: https://github.com/apache/spark/pull/38337#issuecomment-1287559471

   cc: @HeartSaVioR @felipepessoto





[GitHub] [spark] Yaohua628 opened a new pull request, #38337: [SPARK-39404][SS][3.3] Minor fix for querying _metadata in streaming

2022-10-21 Thread GitBox


Yaohua628 opened a new pull request, #38337:
URL: https://github.com/apache/spark/pull/38337

   ### What changes were proposed in this pull request?
   Cherry-pick https://github.com/apache/spark/pull/36801
   
   ### Why are the changes needed?
   Cherry-pick https://github.com/apache/spark/pull/36801
   
   
   ### Does this PR introduce _any_ user-facing change?
   N/A
   
   
   ### How was this patch tested?
   N/A
   





[GitHub] [spark] github-actions[bot] closed pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

2022-10-21 Thread GitBox


github-actions[bot] closed pull request #36893: [SPARK-39494][PYTHON] Support 
`createDataFrame` from a list of scalars when schema is not provided
URL: https://github.com/apache/spark/pull/36893





[GitHub] [spark] github-actions[bot] closed pull request #37153: [SPARK-26052] Add type comments to exposed Prometheus metrics

2022-10-21 Thread GitBox


github-actions[bot] closed pull request #37153: [SPARK-26052] Add type comments 
to exposed Prometheus metrics
URL: https://github.com/apache/spark/pull/37153





[GitHub] [spark] zhengruifeng commented on pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng commented on PR #38335:
URL: https://github.com/apache/spark/pull/38335#issuecomment-1287552170

   merged into master, thank you @amaliujia 





[GitHub] [spark] zhengruifeng closed pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng closed pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] 
Improve README for proto generated files in Connect Python client
URL: https://github.com/apache/spark/pull/38335





[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002270515


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,7 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation

Review Comment:
   makes sense. I updated this doc with versions.






[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002269337


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,7 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation

Review Comment:
   `buf`'s version in CI is `1.8.0`, the latest one is `1.9.0`.  I think we also 
need to mention the version









[GitHub] [spark] zhengruifeng commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


zhengruifeng commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002269147


##
python/pyspark/sql/connect/README.md:
##
@@ -49,3 +49,7 @@ To use the release version of Spark Connect:
 ./python/run-tests --testnames 'pyspark.sql.tests.connect.test_connect_basic'
 ```
 
+## Generate proto generated files for the Python client
+1. Install `buf`: https://docs.buf.build/installation

Review Comment:
   I think we also need to mention the related python packages: 
   
   grpcio==1.48.1
   protobuf==4.21.6
   mypy-protobuf==3.3.0
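
A minimal sketch of how such a pin could be checked programmatically. The `PINNED` mapping and `find_mismatches` helper are illustrative inventions, not part of the actual Spark tooling; only the three pinned versions come from the comment above.

```python
# Hypothetical helper: report packages whose installed version differs
# from the pinned version required for reproducible proto codegen.
PINNED = {
    "grpcio": "1.48.1",
    "protobuf": "4.21.6",
    "mypy-protobuf": "3.3.0",
}

def find_mismatches(installed):
    """Return (name, found, wanted) for every package that is missing
    or does not match its pinned version."""
    return [
        (name, installed.get(name), wanted)
        for name, wanted in PINNED.items()
        if installed.get(name) != wanted
    ]
```

For example, passing a dict with `"grpcio": "1.50.0"` would flag only `grpcio`, while an empty dict flags all three packages as missing.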






[GitHub] [spark] zhengruifeng commented on pull request #38313: [SPARK-40849][SS] Async log purge

2022-10-21 Thread GitBox


zhengruifeng commented on PR #38313:
URL: https://github.com/apache/spark/pull/38313#issuecomment-1287547087

   The failed Python codegen check is unrelated to this PR; please rebase to 
make CI green





[GitHub] [spark] allisonwang-db opened a new pull request, #38336: [SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery

2022-10-21 Thread GitBox


allisonwang-db opened a new pull request, #38336:
URL: https://github.com/apache/spark/pull/38336

   
   
   ### What changes were proposed in this pull request?
   This PR updates the `splitSubquery` in `RewriteCorrelatedScalarSubquery` to 
support non-aggregated one-row subquery.
   
   In CheckAnalysis, we allow three types of correlated scalar subquery 
patterns:
   1. SubqueryAlias/Project + Aggregate
   2. SubqueryAlias/Project + Filter + Aggregate
   3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)
   
   
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856
   
   We should support the third case in `splitSubquery` to avoid `Unexpected 
operator` exceptions.
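
To illustrate the third pattern with standard SQL (using SQLite here purely as a stand-in engine; the table and column names are made up for the example): the scalar subquery below is correlated and contains no aggregate, yet can return at most one row per outer row because of `LIMIT 1` — the "LogicalPlan with maxRows <= 1" shape.

```python
import sqlite3

# Illustrative data: t1 is the outer table, t2 is probed by a
# correlated, non-aggregated scalar subquery bounded by LIMIT 1.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE t1 (id INTEGER, a INTEGER);
    CREATE TABLE t2 (id INTEGER, b INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 20);
    INSERT INTO t2 VALUES (1, 100);
    """
)
rows = con.execute(
    "SELECT a, (SELECT b FROM t2 WHERE t2.id = t1.id LIMIT 1) FROM t1 ORDER BY a"
).fetchall()
print(rows)  # [(10, 100), (20, None)]
```

The outer row with no match yields NULL, exactly as a left correlated join rewrite would produce.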
   
   ### Why are the changes needed?
   To fix an issue with correlated subquery rewrite.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   New unit tests.





[GitHub] [spark] sunchao commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-10-21 Thread GitBox


sunchao commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1002187497


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
 } else {
   TimestampNTZType
 }
+  case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit 
== TimeUnit.NANOS =>

Review Comment:
   Make sense. I'm OK with it.
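
As a side note on the precision semantics under discussion: if nanosecond timestamps were down-cast to Spark's microsecond precision rather than surfaced as raw longs, the conversion would be a simple truncation. This is a sketch for illustration, not code from this PR; floor division is used so pre-epoch values still truncate toward minus infinity.

```python
def nanos_to_micros(ns: int) -> int:
    """Truncate a nanosecond epoch value to microsecond precision.

    Floor division (not int(ns / 1000)) keeps negative (pre-1970)
    timestamps rounding toward minus infinity, matching SQL truncation.
    """
    return ns // 1_000

print(nanos_to_micros(1_666_397_000_123_456_789))  # 1666397000123456
print(nanos_to_micros(-1_500))                     # -2
```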






[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002181146


##
dev/check-codegen-python.py:
##
@@ -75,7 +75,9 @@ def check_connect_protos():
 else:
 fail(
 "Generated files for pyspark-connect are out of sync! "
-"Please run ./connector/connect/dev/generate_protos.sh"
+"If you have touched files under 
connector/connect/src/main/protobuf, "
+"please run ./connector/connect/dev/generate_protos.sh. "
+"If you haven't touched any file above, please rebase your PR 
against main branch."

Review Comment:
   This is to help pass CI/CD. CI/CD syncs to the latest main branch, and if I 
understand it right, then we hit the scenario above: 
   
   https://user-images.githubusercontent.com/1938382/197287176-e92cf3f5-7dc8-4273-bc8d-e3e0272327b3.png
   
   
   If we don't consider CI/CD, local development has no issue.






[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002181146


##
dev/check-codegen-python.py:
##
@@ -75,7 +75,9 @@ def check_connect_protos():
 else:
 fail(
 "Generated files for pyspark-connect are out of sync! "
-"Please run ./connector/connect/dev/generate_protos.sh"
+"If you have touched files under 
connector/connect/src/main/protobuf, "
+"please run ./connector/connect/dev/generate_protos.sh. "
+"If you haven't touched any file above, please rebase your PR 
against main branch."

Review Comment:
   This is to help pass CI/CD. CI/CD syncs to the latest main branch, if I 
understand it right: 
   
   https://user-images.githubusercontent.com/1938382/197287176-e92cf3f5-7dc8-4273-bc8d-e3e0272327b3.png
   
   
   If we don't consider CI/CD, local development has no issue.






[GitHub] [spark] rangadi commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


rangadi commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002179251


##
dev/check-codegen-python.py:
##
@@ -75,7 +75,9 @@ def check_connect_protos():
 else:
 fail(
 "Generated files for pyspark-connect are out of sync! "
-"Please run ./connector/connect/dev/generate_protos.sh"
+"If you have touched files under 
connector/connect/src/main/protobuf, "
+"please run ./connector/connect/dev/generate_protos.sh. "
+"If you haven't touched any file above, please rebase your PR 
against main branch."

Review Comment:
   Is this for CI/CD or for local tests? Rebasing helps the local branch, yes, 
but not CI/CD.






[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002175956


##
dev/check-codegen-python.py:
##
@@ -75,7 +75,9 @@ def check_connect_protos():
 else:
 fail(
 "Generated files for pyspark-connect are out of sync! "
-"Please run ./connector/connect/dev/generate_protos.sh"
+"If you have touched files under 
connector/connect/src/main/protobuf, "
+"please run ./connector/connect/dev/generate_protos.sh. "
+"If you haven't touched any file above, please rebase your PR 
against main branch."

Review Comment:
   Here are two layers of things: proto and its generated files.
   
   This script generates files from the proto itself, and the proto might be 
updated in the main branch, so the generated files upstream are now different 
from the files in this local branch; thus the check fails.
   
   When merging with the main branch, the generated files might not encounter 
conflicts, but I guess that does not mean they match when one tries to 
re-generate them from the proto.
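
The out-of-sync condition described above amounts to a content-hash comparison between the checked-in generated files and a fresh regeneration. A simplified, hypothetical sketch of such a check follows (the real script lives in `dev/check-codegen-python.py`; `tree_digest` and `generated_in_sync` are invented names for illustration):

```python
import hashlib
import pathlib

def tree_digest(root: str) -> str:
    """Hash every file under root (relative path + contents) into one
    digest, so any drift in the generated output changes the result."""
    h = hashlib.sha256()
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

def generated_in_sync(checked_in: str, regenerated: str) -> bool:
    """True when the checked-in tree matches a fresh regeneration."""
    return tree_digest(checked_in) == tree_digest(regenerated)
```

With this shape, regenerating from a newer proto in main immediately flips the comparison, which is why rebasing (picking up both the new proto and its regenerated files) clears the failure.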






[GitHub] [spark] amaliujia commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002175956


##
dev/check-codegen-python.py:
##
@@ -75,7 +75,9 @@ def check_connect_protos():
 else:
 fail(
 "Generated files for pyspark-connect are out of sync! "
-"Please run ./connector/connect/dev/generate_protos.sh"
+"If you have touched files under 
connector/connect/src/main/protobuf, "
+"please run ./connector/connect/dev/generate_protos.sh. "
+"If you haven't touched any file above, please rebase your PR 
against main branch."

Review Comment:
   This script generates files from the proto itself, and the proto might be 
updated in the main branch, so the generated files upstream are now different 
from the files in this local branch; thus the check fails.






[GitHub] [spark] rangadi commented on a diff in pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


rangadi commented on code in PR #38335:
URL: https://github.com/apache/spark/pull/38335#discussion_r1002173180


##
dev/check-codegen-python.py:
##
@@ -75,7 +75,9 @@ def check_connect_protos():
 else:
 fail(
 "Generated files for pyspark-connect are out of sync! "
-"Please run ./connector/connect/dev/generate_protos.sh"
+"If you have touched files under 
connector/connect/src/main/protobuf, "
+"please run ./connector/connect/dev/generate_protos.sh. "
+"If you haven't touched any file above, please rebase your PR 
against main branch."

Review Comment:
   How does rebase help? I think ci/cd merges with master.






[GitHub] [spark] amaliujia commented on pull request #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia commented on PR #38335:
URL: https://github.com/apache/spark/pull/38335#issuecomment-1287367746

   R: @zhengruifeng @HyukjinKwon 
   cc: @HeartSaVioR @rangadi 





[GitHub] [spark] amaliujia opened a new pull request, #38335: [SPARK-40796][CONNECT][FOLLOW-UP] Improve README for proto generated files in Connect Python client

2022-10-21 Thread GitBox


amaliujia opened a new pull request, #38335:
URL: https://github.com/apache/spark/pull/38335

   
   
   ### What changes were proposed in this pull request?
   
   1. Improve the README on how to install the `buf` dependency and then run the 
proto generation script.
   2. Improve the message of the out-of-sync check script.
   
   ### Why are the changes needed?
   
   Improve developer experience, both when touching Connect proto files and 
when not touching them but having the master branch out of sync with the local 
branch.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing checks.





[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

2022-10-21 Thread GitBox


amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1002094592


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: 
SparkSession) {
   transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): 
LogicalPlan = {
+val child = transformRelation(rel.getInput)
+
+rel.getFunctionCase match {
+  case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+StatFunctions
+  .summary(Dataset.ofRows(session, child), 
rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   +1! 
   
   I don't know how to add a new plan. It would be very useful to have a PR as 
an example.






[GitHub] [spark] amaliujia commented on pull request #38320: [SPARK-40857] [CONNECT] Enable configurable GPRC Interceptors.

2022-10-21 Thread GitBox


amaliujia commented on PR #38320:
URL: https://github.com/apache/spark/pull/38320#issuecomment-1287316245

   LGTM





[GitHub] [spark] amaliujia commented on pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.

2022-10-21 Thread GitBox


amaliujia commented on PR #38300:
URL: https://github.com/apache/spark/pull/38300#issuecomment-1287280054

   Thanks for working on this!





[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-10-21 Thread GitBox


aokolnychyi commented on code in PR #38004:
URL: https://github.com/apache/spark/pull/38004#discussion_r1002043949


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWriter.java:
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.write;
+
+import java.io.IOException;
+
+import org.apache.spark.annotation.Experimental;
+
+/**
+ * A data writer returned by {@link DeltaWriterFactory#createWriter(int, 
long)} and is
+ * responsible for writing a delta of rows.
+ *
+ * @since 3.4.0
+ */
+@Experimental
+public interface DeltaWriter<T> extends DataWriter<T> {

Review Comment:
   Let me take a closer look on Monday.






[GitHub] [spark] jerrypeng commented on pull request #38313: [SPARK-40849][SS] Async log purge

2022-10-21 Thread GitBox


jerrypeng commented on PR #38313:
URL: https://github.com/apache/spark/pull/38313#issuecomment-1287254708

   thanks @HeartSaVioR !





[GitHub] [spark] Yaohua628 commented on pull request #36801: [SPARK-39404][SS] Minor fix for querying `_metadata` in streaming

2022-10-21 Thread GitBox


Yaohua628 commented on PR #36801:
URL: https://github.com/apache/spark/pull/36801#issuecomment-1287241165

   > @HeartSaVioR @Yaohua628, I don't see this commit in 3.3.1-rc4 branch, 
while we have more recent commits, e.g.: 
[946a960](https://github.com/apache/spark/commit/946a96022a54f298fb6081102434fe575126266b)
 in RC4
   > 
   > I'm wondering if there is any particular reason not to include this fix.
   > 
   > Thanks
   
   Ah, good catch! I guess we missed merging it to 3.3. I will have a backport 
PR shortly cc @HeartSaVioR 





[GitHub] [spark] felipepessoto commented on pull request #36801: [SPARK-39404][SS] Minor fix for querying `_metadata` in streaming

2022-10-21 Thread GitBox


felipepessoto commented on PR #36801:
URL: https://github.com/apache/spark/pull/36801#issuecomment-1287233897

   @HeartSaVioR @Yaohua628, I don't see this commit in 3.3.1-rc4 branch, while 
we have more recent commits, e.g.: 
https://github.com/apache/spark/commit/946a96022a54f298fb6081102434fe575126266b 
in RC4
   
   I'm wondering if there is any particular reason not to include this fix.
   
   Thanks





[GitHub] [spark] philwalk commented on pull request #38228: [SPARK-40739][SPARK-40738] Fixes for cygwin/msys2/mingw sbt build and bash scripts

2022-10-21 Thread GitBox


philwalk commented on PR #38228:
URL: https://github.com/apache/spark/pull/38228#issuecomment-1287213144

   When I try to add repository via "Manage Actions Access", I get this:
   `Can't grant permissions to unknown collaborator ` 





[GitHub] [spark] philwalk commented on pull request #38228: [SPARK-40739][SPARK-40738] Fixes for cygwin/msys2/mingw sbt build and bash scripts

2022-10-21 Thread GitBox


philwalk commented on PR #38228:
URL: https://github.com/apache/spark/pull/38228#issuecomment-1287202796

   > @vitaliili-db Did you set any permission by yourself manually before? I 
just had a try (forked another apache project); the permission is `Read and 
Write` by default.
   
   The `Workflow permissions` are already set for the default: "Read and Write 
permissions".
   I will do more experiments ...





[GitHub] [spark] tomtongue commented on a diff in pull request #38268: [SPARK-40804][SQL] Missing handling a catalog name in destination tables in RenameTableExec

2022-10-21 Thread GitBox


tomtongue commented on code in PR #38268:
URL: https://github.com/apache/spark/pull/38268#discussion_r997128018


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableRenameSuite.scala:
##
@@ -34,4 +34,15 @@ class AlterTableRenameSuite extends 
command.AlterTableRenameSuiteBase with Comma
   }
 }
   }
+
+  test("include catalog in the destination table") {
+withNamespaceAndTable("ns", "dst_tbl", catalog) { dst =>
+  val src = dst.replace("dst", "src")
+  sql(s"CREATE TABLE $src (c0 INT) $defaultUsing")
+  sql(s"INSERT INTO $src SELECT 0")
+
+  sql(s"ALTER TABLE $src RENAME TO $catalog.ns.dst_tbl")

Review Comment:
   Thanks for checking this. I added the different-catalog case, updated the 
current test, and pushed the changes. @amaliujia could you check the latest one 
when you have time?






[GitHub] [spark] tomtongue commented on a diff in pull request #38268: [SPARK-40804][SQL] Missing handling a catalog name in destination tables in RenameTableExec

2022-10-21 Thread GitBox


tomtongue commented on code in PR #38268:
URL: https://github.com/apache/spark/pull/38268#discussion_r997128018


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableRenameSuite.scala:
##
@@ -34,4 +34,15 @@ class AlterTableRenameSuite extends 
command.AlterTableRenameSuiteBase with Comma
   }
 }
   }
+
+  test("include catalog in the destination table") {
+withNamespaceAndTable("ns", "dst_tbl", catalog) { dst =>
+  val src = dst.replace("dst", "src")
+  sql(s"CREATE TABLE $src (c0 INT) $defaultUsing")
+  sql(s"INSERT INTO $src SELECT 0")
+
+  sql(s"ALTER TABLE $src RENAME TO $catalog.ns.dst_tbl")

Review Comment:
   Thanks for checking this. I added the different catalog case, updated the 
current test, and pushed the changes. @amaliujia could you check the latest one 
when you have time?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a diff in pull request #38277: [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is enab

2022-10-21 Thread GitBox


sunchao commented on code in PR #38277:
URL: https://github.com/apache/spark/pull/38277#discussion_r1001963440


##
sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java:
##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.fs.ContentSummary;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+/**
+ * Delegate for SymlinkTextInputFormat, created to address SPARK-40815.
+ * Fixes an issue where SymlinkTextInputFormat returns empty splits, which could result in
+ * a correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled.
+ *
+ * In this class, we update the split start and length to match the target file input, thus fixing
+ * the issue.
+ */
+@SuppressWarnings("deprecation")
+public class DelegateSymlinkTextInputFormat extends SymlinkTextInputFormat {
+
+  public static class DelegateSymlinkTextInputSplit extends FileSplit {
+private final SymlinkTextInputSplit split;
+
+public DelegateSymlinkTextInputSplit() {
+  super((Path) null, 0, 0, (String[]) null);
+  split = new SymlinkTextInputSplit();
+}
+
+public DelegateSymlinkTextInputSplit(Path symlinkPath, SymlinkTextInputSplit split) throws IOException {
+  // It is fine to set start and length to the target file split because
+  // SymlinkTextInputFormat maintains 1-1 mapping between SymlinkTextInputSplit and FileSplit.
+  super(symlinkPath,
+split.getTargetSplit().getStart(),
+split.getTargetSplit().getLength(),
+split.getTargetSplit().getLocations());
+  this.split = split;
+}
+
+/**
+ * Returns delegate input split.
+ */
+private SymlinkTextInputSplit getSplit() {
+  return split;
+}
+
+@Override
+public void write(DataOutput out) throws IOException {
+  super.write(out);

Review Comment:
   hmm won't this call `FileSplit.write` twice?
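   The double-serialization concern can be illustrated with a minimal sketch 
(Python, with hypothetical class names — not the actual Hadoop API): when a 
delegate's `write` calls both the parent's `write` and the wrapped object's 
`write`, the same fields land in the stream twice, and a reader consuming the 
stream positionally goes out of sync.

```python
class FileSplit:
    """Stand-in for a Writable split that serializes its fields in order."""
    def __init__(self, path, start, length):
        self.path, self.start, self.length = path, start, length

    def write(self, out):
        # Append the split fields to the output "stream".
        out.extend([self.path, self.start, self.length])

class DelegateSplit(FileSplit):
    """Delegate that wraps another split (hypothetical, mirrors the concern)."""
    def __init__(self, path, start, length):
        super().__init__(path, start, length)
        self.wrapped = FileSplit(path, start, length)

    def write(self, out):
        super().write(out)       # writes path, start, length ...
        self.wrapped.write(out)  # ... and then writes them all again

out = []
DelegateSplit("s3://bucket/file", 0, 128).write(out)
# Every field appears twice, so a positional reader would misparse the stream.
assert out == ["s3://bucket/file", 0, 128] * 2
```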



##
sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java:
##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.fs.ContentSummary;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+/**
+ * Delegate for SymlinkTextInputFormat, created to address SPARK-40815.
+ * Fixes an issue where SymlinkTextInputFormat returns empty splits which 
could result in
+ * the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled.
+ *
+ * In this class, we update the split start and length to match the target 
file input thus fixing
+ * the issue.
+ */
+@SuppressWarnings("deprecation")
+public class 

[GitHub] [spark] peter-toth opened a new pull request, #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

2022-10-21 Thread GitBox


peter-toth opened a new pull request, #38334:
URL: https://github.com/apache/spark/pull/38334

   ### What changes were proposed in this pull request?
   This PR fixes a bug in broadcast handling in `PythonRunner` when encryption 
is enabled. Due to this bug the following pyspark script:
   ```
   bin/pyspark --conf spark.io.encryption.enabled=true
   
   ...
   
   bar = {"a": "aa", "b": "bb"}
   foo = spark.sparkContext.broadcast(bar)
   spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
   spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
   ```
   fails with:
   ```
   22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 
0)/ 1]
   org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
 File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 811, in main
   func, profiler, deserializer, serializer = read_command(pickleSer, 
infile)
 File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 87, in read_command
   command = serializer._read_with_length(file)
 File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 173, in _read_with_length
   return self.loads(obj)
 File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 471, in loads
   return cloudpickle.loads(obj, encoding=encoding)
   EOFError: Ran out of input
   ```
   The reason for this failure is that we have multiple Python UDFs referencing 
the same broadcast, and in the current code:
   
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L385-L420
   the number of broadcasts (`cnt`) is correct (1), but the broadcast id is 
serialized twice from the JVM to Python, corrupting the next item that Python 
expects from the JVM side.
   
   Please note that the example above works in Spark 3.3 without this fix. That 
is because https://github.com/apache/spark/pull/36121 in Spark 3.4 modified 
`ExpressionSet`, and so `udfs` in `ExtractPythonUDFs`:
   
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala#L239-L242
   changed from `Stream` to `Vector`. When `broadcastVars` (and so 
`idsAndFiles`) is a `Stream`, the example accidentally works, because the 
broadcast id is written to `dataOut` only once (`oldBids.add(id)` in 
`idsAndFiles.foreach` is called before the 2nd item is calculated in 
`broadcastVars.flatMap`). But that doesn't mean 
https://github.com/apache/spark/pull/36121 introduced the regression: 
`EncryptedPythonBroadcastServer` shouldn't serve the broadcast data twice 
(which it currently does, unnoticed), as that could break other cases where 
more than one broadcast is used in UDFs. 
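   The `Stream`-vs-`Vector` accident described above hinges on when the dedup 
filter runs relative to the `oldBids` mutation. A hedged Python analogue (a 
generator plays the role of the lazy `Stream`, a list the eager `Vector`; the 
names are illustrative, not Spark's API):

```python
def write_broadcast_ids(broadcast_vars, old_bids, lazy):
    """Mimics the flatMap-with-filter over broadcasts followed by a foreach
    that both writes the id and records it in old_bids."""
    ids_and_files = (
        (bid, f"file_{bid}") for bid in broadcast_vars if bid not in old_bids
    )
    if not lazy:
        # Eager (Vector-like): the filter runs before any id is recorded.
        ids_and_files = list(ids_and_files)
    written = []
    for bid, _ in ids_and_files:
        written.append(bid)   # "serialize" the id to the Python worker
        old_bids.add(bid)     # dedup bookkeeping happens too late when eager
    return written

# Eager: the duplicate id is written twice -> the stream gets corrupted.
assert write_broadcast_ids([42, 42], set(), lazy=False) == [42, 42]
# Lazy: the filter observes old_bids mid-iteration and skips the duplicate.
assert write_broadcast_ids([42, 42], set(), lazy=True) == [42]
```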
   
   ### Why are the changes needed?
   To fix a bug.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added new UT.
   





[GitHub] [spark] bjornjorgensen commented on pull request #38262: [SPARK-40801][BUILD] Upgrade `Apache commons-text` to 1.10

2022-10-21 Thread GitBox


bjornjorgensen commented on PR #38262:
URL: https://github.com/apache/spark/pull/38262#issuecomment-1287127138

   @dongjoon-hyun and @wangyum this is a [9.8 
CRITICAL](https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?name=CVE-2022-42889&vector=AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H&version=3.1&source=NIST)
 so users [are already asking if we can get this into other 
branches](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-40861). 





[GitHub] [spark] AmplabJenkins commented on pull request #38324: [Temp] Protobuf generate V2 and V3 protos and extend tests.

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38324:
URL: https://github.com/apache/spark/pull/38324#issuecomment-1287089218

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #38330: [SPARK-40868][SQL] Avoid introducing too many partitions when bucketed scan disabled by sql planner

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38330:
URL: https://github.com/apache/spark/pull/38330#issuecomment-1287089093

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #38327: [SPARK-40812][CONNECT][PYTHON][FOLLOW-UP] Improve Deduplicate in Python client

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38327:
URL: https://github.com/apache/spark/pull/38327#issuecomment-1287089151

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #38331: [SPARK-40869][K8S] Resource name prefix should not start with a -

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38331:
URL: https://github.com/apache/spark/pull/38331#issuecomment-1287089036

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #38332: [WIP][SPARK-40750][SQL] Migrate type check failures of math expressions onto error classes

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38332:
URL: https://github.com/apache/spark/pull/38332#issuecomment-1287088983

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-10-21 Thread GitBox


AmplabJenkins commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1287088925

   Can one of the admins verify this patch?





[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-10-21 Thread GitBox


mridulm commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1287007388

   +CC @otterc 





[GitHub] [spark] codewithmehul commented on pull request #31147: [SPARK-32380][SQL]fixed spark3.0 access hive table while data in hbase problem

2022-10-21 Thread GitBox


codewithmehul commented on PR #31147:
URL: https://github.com/apache/spark/pull/31147#issuecomment-1287007207

   Is this change available in an official release?





[GitHub] [spark] srowen commented on pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12

2022-10-21 Thread GitBox


srowen commented on PR #38328:
URL: https://github.com/apache/spark/pull/38328#issuecomment-1286999833

   Merged to master





[GitHub] [spark] srowen closed pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12

2022-10-21 Thread GitBox


srowen closed pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard 
metrics 4.2.12
URL: https://github.com/apache/spark/pull/38328





[GitHub] [spark] srowen commented on pull request #38329: [SPARK-40865][BUILD] Upgrade jodatime to 2.12.0

2022-10-21 Thread GitBox


srowen commented on PR #38329:
URL: https://github.com/apache/spark/pull/38329#issuecomment-1286998754

   Merged to master





[GitHub] [spark] srowen closed pull request #38329: [SPARK-40865][BUILD] Upgrade jodatime to 2.12.0

2022-10-21 Thread GitBox


srowen closed pull request #38329: [SPARK-40865][BUILD] Upgrade jodatime to 
2.12.0
URL: https://github.com/apache/spark/pull/38329





[GitHub] [spark] LucaCanali commented on a diff in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Python UDFs using SQL metrics

2022-10-21 Thread GitBox


LucaCanali commented on code in PR #33559:
URL: https://github.com/apache/spark/pull/33559#discussion_r1001798239


##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala:
##
@@ -94,6 +94,8 @@ case class AggregateInPandasExec(
   }
 
   override protected def doExecute(): RDD[InternalRow] = {
+metrics // force lazy init at driver

Review Comment:
   Good point, I have now removed it, thanks.






[GitHub] [spark] steveloughran commented on a diff in pull request #37075: [SparkConnect] Initial Protobuf Definitions

2022-10-21 Thread GitBox


steveloughran commented on code in PR #37075:
URL: https://github.com/apache/spark/pull/37075#discussion_r1001761745


##
pom.xml:
##
@@ -116,7 +116,7 @@
 2.17.2
 
 3.3.3
-2.5.0
+3.21.1

Review Comment:
   Shaded, yes. If unshaded, then if you update protobuf.jar, all .class files 
compiled with the older version of protobuf are unlikely to link, let alone work.






[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38288: [SPARK-40821][SQL][CORE][PYTHON][SS] Introduce window_time function to extract event time from the window column

2022-10-21 Thread GitBox


HeartSaVioR commented on code in PR #38288:
URL: https://github.com/apache/spark/pull/38288#discussion_r1001755349


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -4201,6 +4219,73 @@ object SessionWindowing extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Resolves the window_time expression which extracts the correct window time 
from the
+ * window column generated as the output of the window aggregating operators. 
The
+ * window column is of type struct { start: TimestampType, end: TimestampType 
}.
+ * The correct window time for further aggregations is window.end - 1.
+ * */
+object ResolveWindowTime extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp 
{
+case p: LogicalPlan if p.children.size == 1 =>
+  val child = p.children.head
+  val windowTimeExpressions =
+p.expressions.flatMap(_.collect { case w: WindowTime => w }).toSet
+
+  if (windowTimeExpressions.size == 1 &&

Review Comment:
   The test case should construct two different windows and call window_time 
for each window to "fail".






[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38288: [SPARK-40821][SQL][CORE][PYTHON][SS] Introduce window_time function to extract event time from the window column

2022-10-21 Thread GitBox


HeartSaVioR commented on code in PR #38288:
URL: https://github.com/apache/spark/pull/38288#discussion_r1001749087


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala:
##
@@ -575,4 +575,61 @@ class DataFrameTimeWindowingSuite extends QueryTest with 
SharedSparkSession {
   validateWindowColumnInSchema(schema2, "window")
 }
   }
+
+  test("window_time function on raw window column") {
+val df = Seq(
+  ("2016-03-27 19:38:18"), ("2016-03-27 19:39:25")
+).toDF("time")
+
+checkAnswer(
+  df.select(window($"time", "10 seconds").as("window"))
+.select(
+  $"window.end".cast("string"),
+  window_time($"window").cast("string")
+),
+  Seq(
+Row("2016-03-27 19:38:20", "2016-03-27 19:38:19.99"),
+Row("2016-03-27 19:39:30", "2016-03-27 19:39:29.99")
+  )
+)
+  }
+
+  test("2 window_time functions on raw window column") {
+val df = Seq(
+  ("2016-03-27 19:38:18"), ("2016-03-27 19:39:25")
+).toDF("time")
+
+checkAnswer(
+  df.select(window($"time", "10 seconds").as("window"))
+.select(
+  $"window.end".cast("string"),
+  window_time($"window").cast("string").as("window1"),

Review Comment:
   OK, this was pretty straightforward... Refer to the definition of WindowTime:
   
   `case class WindowTime(windowColumn: Expression)`
   
   We apply `.toSet` to collect the instances with deduplication, so as long as 
the windowColumn resolves to the same column, the expressions are considered the same.
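   A hedged Python analogue of that case-class/structural-equality 
deduplication (the frozen dataclass plays the role of Scala's case class, and 
the string stands in for a resolved column reference):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WindowTime:
    window_column: str  # stand-in for a resolved column reference

# Two expressions over the same resolved column compare equal by structure,
# so a set keeps only one of them -- the size-1 check passes.
exprs = {WindowTime("window"), WindowTime("window")}
assert len(exprs) == 1

# Expressions over two different windows survive deduplication,
# which is what would make the size check fail.
exprs2 = {WindowTime("window_a"), WindowTime("window_b")}
assert len(exprs2) == 2
```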






[GitHub] [spark] gaoyajun02 opened a new pull request, #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-10-21 Thread GitBox


gaoyajun02 opened a new pull request, #38333:
URL: https://github.com/apache/spark/pull/38333

   ### What changes were proposed in this pull request?
   A large number of shuffle tests in our cluster show that bad nodes with 
chunk corruption have a probability of fetching zero-size shuffle chunks. In 
this case, we can fall back to the original shuffle blocks.
   
   
   ### Why are the changes needed?
   When the reduce task receives merged data with a zero-size buffer error, we 
let it fall back to the original shuffle blocks. After verification, these 
blocks can be read successfully without a shuffle retry.
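   The fallback described above can be sketched as follows (Python, with 
hypothetical names — not Spark's actual fetch code paths):

```python
def fetch(block_id):
    """Hypothetical fetch of a single original shuffle block."""
    return f"data({block_id})".encode()

def fetch_with_fallback(chunk_bytes, original_blocks):
    """If the push-merged chunk comes back zero-size, fall back to fetching
    the original (un-merged) shuffle blocks instead of failing the task."""
    if len(chunk_bytes) > 0:
        return [chunk_bytes]  # merged chunk is usable as-is
    # Zero-size chunk: re-fetch the original blocks that were merged into it.
    return [fetch(b) for b in original_blocks]

# Zero-size merged chunk -> read the original blocks instead, no retry needed.
assert fetch_with_fallback(b"", ["shuffle_0_1_2", "shuffle_0_3_2"]) == [
    b"data(shuffle_0_1_2)", b"data(shuffle_0_3_2)"
]
# A non-empty chunk is consumed directly.
assert fetch_with_fallback(b"\x01\x02", []) == [b"\x01\x02"]
```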
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   UT





[GitHub] [spark] HyukjinKwon closed pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.

2022-10-21 Thread GitBox


HyukjinKwon closed pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON 
encoding until we have Arrow collection.
URL: https://github.com/apache/spark/pull/38300





[GitHub] [spark] panbingkun opened a new pull request, #38332: [SPARK-40750][SQL] Migrate type check failures of math expressions onto error classes

2022-10-21 Thread GitBox


panbingkun opened a new pull request, #38332:
URL: https://github.com/apache/spark/pull/38332

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   - Add new UT.
   - Update Existed UT.
   - Pass GA.





[GitHub] [spark] HyukjinKwon commented on pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.

2022-10-21 Thread GitBox


HyukjinKwon commented on PR #38300:
URL: https://github.com/apache/spark/pull/38300#issuecomment-1286853167

   All related tests passed.
   
   Merged to master.





[GitHub] [spark] zhengruifeng closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

2022-10-21 Thread GitBox


zhengruifeng closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] 
Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`
URL: https://github.com/apache/spark/pull/38318





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

2022-10-21 Thread GitBox


zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1001680211


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: 
SparkSession) {
   transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): 
LogicalPlan = {
+val child = transformRelation(rel.getInput)
+
+rel.getFunctionCase match {
+  case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+StatFunctions
+  .summary(Dataset.ofRows(session, child), 
rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   Yes, then it will have more optimization space. Let us add a new plan for it, 
thanks.






[GitHub] [spark] HyukjinKwon closed pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation

2022-10-21 Thread GitBox


HyukjinKwon closed pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add 
missing s prefix to enable string interpolation
URL: https://github.com/apache/spark/pull/38307





[GitHub] [spark] HyukjinKwon commented on pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation

2022-10-21 Thread GitBox


HyukjinKwon commented on PR #38307:
URL: https://github.com/apache/spark/pull/38307#issuecomment-1286821880

   Merged to master.





[GitHub] [spark] EnricoMi commented on pull request #38307: [MINOR][CORE][SQL][FOLLOWUP] Add missing s prefix to enable string interpolation

2022-10-21 Thread GitBox


EnricoMi commented on PR #38307:
URL: https://github.com/apache/spark/pull/38307#issuecomment-1286806013

   Rebased with master, all green.





[GitHub] [spark] zhengruifeng commented on pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

2022-10-21 Thread GitBox


zhengruifeng commented on PR #38310:
URL: https://github.com/apache/spark/pull/38310#issuecomment-1286804042

   thank you guys





[GitHub] [spark] tobiasstadler opened a new pull request, #38331: [SPARK-40869][K8S] Resource name prefix should not start with a -

2022-10-21 Thread GitBox


tobiasstadler opened a new pull request, #38331:
URL: https://github.com/apache/spark/pull/38331

   ### What changes were proposed in this pull request?
   Strip the leading `-` from the resource name prefix.
   
   
   ### Why are the changes needed?
   A leading `-` is not allowed in a resource name prefix (especially `spark.kubernetes.executor.podNamePrefix`).
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit test
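The fix can be illustrated with a small sketch (hypothetical Python, not the actual Scala change; `sanitize_prefix` is an illustrative name): Kubernetes resource names must begin with an alphanumeric character, so any leading `-` characters can simply be stripped.

```python
def sanitize_prefix(prefix: str) -> str:
    """Drop leading '-' characters so the prefix is a valid start of a
    Kubernetes resource name (names must begin alphanumerically)."""
    return prefix.lstrip("-")


# e.g. a prefix derived from an app name that starts with '-'
print(sanitize_prefix("--spark-app-exec"))  # spark-app-exec
```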





[GitHub] [spark] xingchaozh opened a new pull request, #38330: [SPARK-40868][SQL] Avoid introducing too many partitions when bucketed scan disabled by sql planner

2022-10-21 Thread GitBox


xingchaozh opened a new pull request, #38330:
URL: https://github.com/apache/spark/pull/38330

   ### What changes were proposed in this pull request?
   
   For bucketed tables with huge sizes, lots of partitions may be generated if the bucketed scan is disabled by the SQL planner.
   
   We can add a limit (default: BUCKETING_MAX_BUCKETS) to reduce the number of partitions for a non-bucketed scan.
   
   ### Why are the changes needed?
   Avoid introducing too many tasks in a single stage.
   
   Before
   
![image](https://user-images.githubusercontent.com/7522130/197170079-35ee13f2-1d6c-425b-bfdd-5c4ac3c8297c.png)
   
   After
   
![image](https://user-images.githubusercontent.com/7522130/197170156-556f839b-8fde-4f29-862b-5a6d90bd0a8e.png)
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   ### How was this patch tested?
   UT
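The idea can be sketched as grouping file splits so that at most a configured number of partitions is produced (a hypothetical Python sketch of the concept; `coalesce_splits` and its signature are illustrative, not Spark APIs):

```python
def coalesce_splits(splits, max_partitions):
    """Round-robin file splits into at most max_partitions groups, so a
    non-bucketed scan of a large bucketed table does not spawn one task
    per file split."""
    if max_partitions <= 0 or len(splits) <= max_partitions:
        # already within the cap: one partition per split
        return [[s] for s in splits]
    groups = [[] for _ in range(max_partitions)]
    for i, split in enumerate(splits):
        groups[i % max_partitions].append(split)
    return groups
```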





[GitHub] [spark] EnricoMi commented on pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-21 Thread GitBox


EnricoMi commented on PR #38223:
URL: https://github.com/apache/spark/pull/38223#issuecomment-1286734504

   I have also limited the schema string in the error message to 1024 characters, and the lists of missing and unexpected columns to 5 entries each.





[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-21 Thread GitBox


EnricoMi commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1001606273


##
python/pyspark/sql/pandas/serializers.py:
##
@@ -216,7 +216,7 @@ def _create_batch(self, series):
 series = [series]
 series = ((s, None) if not isinstance(s, (list, tuple)) else s for s 
in series)
 
-def create_array(s, t):
+def create_array(s, t, n):

Review Comment:
   Done.






[GitHub] [spark] plokhotnyuk commented on pull request #37604: [DON'T MERGE] Try to replace all `json4s` with `Jackson`

2022-10-21 Thread GitBox


plokhotnyuk commented on PR #37604:
URL: https://github.com/apache/spark/pull/37604#issuecomment-1286732347

   > [Jackson is much safer than 
json4s](https://github.com/json4s/json4s/search?q=Denial+of+Service=open=issues)
   > 
   
   I was wrong, it seems that jackson-core is vulnerable too: 
https://github.com/FasterXML/jackson-module-scala/issues/609





[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-21 Thread GitBox


EnricoMi commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1001564996


##
python/pyspark/sql/pandas/serializers.py:
##
@@ -216,7 +216,7 @@ def _create_batch(self, series):
 series = [series]
 series = ((s, None) if not isinstance(s, (list, tuple)) else s for s 
in series)
 
-def create_array(s, t):
+def create_array(s, t, n):

Review Comment:
   But `s[s.columns[i]].rename(field.name)` fixes that. Thanks for the idea!
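A toy illustration of the behaviour under discussion (assuming a result frame whose column labels are positional integers, as in the failing test):

```python
import pandas as pd

# a frame whose column labels are positional integers
pdf = pd.DataFrame({0: [1, 2], 1: [3.0, 4.0]})

s = pdf[pdf.columns[1]]
print(s.name)  # 1 -- the Series carries the positional label as its name

# renaming to the intended schema field name fixes the reported name
fixed = pdf[pdf.columns[1]].rename("mean")
print(fixed.name)  # mean
```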






[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-21 Thread GitBox


EnricoMi commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1001556844


##
python/pyspark/sql/pandas/serializers.py:
##
@@ -216,7 +216,7 @@ def _create_batch(self, series):
 series = [series]
 series = ((s, None) if not isinstance(s, (list, tuple)) else s for s 
in series)
 
-def create_array(s, t):
+def create_array(s, t, n):

Review Comment:
   I know, this parameter is ugly as it is only used for the exception, but 
`s.name` is `1` instead of `mean` (test 
`test_apply_in_pandas_returning_incompatible_type`) when I apply this change.
   
   Looks like `s[field.name]` and `s[s.columns[i]]` return a `Series` that thinks its name is `1`. Sad.






[GitHub] [spark-docker] Yikun commented on pull request #17: [SPARK-40864] Remove pip/setuptools dynamic upgrade

2022-10-21 Thread GitBox


Yikun commented on PR #17:
URL: https://github.com/apache/spark-docker/pull/17#issuecomment-1286673991

   @HyukjinKwon Thanks





[GitHub] [spark-docker] Yikun closed pull request #17: [SPARK-40864] Remove pip/setuptools dynamic upgrade

2022-10-21 Thread GitBox


Yikun closed pull request #17: [SPARK-40864] Remove pip/setuptools dynamic 
upgrade
URL: https://github.com/apache/spark-docker/pull/17





[GitHub] [spark] LuciferYang commented on pull request #38328: [SPARK-40863][BUILD] Upgrade dropwizard metrics 4.2.12

2022-10-21 Thread GitBox


LuciferYang commented on PR #38328:
URL: https://github.com/apache/spark/pull/38328#issuecomment-1286665503

   ```
   [info] ProtobufCatalystDataConversionSuite:
   [info] - single StructType(StructField(int32_type,IntegerType,true)) with 
seed 167 *** FAILED *** (39 milliseconds)
   [info]   Incorrect evaluation (codegen off): from_protobuf(to_protobuf([0], 
/home/runner/work/spark/spark/connector/protobuf/target/scala-2.12/test-classes/protobuf/catalyst_types.desc,
 IntegerMsg), 
/home/runner/work/spark/spark/connector/protobuf/target/scala-2.12/test-classes/protobuf/catalyst_types.desc,
 IntegerMsg), actual: [null], expected: [0] (ExpressionEvalHelper.scala:209)
   [info]   org.scalatest.exceptions.TestFailedException:
   [info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   [info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
   [info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
   [info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
   [info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564)
   [info]   at 
org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209)
   [info]   at 
org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199)
   [info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkEvaluationWithoutCodegen(ProtobufCatalystDataConversionSuite.scala:33)
   [info]   at 
org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87)
   [info]   at 
org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
   [info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkEvaluation(ProtobufCatalystDataConversionSuite.scala:33)
   [info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkResult(ProtobufCatalystDataConversionSuite.scala:43)
   [info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:122)
   [info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
   [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
   [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
   [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
   [info]   at 
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
   [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
   [info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66)
   [info]   at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
   [info]   at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
   [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:66)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
   [info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
   [info]   at scala.collection.immutable.List.foreach(List.scala:431)
   [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
   [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
   [info]   at org.scalatest.Suite.run(Suite.scala:1114)
   [info]   at org.scalatest.Suite.run$(Suite.scala:1096)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
   [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
   [info]   at 
   ```

[GitHub] [spark-docker] dcoliversun opened a new pull request, #19: [SPARK-40855] Add CONTRIBUTING.md for apache/spark-docker

2022-10-21 Thread GitBox


dcoliversun opened a new pull request, #19:
URL: https://github.com/apache/spark-docker/pull/19

   
   
   ### What changes were proposed in this pull request?
   
   This PR aims to add `CONTRIBUTING.md` for apache/spark-docker.
   
   ### Why are the changes needed?
   
   Better to briefly explain how to contribute the DOI (Docker Official Image).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   
![image](https://user-images.githubusercontent.com/44011673/197155544-bfae0c70-ee01-44b0-851d-ed5c288129d9.png)
   





[GitHub] [spark] grundprinzip commented on a diff in pull request #38300: [SPARK-40854] [CONNECT] Use proper JSON encoding until we have Arrow collection.

2022-10-21 Thread GitBox


grundprinzip commented on code in PR #38300:
URL: https://github.com/apache/spark/pull/38300#discussion_r1001541171


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala:
##
@@ -57,21 +60,67 @@ class SparkConnectStreamHandler(responseObserver: 
StreamObserver[Response]) exte
 processRows(request.getClientId, rows)
   }
 
-  private def processRows(clientId: String, rows: DataFrame) = {
+  private[connect] def processRows(clientId: String, rows: DataFrame): Unit = {
 val timeZoneId = SQLConf.get.sessionLocalTimeZone
-val schema =
-  ByteString.copyFrom(ArrowUtils.toArrowSchema(rows.schema, 
timeZoneId).toByteArray)
-
-val textSchema = rows.schema.fields.map(f => f.name).mkString("|")
-val data = rows.collect().map(x => x.toSeq.mkString("|")).mkString("\n")
-val bbb = proto.Response.CSVBatch.newBuilder
-  .setRowCount(-1)
-  .setData(textSchema ++ "\n" ++ data)
-  .build()
-val response = 
proto.Response.newBuilder().setClientId(clientId).setCsvBatch(bbb).build()
 
-// Send all the data
-responseObserver.onNext(response)
+// Only process up to 10MB of data.
+val sb = new StringBuilder
+var rowCount = 0
+rows.toJSON
+  .toLocalIterator()

Review Comment:
   I moved this back to `collect()`
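The size-capped batching discussed in this thread can be sketched as follows (a hypothetical Python sketch of the idea, not the actual Scala `processRows`; the 10 MB figure comes from the comment in the diff above):

```python
import json

MAX_BATCH_BYTES = 10 * 1024 * 1024  # cap taken from the comment in the diff


def batch_json_rows(rows, max_bytes=MAX_BATCH_BYTES):
    """Yield newline-delimited JSON batches, each at most max_bytes."""
    buf, size = [], 0
    for row in rows:
        line = json.dumps(row)
        # flush the current batch before it would exceed the cap
        if buf and size + len(line) + 1 > max_bytes:
            yield "\n".join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line) + 1  # +1 for the newline separator
    if buf:
        yield "\n".join(buf)
```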






[GitHub] [spark] grundprinzip commented on a diff in pull request #38320: [SPARK-40857] [CONNECT] Enable configurable GPRC Interceptors.

2022-10-21 Thread GitBox


grundprinzip commented on code in PR #38320:
URL: https://github.com/apache/spark/pull/38320#discussion_r1001531871


##
core/src/main/resources/error/error-classes.json:
##
@@ -698,6 +698,23 @@
 ],
 "sqlState" : "22023"
   },
+  "SPARK_CONNECT" : {

Review Comment:
   Done.






[GitHub] [spark] EnricoMi commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression

2022-10-21 Thread GitBox


EnricoMi commented on code in PR #38312:
URL: https://github.com/apache/spark/pull/38312#discussion_r1001531184


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##
@@ -271,6 +271,8 @@ class ParquetToSparkSchemaConverter(
 } else {
   TimestampNTZType
 }
+  case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit 
== TimeUnit.NANOS =>

Review Comment:
   So this is not a valid workaround. @sunchao what is your feeling about restoring the earlier useful behaviour?





