[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40283:
URL: https://github.com/apache/spark/pull/40283#issuecomment-1455028122

   To check, I added a `./build/mvn -version` step to the `java-11-17` GA task without this PR:
   
   https://github.com/LuciferYang/spark/actions/runs/4328951282/jobs/7559134224
   
   And it prints the following:
   
   https://user-images.githubusercontent.com/1475305/222950899-11ca9796-11b4-400f-9a89-5f22b9367b2b.png
   





[GitHub] [spark] beliefer commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`

2023-03-05 Thread via GitHub


beliefer commented on code in PR #39091:
URL: https://github.com/apache/spark/pull/39091#discussion_r1125655121


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -598,3 +599,18 @@ message ToSchema {
   // The Server side will update the dataframe with this schema.
   DataType schema = 2;
 }
+
+// Collect arbitrary (named) metrics from a dataset.
+message CollectMetrics {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) Name of the metrics.
+  string name = 2;
+
+  // (Required) The metric sequence.
+  repeated Expression metrics = 3;
+
+  // (Optional) Indicates whether an Observation is used.
+  optional bool is_observation = 4;

Review Comment:
   OK
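
   For context, a minimal sketch of the user-facing API this proto backs, `Dataset.observe`, on the Scala side (`df` is a placeholder; the sketch assumes the Connect client mirrors the existing server-side API):
   
       import org.apache.spark.sql.Observation
       import org.apache.spark.sql.functions.{col, count, lit, max}
   
       // Named metrics are computed while the query runs and reported back
       // under the given name, which is what CollectMetrics.name carries.
       val observed = df.observe("my_metrics", count(lit(1)).as("rows"), max(col("id")).as("max_id"))
   
       // The Observation helper variant corresponds to is_observation = true.
       val obs = Observation("my_obs")
       val observed2 = df.observe(obs, count(lit(1)).as("rows"))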






[GitHub] [spark] LuciferYang commented on a diff in pull request #40278: [SPARK-42670][BUILD] Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-05 Thread via GitHub


LuciferYang commented on code in PR #40278:
URL: https://github.com/apache/spark/pull/40278#discussion_r1125655202


##
pom.xml:
##
@@ -2957,7 +2957,7 @@
   ${test.java.home}
   -DmyKey=yourValue
 
-
+

Review Comment:
   Are you using Maven 3.9.0? This warning log should not appear with 3.8.7.
   
   






[GitHub] [spark] LuciferYang commented on a diff in pull request #40278: [SPARK-42670][BUILD] Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-05 Thread via GitHub


LuciferYang commented on code in PR #40278:
URL: https://github.com/apache/spark/pull/40278#discussion_r1125660591


##
pom.xml:
##
@@ -2957,7 +2957,7 @@
   ${test.java.home}
   -DmyKey=yourValue
 
-
+

Review Comment:
   And I think we should make this change after finishing SPARK-42380, if this warning log does not appear with Maven 3.8.7.






[GitHub] [spark] itholic commented on a diff in pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-05 Thread via GitHub


itholic commented on code in PR #40282:
URL: https://github.com/apache/spark/pull/40282#discussion_r1125617055


##
python/docs/source/development/errors.rst:
##
@@ -0,0 +1,92 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===
+Error conditions in PySpark
+===
+
+This is a list of common, named error conditions returned by PySpark which are 
defined at `error_classes.py 
`_.
+
+When writing PySpark errors, developers must use an error condition from the 
list. If an appropriate error condition is not available, add a new one into 
the list. For more information, please refer to `Contributing Error and 
Exception 
`_.
+
+++--+

Review Comment:
   If more PySpark error classes are added in the future, it might be good to automate the table generation from error_classes.py.
   
   But for now, I manually list the currently confirmed error classes for quick inclusion in the 3.4 release.






[GitHub] [spark] panbingkun opened a new pull request, #40284: [SPARK-42674][BUILD] Upgrade scalafmt from 3.7.1 to 3.7.2

2023-03-05 Thread via GitHub


panbingkun opened a new pull request, #40284:
URL: https://github.com/apache/spark/pull/40284

   ### What changes were proposed in this pull request?
   This PR aims to upgrade scalafmt from 3.7.1 to 3.7.2.
   
   ### Why are the changes needed?
   A. Release note:
   > https://github.com/scalameta/scalafmt/releases
   
   B. V3.7.1 VS V3.7.2
   > https://github.com/scalameta/scalafmt/compare/v3.7.1...v3.7.2
   
   C. It brings bug fixes & some improvements:
   https://user-images.githubusercontent.com/15246973/222950202-76fea081-3b0b-458b-b963-248e472107f4.png
   https://user-images.githubusercontent.com/15246973/222950245-47dab2a4-df28-4b31-9a05-0e1ccc8b5acc.png
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA.
   





[GitHub] [spark] HeartSaVioR commented on pull request #40273: [SPARK-42668][SS] Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort

2023-03-05 Thread via GitHub


HeartSaVioR commented on PR #40273:
URL: https://github.com/apache/spark/pull/40273#issuecomment-1455022742

   Mind retriggering the build, please? Probably the simplest way to do this is to push an empty commit. You can retrigger the build in your fork, but it won't be reflected here.
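
   For reference, a common way to do that locally (plain git; the branch name is whatever your PR branch is called):
   
       git commit --allow-empty -m "Empty commit to retrigger CI"
       git push origin <your-pr-branch>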





[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40283:
URL: https://github.com/apache/spark/pull/40283#issuecomment-1455041804

   No error message with this PR:
https://pipelines.actions.githubusercontent.com/serviceHosts/c184045e-b556-4e78-b8ef-fb37b2eda9a3/_apis/pipelines/1/runs/63900/signedlogcontent/13?urlExpires=2023-03-05T09%3A51%3A45.4115558Z=HMACV1=XozaYUDlJf1l1lrkE7RBU03%2Fy4e0uIIs%2B4VlRuLa5so%3D





[GitHub] [spark] beliefer commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on PR #40275:
URL: https://github.com/apache/spark/pull/40275#issuecomment-1455071764

   @LuciferYang Thank you.





[GitHub] [spark] beliefer commented on a diff in pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on code in PR #40275:
URL: https://github.com/apache/spark/pull/40275#discussion_r1125652836


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -1228,6 +1228,22 @@ object functions {
   def map_from_arrays(keys: Column, values: Column): Column =
 Column.fn("map_from_arrays", keys, values)
 
+  /**
+   * Marks a DataFrame as small enough for use in broadcast joins.
+   *
+   * The following example marks the right DataFrame for broadcast hash join 
using `joinKey`.
+   * {{{
+   *   // left and right are DataFrames
+   *   left.join(broadcast(right), "joinKey")
+   * }}}
+   *
+   * @group normal_funcs
+   * @since 3.4.0
+   */
+  def broadcast[T](df: Dataset[T]): Dataset[T] = {
+df.hint("broadcast")

Review Comment:
   Yes.






[GitHub] [spark] panbingkun commented on a diff in pull request #40278: [SPARK-42670][BUILD] Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-05 Thread via GitHub


panbingkun commented on code in PR #40278:
URL: https://github.com/apache/spark/pull/40278#discussion_r1125659474


##
pom.xml:
##
@@ -2957,7 +2957,7 @@
   ${test.java.home}
   -DmyKey=yourValue
 
-
+

Review Comment:
   Yes:
   
   https://user-images.githubusercontent.com/15246973/222961036-5568645a-c681-4889-b7cf-08a2f1ecb0fc.png
   
   https://user-images.githubusercontent.com/15246973/222961017-5e1e4cf8-19a3-44f8-8484-6791b66d34d3.png
   






[GitHub] [spark] LuciferYang opened a new pull request, #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`

2023-03-05 Thread via GitHub


LuciferYang opened a new pull request, #40285:
URL: https://github.com/apache/spark/pull/40285

   ### What changes were proposed in this pull request?
   This PR just drops the temp view created by the `test temp view` test in `ClientE2ETestSuite`.
   
   
   ### Why are the changes needed?
   Drop the temp view after the test to clean up resources.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions





[GitHub] [spark] LuciferYang commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1455094955

   On the whole, it looks good to me. There is only one question: Spark still uses Maven for version release and deploy, but after this PR the E2E test uses the sbt assembly server jar instead of the Maven shaded server jar, which may weaken the Maven testing. We may need other ways to ensure the correctness of the Maven shaded server jar.
   
   In the future, we may use sbt to completely replace Maven (this should not happen in Spark 3.4.0), including version release, deploy, and other helper tools; at that point this will no longer be a problem.
   
   
   





[GitHub] [spark] itholic opened a new pull request, #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-05 Thread via GitHub


itholic opened a new pull request, #40282:
URL: https://github.com/apache/spark/pull/40282

   ### What changes were proposed in this pull request?
   
   This PR proposes to document the PySpark error class list into a part of 
[Development](https://spark.apache.org/docs/latest/api/python/development/index.html)
 page.
   
   
   ### Why are the changes needed?
   
   To make it easier for developers to find error classes.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's documentation only.
   
   
   ### How was this patch tested?
   
   
   Manually built the docs by running `make clean html`.





[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40283:
URL: https://github.com/apache/spark/pull/40283#issuecomment-1455027425

   for example:
   
   - https://github.com/apache/spark/actions/runs/4329598884/jobs/7560252913
   - https://github.com/apache/spark/actions/runs/4329598884/jobs/7560252970
   - https://github.com/apache/spark/actions/runs/4329004998/jobs/7559232726
   - https://github.com/apache/spark/actions/runs/4329004998/jobs/7559232794
   
   https://user-images.githubusercontent.com/1475305/222950677-95c48561-924d-45c7-a59b-23f66a997af3.png
   





[GitHub] [spark] beliefer commented on a diff in pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on code in PR #40275:
URL: https://github.com/apache/spark/pull/40275#discussion_r1125651656


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
##
@@ -489,17 +489,31 @@ class ClientE2ETestSuite extends RemoteSparkSession {
   }
 
   test("ambiguous joins") {
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
 val left = spark.range(100).select(col("id"), rand(10).as("a"))
 val right = spark.range(100).select(col("id"), rand(12).as("a"))
 val joined = left.join(right, left("id") === 
right("id")).select(left("id"), right("a"))
 assert(joined.schema.catalogString === "struct")
+testCapturedStdOut(joined.explain(), "BroadcastHashJoin")
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
 
 val joined2 = left
   .join(right, left.colRegex("id") === right.colRegex("id"))
   .select(left("id"), right("a"))
 assert(joined2.schema.catalogString === "struct")
   }
 
+  test("broadcast join") {
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
+val left = spark.range(100).select(col("id"), rand(10).as("a"))
+val right = spark.range(100).select(col("id"), rand(12).as("a"))
+val joined =
+  left.join(broadcast(right), left("id") === 
right("id")).select(left("id"), right("a"))
+assert(joined.schema.catalogString === "struct")
+testCapturedStdOut(joined.explain(), "BroadcastHashJoin")
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")

Review Comment:
   @LuciferYang It's a good idea. Let's do it later.







[GitHub] [spark] itholic commented on a diff in pull request #40236: [SPARK-38735][SQL][TESTS] Add tests for the error class: INTERNAL_ERROR

2023-03-05 Thread via GitHub


itholic commented on code in PR #40236:
URL: https://github.com/apache/spark/pull/40236#discussion_r1125682909


##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala:
##
@@ -765,6 +770,58 @@ class QueryExecutionErrorsSuite
   )
 }
   }
+
+  test("INTERNAL_ERROR: Calling eval on Unevaluable expression") {
+val e = intercept[SparkException] {
+  Parameter("foo").eval()
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map("message" -> "Cannot evaluate expression: 
parameter(foo)"),
+  sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Calling doGenCode on unresolved") {
+val e = intercept[SparkException] {
+  val ctx = new CodegenContext
+  Grouping(Parameter("foo")).genCode(ctx)
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map(
+"message" -> ("Cannot generate code for expression: " +
+  "grouping(parameter(foo))")),
+  sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Calling terminate on UnresolvedGenerator") {
+val e = intercept[SparkException] {
+  UnresolvedGenerator(FunctionIdentifier("foo"), Seq.empty).terminate()
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map("message" -> "Cannot terminate expression: 'foo()"),
+  sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Initializing JavaBean with non existing method") {
+val e = intercept[SparkException] {
+  val initializeWithNonexistingMethod = InitializeJavaBean(
+Literal.fromObject(new java.util.LinkedList[Int]),
+Map("nonexistent" -> Literal(1)))
+  initializeWithNonexistingMethod.eval()
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map(
+"message" -> ("""A method named "nonexistent" is not declared in """ +
+  "any enclosing class nor any supertype")),
+  sqlState = "XX000")

Review Comment:
   It's fine if it's already tested :-)






[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40283:
URL: https://github.com/apache/spark/pull/40283#issuecomment-1455043698

   cc @dongjoon-hyun 





[GitHub] [spark] beliefer commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on PR #40275:
URL: https://github.com/apache/spark/pull/40275#issuecomment-1455071084

   @LuciferYang I want to support something similar to `withSQLConf`.
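
   For reference, a minimal sketch of what such a helper could look like in the client test suite (assuming the client-side `spark.conf` exposes `getOption`/`set`/`unset` like the server-side RuntimeConfig; this is a sketch, not the committed implementation):
   
       def withSQLConf[T](pairs: (String, String)*)(body: => T): T = {
         // Remember the current values so they can be restored afterwards.
         val saved = pairs.map { case (k, _) => k -> spark.conf.getOption(k) }
         pairs.foreach { case (k, v) => spark.conf.set(k, v) }
         try body
         finally saved.foreach {
           case (k, Some(old)) => spark.conf.set(k, old)
           case (k, None) => spark.conf.unset(k)
         }
       }
   
   With that, the broadcast test above could wrap its body in withSQLConf("spark.sql.autoBroadcastJoinThreshold" -> "-1") { ... } instead of setting and resetting the conf by hand.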





[GitHub] [spark] LuciferYang commented on a diff in pull request #40278: [SPARK-42670][BUILD] Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-05 Thread via GitHub


LuciferYang commented on code in PR #40278:
URL: https://github.com/apache/spark/pull/40278#discussion_r1125659843


##
pom.xml:
##
@@ -2957,7 +2957,7 @@
   ${test.java.home}
   -DmyKey=yourValue
 
-
+

Review Comment:
   I think we don't need to make this change for the time being. Apache Spark has not committed to using Maven 3.9.0 for build and test.
   
   






[GitHub] [spark] LuciferYang commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1455105130

   There is another problem that needs to be confirmed, which may not be related to the current PR: if other suites inherit `RemoteSparkSession`, will they share the same connect server? (`SparkConnectServerUtils` is an object, so `SparkConnect` will only be submitted once.)
   
   





[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-05 Thread via GitHub


FurcyPin commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1125695676


##
python/pyspark/sql/functions.py:
##
@@ -22,20 +22,10 @@
 import sys
 import functools
 import warnings
-from typing import (
-Any,
-cast,

Review Comment:
   I agree that the change seemed quite cumbersome. It made me wish Python had 
some kind of "private import" keyword to handle such cases more easily.
   
   I agree with you that `cast` is the only name that might be confusing (perhaps `overload` too, but all the other names start with an uppercase). The `functions` module feels a little special to me because it is the module I use the most as a Spark user; it's definitely a public API. The 201 other modules don't require such a change.
   
   It's out of the scope of this MR, but perhaps for the long term you could 
consider reorganizing this module [the same way as I did in one of my 
projects](https://github.com/FurcyPin/spark-frame/blob/main/spark_frame/transformations.py),
 which had two advantages:
   - the code of each method was isolated in a separate file (that would prevent having a 10,000-line file)
   - there was no import pollution
   
   For now, I'll do as you suggest: only handle `typing.cast` as a special case 
and add a unit test to make sure it does not get imported again.






[GitHub] [spark] LuciferYang opened a new pull request, #40283: [SPARK-42673][BUILD] Ban 3.9.x for Spark Maven build

2023-03-05 Thread via GitHub


LuciferYang opened a new pull request, #40283:
URL: https://github.com/apache/spark/pull/40283

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] beliefer commented on a diff in pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on code in PR #40275:
URL: https://github.com/apache/spark/pull/40275#discussion_r1125652564


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
##
@@ -489,17 +489,31 @@ class ClientE2ETestSuite extends RemoteSparkSession {
   }
 
   test("ambiguous joins") {
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
 val left = spark.range(100).select(col("id"), rand(10).as("a"))
 val right = spark.range(100).select(col("id"), rand(12).as("a"))
 val joined = left.join(right, left("id") === 
right("id")).select(left("id"), right("a"))
 assert(joined.schema.catalogString === "struct")
+testCapturedStdOut(joined.explain(), "BroadcastHashJoin")
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
 
 val joined2 = left
   .join(right, left.colRegex("id") === right.colRegex("id"))
   .select(left("id"), right("a"))
 assert(joined2.schema.catalogString === "struct")
   }
 
+  test("broadcast join") {
+spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
+val left = spark.range(100).select(col("id"), rand(10).as("a"))
+val right = spark.range(100).select(col("id"), rand(12).as("a"))
+val joined =
+  left.join(broadcast(right), left("id") === 
right("id")).select(left("id"), right("a"))
+assert(joined.schema.catalogString === "struct")
+testCapturedStdOut(joined.explain(), "BroadcastHashJoin")

Review Comment:
   Yeah.






[GitHub] [spark] the8thC commented on a diff in pull request #40236: [SPARK-38735][SQL][TESTS] Add tests for the error class: INTERNAL_ERROR

2023-03-05 Thread via GitHub


the8thC commented on code in PR #40236:
URL: https://github.com/apache/spark/pull/40236#discussion_r1125670340


##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala:
##
@@ -765,6 +770,58 @@ class QueryExecutionErrorsSuite
   )
 }
   }
+
+  test("INTERNAL_ERROR: Calling eval on Unevaluable expression") {
+val e = intercept[SparkException] {
+  Parameter("foo").eval()
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map("message" -> "Cannot evaluate expression: 
parameter(foo)"),
+  sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Calling doGenCode on unresolved") {
+val e = intercept[SparkException] {
+  val ctx = new CodegenContext
+  Grouping(Parameter("foo")).genCode(ctx)
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map(
+"message" -> ("Cannot generate code for expression: " +
+  "grouping(parameter(foo))")),
+  sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Calling terminate on UnresolvedGenerator") {
+val e = intercept[SparkException] {
+  UnresolvedGenerator(FunctionIdentifier("foo"), Seq.empty).terminate()
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map("message" -> "Cannot terminate expression: 'foo()"),
+  sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Initializing JavaBean with non existing method") {
+val e = intercept[SparkException] {
+  val initializeWithNonexistingMethod = InitializeJavaBean(
+Literal.fromObject(new java.util.LinkedList[Int]),
+Map("nonexistent" -> Literal(1)))
+  initializeWithNonexistingMethod.eval()
+}
+checkError(
+  exception = e,
+  errorClass = "INTERNAL_ERROR",
+  parameters = Map(
+"message" -> ("""A method named "nonexistent" is not declared in """ +
+  "any enclosing class nor any supertype")),
+  sqlState = "XX000")

Review Comment:
   @itholic Yes, thank you. I actually noticed that there's already a test 
present for this error in RateStreamProviderSuite.scala:
   
https://github.com/apache/spark/blob/edafe266144c5c70852491fef9bb6907a001b286/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamProviderSuite.scala#L217-L237
   
   I'm not sure if it makes sense to move it or do anything more at all with 
this error, what do you think?






[GitHub] [spark] ivoson opened a new pull request, #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-05 Thread via GitHub


ivoson opened a new pull request, #40286:
URL: https://github.com/apache/spark/pull/40286

   ### What changes were proposed in this pull request?
   Currently a stage will be resubmitted in a few scenarios:
   1. A task failed with `FetchFailed`, which triggers a stage re-submit;
   2. A barrier task failed;
   3. Shuffle data was lost because an executor/host was decommissioned.
   
   For the first 2 scenarios, there is a config `spark.stage.maxConsecutiveAttempts` to limit the retry times. For the 3rd scenario, there is a potential risk of infinite retry: if executors hosting the shuffle data from successful tasks keep getting killed/lost, the stage will be re-run again and again.
   
   To avoid this risk, the proposal in this PR is to add a new config that limits the overall max number of attempts for each stage; the stage will be aborted once its retries go beyond the limitation (see the sketch below).
   
   ### Why are the changes needed?
   To avoid the potential risk of infinite stage retry.
   
   ### Does this PR introduce _any_ user-facing change?
   Added a limitation on stage retry times, so jobs may fail if they need to retry more times than the limitation allows.
   
   ### How was this patch tested?
   Added new UT.
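
   For illustration, a sketch of how the two limits would sit together in a session (the new config name `spark.stage.maxAttempts` is an assumption here, not confirmed by this description):
   
       import org.apache.spark.sql.SparkSession
   
       val spark = SparkSession.builder()
         .appName("stage-retry-limits")
         // Existing cap on consecutive attempts (fetch failures, barrier task failures).
         .config("spark.stage.maxConsecutiveAttempts", "4")
         // Proposed overall cap across all retry causes, including shuffle loss
         // from decommissioned executors/hosts. Name is hypothetical, see above.
         .config("spark.stage.maxAttempts", "8")
         .getOrCreate()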
   





[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-05 Thread via GitHub


FurcyPin commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1125698656


##
python/pyspark/sql/functions.py:
##
@@ -22,20 +22,10 @@
 import sys
 import functools
 import warnings
-from typing import (
-Any,
-cast,

Review Comment:
   In the end, I went for `from typing import cast as _cast` which makes the 
intent even more explicit, I think.






[GitHub] [spark] wangyum commented on pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`

2023-03-05 Thread via GitHub


wangyum commented on PR #40285:
URL: https://github.com/apache/spark/pull/40285#issuecomment-1455258629

   Merged to master and branch-3.4.





[GitHub] [spark] github-actions[bot] commented on pull request #38736: [SPARK-41214][SQL] - SQL Metrics are missing from Spark UI when AQE for Cached DataFrame is enabled

2023-03-05 Thread via GitHub


github-actions[bot] commented on PR #38736:
URL: https://github.com/apache/spark/pull/38736#issuecomment-1455262719

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





[GitHub] [spark] github-actions[bot] closed pull request #36265: [SPARK-38951][SQL] Aggregate aliases override field names in ResolveAggregateFunctions

2023-03-05 Thread via GitHub


github-actions[bot] closed pull request #36265: [SPARK-38951][SQL] Aggregate 
aliases override field names in ResolveAggregateFunctions
URL: https://github.com/apache/spark/pull/36265





[GitHub] [spark] wangyum closed pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`

2023-03-05 Thread via GitHub


wangyum closed pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp 
view after test `test temp view`
URL: https://github.com/apache/spark/pull/40285





[GitHub] [spark] beliefer opened a new pull request, #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-05 Thread via GitHub


beliefer opened a new pull request, #40287:
URL: https://github.com/apache/spark/pull/40287

   ### What changes were proposed in this pull request?
   `UnresolvedNamedLambdaVariable` does not need unique names in Python. We already did this for the Scala client, and it is good to have parity between the two implementations.
   
   
   ### Why are the changes needed?
   Try to avoid unique names for UnresolvedNamedLambdaVariable.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature
   
   
   ### How was this patch tested?
   N/A
   





[GitHub] [spark] hvanhovell closed pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions

2023-03-05 Thread via GitHub


hvanhovell closed pull request #40255: [SPARK-42558][CONNECT] Implement 
`DataFrameStatFunctions` except `bloomFilter` functions
URL: https://github.com/apache/spark/pull/40255





[GitHub] [spark] cloud-fan commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-03-05 Thread via GitHub


cloud-fan commented on PR #40064:
URL: https://github.com/apache/spark/pull/40064#issuecomment-1455335925

   @Yikf can you help to open a backport PR for 3.2/3.3? Thanks!





[GitHub] [spark] hvanhovell commented on pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40217:
URL: https://github.com/apache/spark/pull/40217#issuecomment-1455351159

   @panbingkun can you update the CompatibilitySuite?





[GitHub] [spark] amaliujia commented on pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect

2023-03-05 Thread via GitHub


amaliujia commented on PR #40228:
URL: https://github.com/apache/spark/pull/40228#issuecomment-1455359011

   @hvanhovell waiting for CI





[GitHub] [spark] Yikf commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-03-05 Thread via GitHub


Yikf commented on PR #40064:
URL: https://github.com/apache/spark/pull/40064#issuecomment-1455364691

   > @Yikf can you help to open a backport PR for 3.2/3.3? Thanks!
   
   Sure





[GitHub] [spark] anishshri-db commented on pull request #40273: [SPARK-42668][SS] Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort

2023-03-05 Thread via GitHub


anishshri-db commented on PR #40273:
URL: https://github.com/apache/spark/pull/40273#issuecomment-1455371384

   > Mind retriggering the build, please? Probably the simplest way to do this is to push an empty commit. You can retrigger the build in your fork, but it won't be reflected here.
   
   Sure, done.





[GitHub] [spark] LuciferYang commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-05 Thread via GitHub


LuciferYang commented on code in PR #40218:
URL: https://github.com/apache/spark/pull/40218#discussion_r1125837371


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala:
##
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+    toLiteralProtoBuilder(literal).build()
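
   A quick usage sketch of the converter quoted above (illustrative only, using the two entry points shown in this diff):
   
       // Scalar literals map to the matching proto field...
       val intLit = LiteralProtoConverter.toLiteralProto(42)
   
       // ...while Array[_] goes through arrayBuilder, encoding the element type
       // plus each element as a nested literal.
       val arrLit = LiteralProtoConverter.toLiteralProto(Array(1, 2, 3))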

[GitHub] [spark] LuciferYang commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-05 Thread via GitHub


LuciferYang commented on code in PR #40218:
URL: https://github.com/apache/spark/pull/40218#discussion_r1125837371


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala:
##
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+

[GitHub] [spark] wangyum commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering

2023-03-05 Thread via GitHub


wangyum commented on PR #38358:
URL: https://github.com/apache/spark/pull/38358#issuecomment-1455371977

   @EnricoMi It seems that, after this change, the table location is removed 
if a `java.lang.ArithmeticException` is thrown.
   How to reproduce:
   ```scala
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   import org.apache.spark.sql.QueryTest
   import org.apache.spark.sql.catalyst.TableIdentifier
   
   sql("CREATE TABLE IF NOT EXISTS spark32_overwrite(amt1 int) STORED AS ORC")
   sql("CREATE TABLE IF NOT EXISTS spark32_overwrite2(amt1 long) STORED AS ORC")
   sql("INSERT OVERWRITE TABLE spark32_overwrite2 select 644164")
   sql("set spark.sql.ansi.enabled=true")
   
   val loc =
     spark.sessionState.catalog.getTableMetadata(TableIdentifier("spark32_overwrite")).location
   
   val fs = FileSystem.get(loc, spark.sparkContext.hadoopConfiguration)
   println("Location exists: " + fs.exists(new Path(loc)))
   
   try {
     sql("INSERT OVERWRITE TABLE spark32_overwrite select amt1 from " +
       "(select cast(amt1 as int) as amt1 from spark32_overwrite2 distribute by amt1)")
   } finally {
     println("Location exists: " + fs.exists(new Path(loc)))
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-05 Thread via GitHub


beliefer commented on PR #40291:
URL: https://github.com/apache/spark/pull/40291#issuecomment-1455384866

   @hvanhovell It seems there is no way to add test cases for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] anishshri-db commented on pull request #40292: [SPARK-42676] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently

2023-03-05 Thread via GitHub


anishshri-db commented on PR #40292:
URL: https://github.com/apache/spark/pull/40292#issuecomment-1455397903

   @HeartSaVioR - please take a look. Thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40291:
URL: https://github.com/apache/spark/pull/40291#issuecomment-1455425240

   hmmm - let me think about it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #40281: [SPARK-41497][CORE][Follow UP]Modify config `spark.rdd.cache.visibilityTracking.enabled` support version to 3.5.0

2023-03-05 Thread via GitHub


HyukjinKwon closed pull request #40281: [SPARK-41497][CORE][Follow UP]Modify 
config `spark.rdd.cache.visibilityTracking.enabled` support version to 3.5.0
URL: https://github.com/apache/spark/pull/40281


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-05 Thread via GitHub


itholic commented on PR #40271:
URL: https://github.com/apache/spark/pull/40271#issuecomment-1455275958

   Looks good otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-05 Thread via GitHub


itholic commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1125771590


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1268,6 +1268,12 @@ def test_bucket(self):
             message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
         )
 
+    def test_no_cast(self):

Review Comment:
   How about adding a `test_cast` with practical cases?
   
   Seems like the test for `functions.cast` is missing (and that's also the 
root reason why we haven't noticed this was wrong until now).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`

2023-03-05 Thread via GitHub


beliefer commented on code in PR #39091:
URL: https://github.com/apache/spark/pull/39091#discussion_r1125777299


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -338,6 +340,22 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }
 
+  private def transformCollectMetrics(rel: proto.CollectMetrics): LogicalPlan = {
+    val metrics = rel.getMetricsList.asScala.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getIsObservation) {

Review Comment:
   @hvanhovell is_observation has been removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40279: [MINOR][CONNECT] Remove unused protobuf imports to eliminate build warnings

2023-03-05 Thread via GitHub


hvanhovell closed pull request #40279: [MINOR][CONNECT] Remove unused protobuf 
imports to eliminate build warnings
URL: https://github.com/apache/spark/pull/40279


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40218:
URL: https://github.com/apache/spark/pull/40218#discussion_r1125817796


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##
@@ -130,4 +138,61 @@ object LiteralValueProtoConverter {
       case o => throw new Exception(s"Unsupported value type: $o")
     }
   }
+
+  private def toArrayData(array: proto.Expression.Literal.Array): Any = {
+    def makeArrayData[T](converter: proto.Expression.Literal => T)(implicit
+        tag: ClassTag[T]): Array[T] = {

Review Comment:
   yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40218:
URL: https://github.com/apache/spark/pull/40218#discussion_r1125820525


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala:
##
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+

[GitHub] [spark] Yikf opened a new pull request, #40289: [SPARK-42478][SQL][3.2] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-03-05 Thread via GitHub


Yikf opened a new pull request, #40289:
URL: https://github.com/apache/spark/pull/40289

   This is a backport of https://github.com/apache/spark/pull/40064
   
   
   ### What changes were proposed in this pull request?
   
   Make a serializable jobTrackerId instead of a non-serializable JobID in 
FileWriterFactory
   
   ### Why are the changes needed?
   
   [SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) made the MR 
job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a 
serialization issue: JobID is not serializable.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   GA
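   
   For context, a minimal sketch of the idea behind the fix (hypothetical 
names, not the actual patch): ship the tracker ID as a plain `String` and 
rebuild the non-serializable `JobID` lazily on the executor side.
   
   ```scala
   import org.apache.hadoop.mapreduce.JobID
   
   // Hypothetical factory: only the String travels in the task closure.
   case class WriterFactory(jobTrackerId: String) extends Serializable {
     // The JobID is rebuilt where it is used, so it is never serialized.
     def createJobId(partitionId: Int): JobID = new JobID(jobTrackerId, partitionId)
   }
   ```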


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-05 Thread via GitHub


beliefer commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1455384317

   @hvanhovell After my test, `python/run-tests --testnames 
'pyspark.sql.connect.dataframe'` does not pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-05 Thread via GitHub


LuciferYang commented on code in PR #40218:
URL: https://github.com/apache/spark/pull/40218#discussion_r1125852404


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala:
##
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+

[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-05 Thread via GitHub


beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1125854126


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;

Review Comment:
   But the transform path is very different from the `DataSource` one.
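   
   For reference, the user-facing API this message maps to is the existing 
`DataFrameReader.jdbc` overload that takes per-partition predicates (the URL 
and table below are just placeholders):
   
   ```scala
   import java.util.Properties
   
   // Each predicate becomes the WHERE clause of one partition's query.
   val predicates = Array("id < 100", "id >= 100 AND id < 200", "id >= 200")
   val df = spark.read.jdbc(
     "jdbc:postgresql://host/db", "people", predicates, new Properties())
   ```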



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] anishshri-db opened a new pull request, #40292: [SPARK-42676] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently

2023-03-05 Thread via GitHub


anishshri-db opened a new pull request, #40292:
URL: https://github.com/apache/spark/pull/40292

   ### What changes were proposed in this pull request?
   Write temp checkpoints for streaming queries to local filesystem even if 
default FS is set differently
   
   ### Why are the changes needed?
   We have seen cases where the default FS could be a remote file system, and 
since the path for streaming checkpoints is not specified explicitly, this 
could cause pileup in two cases:
   
   - query exits with exception and the flag to force checkpoint removal is not 
set
   - driver/cluster terminates without query being terminated gracefully
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Verified that the checkpoint is resolved and written to the local FS
   
   ```
   23/03/04 01:42:49 INFO ResolveWriteToStream: Checkpoint root file:/local_disk0/tmp/temporary-c97ab8bd-6b03-4c28-93ea-751d30a2d3f9 resolved to file:/local_disk0/tmp/temporary-c97ab8bd-6b03-4c28-93ea-751d30a2d3f9.
   ...
   23/03/04 01:46:37 INFO MicroBatchExecution: [queryId = 66c4c] Deleting checkpoint file:/local_disk0/tmp/temporary-c97ab8bd-6b03-4c28-93ea-751d30a2d3f9.
   ```
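   
   A minimal sketch of the resolution idea (assumed helper name, not the exact 
patch): qualify the temp checkpoint path against the local filesystem instead 
of `fs.defaultFS`.
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   // Always resolves against file:///, regardless of what fs.defaultFS is.
   def resolveTempCheckpoint(dir: String, conf: Configuration): Path =
     FileSystem.getLocal(conf).makeQualified(new Path(dir))
   ```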


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect

2023-03-05 Thread via GitHub


zhengruifeng commented on PR #40228:
URL: https://github.com/apache/spark/pull/40228#issuecomment-1455466444

   merged into master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build

2023-03-05 Thread via GitHub


pan3793 commented on code in PR #40283:
URL: https://github.com/apache/spark/pull/40283#discussion_r1125930947


##
build/mvn:
##
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then

Review Comment:
   I'm changing `-lt` to `-ne`, and always respecting the `maven.version` 
defined in `pom.xml`.



##
build/mvn:
##
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then

Review Comment:
   I mean: change `-lt` to `-ne`, and always respect the `maven.version` 
defined in `pom.xml`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-05 Thread via GitHub


HyukjinKwon commented on PR #40282:
URL: https://github.com/apache/spark/pull/40282#issuecomment-1455270795

   cc @MaxGekk and @srielau 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #40284: [SPARK-42674][BUILD] Upgrade scalafmt from 3.7.1 to 3.7.2

2023-03-05 Thread via GitHub


HyukjinKwon closed pull request #40284: [SPARK-42674][BUILD] Upgrade scalafmt 
from 3.7.1 to 3.7.2
URL: https://github.com/apache/spark/pull/40284


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-05 Thread via GitHub


itholic commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1125771590


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1268,6 +1268,12 @@ def test_bucket(self):
             message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
         )
 
+    def test_no_cast(self):

Review Comment:
   How about adding a `test_cast`?
   
   Seems like the test for `functions.cast` is missing (and that's also the 
root reason why we haven't noticed this was wrong until now).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`

2023-03-05 Thread via GitHub


beliefer commented on PR #39091:
URL: https://github.com/apache/spark/pull/39091#issuecomment-1455279364

   > @beliefer can you please remove the is_observation code path? And take 
another look at the protocol. Otherwise I think it looks good.
   
   is_observation code path has been removed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40280:
URL: https://github.com/apache/spark/pull/40280#discussion_r1125800378


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -115,7 +115,7 @@ class SparkSession private[sql] (
   private def createDataset[T](encoder: AgnosticEncoder[T], data: 
Iterator[T]): Dataset[T] = {
 newDataset(encoder) { builder =>
   val localRelationBuilder = builder.getLocalRelationBuilder
-.setSchema(encoder.schema.catalogString)

Review Comment:
   json is ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

2023-03-05 Thread via GitHub


hvanhovell closed pull request #40280: [SPARK-42671][CONNECT] Fix bug for 
createDataFrame from complex type schema
URL: https://github.com/apache/spark/pull/40280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


hvanhovell closed pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to 
functions
URL: https://github.com/apache/spark/pull/40275


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40255:
URL: https://github.com/apache/spark/pull/40255#issuecomment-1455323028

   Merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40275:
URL: https://github.com/apache/spark/pull/40275#issuecomment-1455321694

   Merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`

2023-03-05 Thread via GitHub


hvanhovell commented on PR #39091:
URL: https://github.com/apache/spark/pull/39091#issuecomment-1455327845

   Merging to master/3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`

2023-03-05 Thread via GitHub


hvanhovell closed pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement 
`DataFrame.observe`
URL: https://github.com/apache/spark/pull/39091


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40254:
URL: https://github.com/apache/spark/pull/40254#issuecomment-1455328473

   friendly ping @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40270: [SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40270:
URL: https://github.com/apache/spark/pull/40270#discussion_r1125815690


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -781,3 +782,10 @@ message FrameMap {
   CommonInlineUserDefinedFunction func = 2;
 }
 
+message WithSequenceColumn {

Review Comment:
   Well, my argument against this is that it is just a project with a specific 
type of expression attached to it. There is no need to complicate the protocol; 
it is just that.
   
   As for what this does (and this comment is more aimed at the original PR), a 
few things, with a small sketch after this list:
   - The IDs are not stable at all. They are order-based, and we basically do 
not guarantee that the order is stable during processing.
   - The IDs can contain gaps or duplicates if any of the shuffles of the input 
contains a non-deterministic column and a retry of one of the input 
tasks/stages occurs. This is a result of the double scanning that 
RDD.zipWithIndex requires.
   - Finally, two scans can be slow.
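   
   A tiny illustration of the double scan (plain RDD API; the toy data is 
assumed):
   
   ```scala
   val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), 2)
   // zipWithIndex first runs an extra Spark job to count the elements of each
   // partition, then a second pass assigns indexes from per-partition offsets.
   val indexed = rdd.zipWithIndex()
   indexed.collect().foreach(println)  // (a,0) (b,1) (c,2) (d,3)
   ```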



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic opened a new pull request, #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.

2023-03-05 Thread via GitHub


itholic opened a new pull request, #40288:
URL: https://github.com/apache/spark/pull/40288

   ### What changes were proposed in this pull request?
   
   This PR proposes to add a brief description of Spark Connect to the PySpark 
main page.
   
   https://user-images.githubusercontent.com/44108233/223006571-42fccf6f-cb7b-479c-9f11-5c246b442fac.png
   
   
   ### Why are the changes needed?
   
   Spark Connect is a new and experimental feature of PySpark that enables 
Spark to run anywhere and work with different data stores and services. Adding 
a brief description of Spark Connect to the main page will inform users about 
this feature and how it can benefit their use cases. Additionally, the note 
about the experimental nature of this feature will help users understand the 
potential risks of using Spark Connect in production environments.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No for API usage, but this PR introduces user-facing documentation by adding 
a new section about Spark Connect to the PySpark main page.
   
   
   ### How was this patch tested?
   
   To ensure that the documentation builds correctly, the `make clean html` 
command was executed to test the build process.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40217:
URL: https://github.com/apache/spark/pull/40217#issuecomment-1455348717

   @panbingkun can you update your PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.

2023-03-05 Thread via GitHub


itholic commented on PR #40288:
URL: https://github.com/apache/spark/pull/40288#issuecomment-1455348864

   cc @tgravescs since this is a Spark Connect introduction, including a note 
about the built-in authentication you mentioned in the JIRA ticket before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40217:
URL: https://github.com/apache/spark/pull/40217#discussion_r1125825287


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionSuite.scala:
##
@@ -0,0 +1,377 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.connect.client.util.QueryTest
+
+class DataFrameNaFunctionSuite extends QueryTest {

Review Comment:
   Is this a line-for-line copy of the original test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`

2023-03-05 Thread via GitHub


beliefer commented on PR #39091:
URL: https://github.com/apache/spark/pull/39091#issuecomment-1455360592

   @hvanhovell @grundprinzip @HyukjinKwon @zhengruifeng @amaliujia Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1125835789


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -140,6 +141,21 @@ message Read {
 // (Optional) A list of path for file-system backed data sources.
 repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+// (Required) JDBC URL.
+string url = 1;
+
+// (Required) Name of the table in the external database.
+string table = 2;
+
+// (Optional) Condition in the where clause for each partition.
+repeated string predicates = 3;

Review Comment:
   Can we just put the predicates into the DataSource message?
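
   For reference, a minimal sketch of the existing Scala API this proto field 
mirrors - the `predicates` overload of `DataFrameReader.jdbc`, where each 
predicate becomes the WHERE clause of one partition's query (the JDBC URL and 
table name below are placeholders):

   ```scala
   import java.util.Properties

   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("jdbc-predicates").getOrCreate()

   // Each predicate defines one partition's WHERE clause, so this read
   // produces exactly three partitions.
   val predicates = Array(
     "id >= 0 AND id < 1000",
     "id >= 1000 AND id < 2000",
     "id >= 2000")

   val df = spark.read.jdbc(
     "jdbc:postgresql://localhost:5432/testdb",
     "events",
     predicates,
     new Properties())
   ```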



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1455366786

   @HyukjinKwon @zhengruifeng the rationale for this change is that the 
analyzer takes care of making lambda variables unique.
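
   For illustration, a minimal Scala sketch: nested higher-order functions can 
reuse the same lambda variable name because the analyzer rewrites the 
unresolved lambda variables into unique, resolved ones (the column name 
`values` is made up for this example):

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{col, transform}

   val spark = SparkSession.builder().appName("lambda-vars").getOrCreate()
   import spark.implicits._

   // Both lambdas use the variable name `x`; uniqueness is handled during
   // analysis, not at construction time.
   val df = Seq(Seq(Seq(1, 2), Seq(3))).toDF("values")
   df.select(transform(col("values"), x => transform(x, x => x + 1))).show()
   ```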


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-05 Thread via GitHub


beliefer commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1455392063

   > I guess we will need to rewrite the lambda function in the Spark Connect 
planner.
   
   Yeah.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build

2023-03-05 Thread via GitHub


pan3793 commented on code in PR #40283:
URL: https://github.com/apache/spark/pull/40283#discussion_r1125927078


##
build/mvn:
##
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
 local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then

Review Comment:
   why not use exact match here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40284: [SPARK-42674][BUILD] Upgrade scalafmt from 3.7.1 to 3.7.2

2023-03-05 Thread via GitHub


HyukjinKwon commented on PR #40284:
URL: https://github.com/apache/spark/pull/40284#issuecomment-1455270404

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on PR #40275:
URL: https://github.com/apache/spark/pull/40275#issuecomment-1455280706

   ping @HyukjinKwon @zhengruifeng @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-05 Thread via GitHub


beliefer commented on PR #40277:
URL: https://github.com/apache/spark/pull/40277#issuecomment-1455280396

   ping @hvanhovell @HyukjinKwon @dongjoon-hyun  cc @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #40262: [SPARK-42651][SQL] Optimize global sort to driver sort

2023-03-05 Thread via GitHub


ulysses-you commented on PR #40262:
URL: https://github.com/apache/spark/pull/40262#issuecomment-1455303198

   cc @cloud-fan @viirya thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-05 Thread via GitHub


mridulm commented on code in PR #40286:
URL: https://github.com/apache/spark/pull/40286#discussion_r1125790378


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2479,4 +2479,14 @@ package object config {
   .version("3.4.0")
   .booleanConf
   .createWithDefault(false)
+
+  private[spark] val STAGE_MAX_ATTEMPTS =
+ConfigBuilder("spark.stage.maxAttempts")
+  .doc("The max attempts for a stage, the spark job will be aborted if any 
of its stages is " +
+"resubmitted multiple times beyond the limitation. The value should be 
no less " +
+"than `spark.stage.maxConsecutiveAttempts` which defines the max 
attempts for " +
+"fetch failures.")
+  .version("3.5.0")
+  .intConf
+  .createWithDefault(16)

Review Comment:
   Since this is a behavior change, let us make this an optional parameter - and 
preserve current behavior when not configured.
   We can change this to a default value in a future release.
   
   Given cascading stage retries, particularly due to `INDETERMINATE` stage, 
this minimizes application failures.
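
   A minimal sketch of the optional variant, using the internal 
`ConfigBuilder` API with `createOptional` (doc text and version below are 
placeholders):

   ```scala
   // Inside org.apache.spark.internal.config: when the entry is unset the
   // lookup yields None, preserving today's unlimited-retry behavior.
   private[spark] val STAGE_MAX_ATTEMPTS =
     ConfigBuilder("spark.stage.maxAttempts")
       .doc("Max attempts for a stage before the job is aborted; when " +
         "unset, stages may be resubmitted without an upper bound.")
       .version("3.5.0")
       .intConf
       .createOptional
   ```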



##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -232,6 +232,13 @@ private[spark] class DAGScheduler(
 sc.getConf.getInt("spark.stage.maxConsecutiveAttempts",
   DAGScheduler.DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS)
 
+  /**
+   * Max stage attempts allowed before a stage is aborted.
+   */
+  private[scheduler] val maxStageAttempts: Int = {
+Math.max(maxConsecutiveStageAttempts, 
sc.getConf.get(config.STAGE_MAX_ATTEMPTS))

Review Comment:
   Modify this suitably to return an `Option`.
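
   Concretely, the accessor could look something like this (a sketch, assuming 
the config entry above becomes an `OptionalConfigEntry[Int]`):

   ```scala
   // None means "no limit"; when set, never allow the limit to drop below
   // spark.stage.maxConsecutiveAttempts.
   private[scheduler] val maxStageAttempts: Option[Int] =
     sc.getConf.get(config.STAGE_MAX_ATTEMPTS)
       .map(limit => math.max(limit, maxConsecutiveStageAttempts))
   ```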



##
core/src/main/scala/org/apache/spark/scheduler/Stage.scala:
##
@@ -70,6 +70,7 @@ private[scheduler] abstract class Stage(
 
   /** The ID to use for the next new attempt for this stage. */
   private var nextAttemptId: Int = 0
+  private[scheduler] def getNextAttemptId(): Int = nextAttemptId

Review Comment:
   `getNextAttemptId()` -> `getNextAttemptId`



##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -406,12 +406,13 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 blockManagerMaster = spy(new MyBlockManagerMaster(sc.getConf))
 doNothing().when(blockManagerMaster).updateRDDBlockVisibility(any(), any())
 scheduler = new MyDAGScheduler(
+scheduler = spy(new MyDAGScheduler(

Review Comment:
   This should cause a compilation failure - remove the previous line?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-05 Thread via GitHub


mridulm commented on code in PR #40286:
URL: https://github.com/apache/spark/pull/40286#discussion_r1125790750


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -232,6 +232,13 @@ private[spark] class DAGScheduler(
 sc.getConf.getInt("spark.stage.maxConsecutiveAttempts",
   DAGScheduler.DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS)
 
+  /**
+   * Max stage attempts allowed before a stage is aborted.
+   */
+  private[scheduler] val maxStageAttempts: Int = {
+Math.max(maxConsecutiveStageAttempts, 
sc.getConf.get(config.STAGE_MAX_ATTEMPTS))

Review Comment:
   Modify this suitably to return an `Option` if using the optional config.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-05 Thread via GitHub


mridulm commented on code in PR #40286:
URL: https://github.com/apache/spark/pull/40286#discussion_r1125790378


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2479,4 +2479,14 @@ package object config {
   .version("3.4.0")
   .booleanConf
   .createWithDefault(false)
+
+  private[spark] val STAGE_MAX_ATTEMPTS =
+ConfigBuilder("spark.stage.maxAttempts")
+  .doc("The max attempts for a stage, the spark job will be aborted if any 
of its stages is " +
+"resubmitted multiple times beyond the limitation. The value should be 
no less " +
+"than `spark.stage.maxConsecutiveAttempts` which defines the max 
attempts for " +
+"fetch failures.")
+  .version("3.5.0")
+  .intConf
+  .createWithDefault(16)

Review Comment:
   Since this is a behavior change, let us make this an optional parameter - 
and preserve current behavior when not configured (or make the default the max 
int value).
   We can change this to a more restrictive default value in a future release.
   
   Given cascading stage retries, particularly due to `INDETERMINATE` stages, 
and deployments where decommissioning is not applicable (YARN, for example), 
this minimizes application failures.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org





[GitHub] [spark] LuciferYang commented on pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40285:
URL: https://github.com/apache/spark/pull/40285#issuecomment-1455325164

   Thanks @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40255:
URL: https://github.com/apache/spark/pull/40255#issuecomment-1455324716

   Thanks @hvanhovell @HyukjinKwon @zhengruifeng @amaliujia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen closed pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17

2023-03-05 Thread via GitHub


srowen closed pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard 
metrics 4.2.17
URL: https://github.com/apache/spark/pull/40254


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17

2023-03-05 Thread via GitHub


srowen commented on PR #40254:
URL: https://github.com/apache/spark/pull/40254#issuecomment-1455341403

   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17

2023-03-05 Thread via GitHub


LuciferYang commented on PR #40254:
URL: https://github.com/apache/spark/pull/40254#issuecomment-1455349598

   Thanks @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect

2023-03-05 Thread via GitHub


hvanhovell commented on PR #40228:
URL: https://github.com/apache/spark/pull/40228#issuecomment-1455352755

   @amaliujia can you update the PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions

2023-03-05 Thread via GitHub


beliefer commented on PR #40275:
URL: https://github.com/apache/spark/pull/40275#issuecomment-1455357573

   @hvanhovell @LuciferYang Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


