[GitHub] [spark] LuciferYang commented on a diff in pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel

2023-03-21 Thread via GitHub


LuciferYang commented on code in PR #40510:
URL: https://github.com/apache/spark/pull/40510#discussion_r1144269110


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -54,6 +54,20 @@ message UserContext {
   repeated google.protobuf.Any extensions = 999;
 }
 
+// StorageLevel for persisting Datasets/Tables.
+message StorageLevel {

Review Comment:
   friendly ping @ueshin @HyukjinKwon 
   
   When I try to add `import "spark/connect/base.proto";` to `catalog.proto` 
to reuse `StorageLevel` in `message CacheTable`, I get the following compile 
errors:
   
   ```
   spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto
   spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors.
   spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto".  To use it here, please add the necessary import.
   spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto".  To use it here, please add the necessary import.
   spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors.
   spark/connect/relations.proto:84:5: "Catalog" is not defined.
   spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors.
   spark/connect/commands.proto:63:3: "Relation" is not defined.
   spark/connect/commands.proto:81:3: "Relation" is not defined.
   spark/connect/commands.proto:142:3: "Relation" is not defined.
   spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors.
   spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors.
   
   ```
   
   So should we move `StorageLevel` to a more suitable place?
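
   The cycle `protoc` reports above can be reproduced with a small dependency-graph sketch (file names shortened from the `spark/connect/` paths; `common.proto` is a hypothetical new home for `StorageLevel`, not an existing file):

   ```python
   def find_cycle(graph, start):
       """Return the first import cycle reachable from `start`, or None."""
       path, seen = [], set()

       def dfs(node):
           if node in path:
               return path[path.index(node):] + [node]  # closed the loop
           if node in seen:
               return None
           seen.add(node)
           path.append(node)
           for dep in graph.get(node, []):
               cycle = dfs(dep)
               if cycle:
                   return cycle
           path.pop()
           return None

       return dfs(start)

   # Import edges as they stood once catalog.proto imported base.proto:
   imports = {
       "base.proto": ["commands.proto", "relations.proto"],
       "commands.proto": ["relations.proto"],
       "relations.proto": ["catalog.proto"],
       "catalog.proto": ["base.proto"],  # the new import that closes the loop
   }
   print(find_cycle(imports, "base.proto"))
   # ['base.proto', 'commands.proto', 'relations.proto', 'catalog.proto', 'base.proto']

   # Pointing catalog.proto at a leaf file instead removes the cycle:
   imports["catalog.proto"] = ["common.proto"]
   print(find_cycle(imports, "base.proto"))  # -> None
   ```

   A file that imports nothing from the cycle (a leaf such as the hypothetical `common.proto`) can hold shared messages like `StorageLevel` without re-introducing the loop.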
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client

2023-03-21 Thread via GitHub


LuciferYang commented on code in PR #40516:
URL: https://github.com/apache/spark/pull/40516#discussion_r1144250578


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2771,22 +2771,86 @@ class Dataset[T] private[sql] (
 new DataFrameWriterV2[T](table, this)
   }
 
+  /**
+   * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
+   *
+   * @group basic
+   * @since 3.4.0

Review Comment:
   Make `@since 3.4.0` first
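
   For reference, the `MEMORY_AND_DISK` default in the Scaladoc above is a combination of boolean flags. A minimal Python mirror of those flags (field names assumed from PySpark's `StorageLevel`; this is an illustrative model, not the actual Connect message):

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class StorageLevel:
       use_disk: bool
       use_memory: bool
       use_off_heap: bool = False
       deserialized: bool = False
       replication: int = 1

   # A few common levels, mirroring PySpark's predefined constants:
   DISK_ONLY = StorageLevel(use_disk=True, use_memory=False)
   MEMORY_ONLY = StorageLevel(use_disk=False, use_memory=True)
   MEMORY_AND_DISK = StorageLevel(use_disk=True, use_memory=True)  # persist()/cache() default

   # The default spills to disk when memory runs out, hence both flags set:
   assert MEMORY_AND_DISK.use_disk and MEMORY_AND_DISK.use_memory
   ```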






[GitHub] [spark] LuciferYang commented on pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client

2023-03-21 Thread via GitHub


LuciferYang commented on PR #40516:
URL: https://github.com/apache/spark/pull/40516#issuecomment-1478944069

   cc @HyukjinKwon @hvanhovell FYI





[GitHub] [spark] LuciferYang opened a new pull request, #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Scala connect client

2023-03-21 Thread via GitHub


LuciferYang opened a new pull request, #40516:
URL: https://github.com/apache/spark/pull/40516

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] ritikam2 commented on pull request #40116: [SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect

2023-03-21 Thread via GitHub


ritikam2 commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1478932824

   Perhaps the following would be a better solution: instead of looking for a star, 
any `UnresolvedFunction` should get an `UnresolvedAlias`. Any comments?
   
   ```scala
   private[this] def alias(expr: Expression): NamedExpression = expr match {
     case expr: NamedExpression => expr
     case a: AggregateExpression
         if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
       UnresolvedAlias(a, Some(Column.generateAlias))
     case expr: Expression =>
       if (expr.isInstanceOf[UnresolvedFunction]) {
         UnresolvedAlias(expr, None)
       } else {
         Alias(expr, toPrettySQL(expr))()
       }
   }
   ```





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478924572

   @mridulm 
   Yep, it's me.
   Username: StoveM
   Full name: Fencheng Mei




[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478924279

   > 
   
   Yep, it's me.
   Username: StoveM
   Full name: Fencheng Mei





[GitHub] [spark] srowen commented on pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax

2023-03-21 Thread via GitHub


srowen commented on PR #40504:
URL: https://github.com/apache/spark/pull/40504#issuecomment-1478906144

   Merged to master





[GitHub] [spark] srowen closed pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax

2023-03-21 Thread via GitHub


srowen closed pull request #40504: [SPARK-42880][DOCS] Update 
running-on-yarn.md to log4j2 syntax
URL: https://github.com/apache/spark/pull/40504





[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


mridulm commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478905577

   Is your jira id `StoveM` @Stove-hust  ?





[GitHub] [spark] cloud-fan commented on a diff in pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType

2023-03-21 Thread via GitHub


cloud-fan commented on code in PR #40512:
URL: https://github.com/apache/spark/pull/40512#discussion_r1144213535


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala:
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.types
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object DataTypeUtils {
+  /**
+   * Check if `this` and `other` are the same data type when ignoring nullability
+   * (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`).
+   */
+  def sameType(left: DataType, right: DataType): Boolean =
+    if (SQLConf.get.caseSensitiveAnalysis) {
+      equalsIgnoreNullability(left, right)
+    } else {
+      equalsIgnoreCaseAndNullability(left, right)
+    }
+
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, StructType.
+   */
+  def equalsIgnoreNullability(left: DataType, right: DataType): Boolean = {

Review Comment:
   nvm, it was a static method before this PR.
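
   The body of `equalsIgnoreNullability` is elided in the hunk above; the idea is a structural recursion that drops the nullability flags. A hedged Python sketch of that recursion, using a simplified type model rather than Catalyst's actual classes:

   ```python
   from dataclasses import dataclass, field

   @dataclass
   class ArrayType:
       element: object
       contains_null: bool = True

   @dataclass
   class MapType:
       key: object
       value: object
       value_contains_null: bool = True

   @dataclass
   class StructField:
       name: str
       dtype: object
       nullable: bool = True

   @dataclass
   class StructType:
       fields: list = field(default_factory=list)

   def equals_ignore_nullability(left, right):
       """Structural equality that ignores containsNull/valueContainsNull/nullable."""
       if isinstance(left, ArrayType) and isinstance(right, ArrayType):
           return equals_ignore_nullability(left.element, right.element)
       if isinstance(left, MapType) and isinstance(right, MapType):
           return (equals_ignore_nullability(left.key, right.key)
                   and equals_ignore_nullability(left.value, right.value))
       if isinstance(left, StructType) and isinstance(right, StructType):
           return (len(left.fields) == len(right.fields)
                   and all(l.name == r.name
                           and equals_ignore_nullability(l.dtype, r.dtype)
                           for l, r in zip(left.fields, right.fields)))
       return left == right  # atomic types: plain equality

   # Same element type, different containsNull: still "the same type".
   assert equals_ignore_nullability(
       ArrayType("int", contains_null=True),
       ArrayType("int", contains_null=False))
   ```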






[GitHub] [spark] cloud-fan commented on a diff in pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType

2023-03-21 Thread via GitHub


cloud-fan commented on code in PR #40512:
URL: https://github.com/apache/spark/pull/40512#discussion_r1144213026


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala:
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.types
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object DataTypeUtils {
+  /**
+   * Check if `this` and `other` are the same data type when ignoring nullability
+   * (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`).
+   */
+  def sameType(left: DataType, right: DataType): Boolean =
+    if (SQLConf.get.caseSensitiveAnalysis) {
+      equalsIgnoreNullability(left, right)
+    } else {
+      equalsIgnoreCaseAndNullability(left, right)
+    }
+
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, StructType.
+   */
+  def equalsIgnoreNullability(left: DataType, right: DataType): Boolean = {

Review Comment:
   does this method require any non-public APIs?






[GitHub] [spark] cloud-fan closed pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF

2023-03-21 Thread via GitHub


cloud-fan closed pull request #40397: [SPARK-42052][SQL] Codegen Support for 
HiveSimpleUDF
URL: https://github.com/apache/spark/pull/40397





[GitHub] [spark] cloud-fan commented on pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF

2023-03-21 Thread via GitHub


cloud-fan commented on PR #40397:
URL: https://github.com/apache/spark/pull/40397#issuecomment-1478896479

   thanks, merging to master!





[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-21 Thread via GitHub


beliefer commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1144208941


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: 
proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   I roughly understand what you mean. We don't need to follow SQL, but the output 
should reflect the hierarchy of the tree.






[GitHub] [spark] zhengruifeng commented on pull request #40514: [SPARK-41233][CONNECT][PYTHON] Add array_prepend to Spark Connect Python client

2023-03-21 Thread via GitHub


zhengruifeng commented on PR #40514:
URL: https://github.com/apache/spark/pull/40514#issuecomment-1478884594

   would you mind adding a simple test here? 
   
https://github.com/apache/spark/blob/149e020a5ca88b2db9c56a9d48e0c1c896b57069/python/pyspark/sql/tests/connect/test_connect_function.py#L1077-L1089
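
   For reference, a plain-Python model of the semantics such a test would pin down (a NULL array stays NULL). This is a sketch of the expected behavior only, not the Connect test harness:

   ```python
   def array_prepend(arr, elem):
       """Model of array_prepend: put `elem` at index 0; a NULL array stays NULL."""
       if arr is None:
           return None
       return [elem] + list(arr)

   assert array_prepend([2, 3], 1) == [1, 2, 3]
   assert array_prepend([], "x") == ["x"]
   assert array_prepend(None, 1) is None
   ```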





[GitHub] [spark] hvanhovell commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-21 Thread via GitHub


hvanhovell commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1144196759


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: 
proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   @beliefer no, we can't be fully consistent with SQL. We are already deviating 
from what the original implementation does when you set `extended = true`. I 
just want something that is more readable.
   
   As for the implementation: the catalyst dependency is going away. Just 
create a bit of tree-traversal logic that nicely renders the expression tree.
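
   The suggested traversal could be as small as this: a sketch that renders one node per line, indenting children to show the hierarchy (the `(name, children)` node shape is an assumption for illustration, not the actual `proto.Expression`):

   ```python
   def render(node, indent=0):
       """Render an expression tree, one node per line, children indented."""
       name, children = node
       lines = ["  " * indent + name]
       for child in children:
           lines.extend(render(child, indent + 1))
       return lines

   # (a + b) * 2, as nested (name, children) tuples:
   expr = ("multiply", [("add", [("col(a)", []), ("col(b)", [])]), ("lit(2)", [])])
   print("\n".join(render(expr)))
   # multiply
   #   add
   #     col(a)
   #     col(b)
   #   lit(2)
   ```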






[GitHub] [spark] hvanhovell closed pull request #40413: [SPARK-42786][Connect] Typed Select

2023-03-21 Thread via GitHub


hvanhovell closed pull request #40413: [SPARK-42786][Connect] Typed Select
URL: https://github.com/apache/spark/pull/40413





[GitHub] [spark] hvanhovell commented on pull request #40413: [SPARK-42786][Connect] Typed Select

2023-03-21 Thread via GitHub


hvanhovell commented on PR #40413:
URL: https://github.com/apache/spark/pull/40413#issuecomment-1478874829

   Merging to master!





[GitHub] [spark] hvanhovell commented on pull request #40368: [SPARK-42748][CONNECT] Server-side Artifact Management

2023-03-21 Thread via GitHub


hvanhovell commented on PR #40368:
URL: https://github.com/apache/spark/pull/40368#issuecomment-1478874581

   @vicennial can you update?





[GitHub] [spark] panbingkun commented on pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF

2023-03-21 Thread via GitHub


panbingkun commented on PR #40397:
URL: https://github.com/apache/spark/pull/40397#issuecomment-1478870787

   @cloud-fan Can we merge it to master? After that I will try to refactor 
HiveGenericUDTF & HiveUDAFFunction. Thanks!





[GitHub] [spark] hvanhovell opened a new pull request, #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration

2023-03-21 Thread via GitHub


hvanhovell opened a new pull request, #40515:
URL: https://github.com/apache/spark/pull/40515

   ### What changes were proposed in this pull request?
   This PR adds Ammonite REPL integration for Spark Connect. This has a couple 
of benefits:
   - It makes it a lot less cumbersome for users to start a Spark Connect REPL. 
You don't have to add custom scripts, and you can use `coursier` to launch a 
fully functional REPL for you.
   - It adds REPL integration to the actual build. This makes it easier to 
validate that the code we add actually works.
   
   
   ### Why are the changes needed?
   A REPL is arguably the first entry point for a lot of users.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes it adds REPL integration.
   
   ### How was this patch tested?
   Added tests for the command line parsing. Manually tested the REPL.
   





[GitHub] [spark] LuciferYang commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


LuciferYang commented on PR #40496:
URL: https://github.com/apache/spark/pull/40496#issuecomment-1478858139

   > @dtenedor Wait a few minutes for me to check with Scala 2.13 manually
   
   done, should be ok ~





[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-21 Thread via GitHub


beliefer commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1144182236


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: 
proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   @hvanhovell Do you mean to construct local catalyst expression tree at 
connect client side ?






[GitHub] [spark] Stove-hust commented on pull request #40412: [SPARK-42784] should still create subDir when the number of subDir in merge dir is less than conf

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40412:
URL: https://github.com/apache/spark/pull/40412#issuecomment-1478854808

   @zhouyejoe I kept the logs from the incident; they should help you. (In our 
cluster environment, with 11 HDDs, creating 11 * 64 subdirectories takes 
noticeably longer.)
   23/02/21 10:44:09 YarnAllocator: Launching container container_184 on host hostA
   23/02/21 10:44:21 YarnAllocator: Launching container container_185 on host hostA
   23/02/21 10:44:21 YarnAllocator: Container container_184 on host: hostA was preempted.
   





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478849048

   > I could not cherry pick this into 3.4 and 3.3 - we should fix for those 
branches as well IMO. Can you create a PR against those two branches as well 
@Stove-hust ? Thanks
   
   No problem





[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-21 Thread via GitHub


beliefer commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1144175741


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: 
proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   @hvanhovell If so, the output of `Column.explain` can't keep consistent with 
Spark SQL.






[GitHub] [spark] ueshin opened a new pull request, #40514: [SPARK-41233][CONNECT][PYTHON] Add array_prepend to Spark Connect Python client

2023-03-21 Thread via GitHub


ueshin opened a new pull request, #40514:
URL: https://github.com/apache/spark/pull/40514

   ### What changes were proposed in this pull request?
   
   This is a follow-up of #38947.
   
   Add `array_prepend` function to Spark Connect Python client.
   
   ### Why are the changes needed?
   
   `array_prepend` was added at #38947 without Spark Connect Python client.
   
   ### Does this PR introduce _any_ user-facing change?
   
   `array_prepend`  will be available in Spark Connect Python client.
   
   ### How was this patch tested?
   
   Enabled the related test.





[GitHub] [spark] LuciferYang commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


LuciferYang commented on PR #40496:
URL: https://github.com/apache/spark/pull/40496#issuecomment-147884

   @dtenedor Give me a few minutes to check with Scala 2.13 manually.





[GitHub] [spark] dtenedor commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


dtenedor commented on PR #40496:
URL: https://github.com/apache/spark/pull/40496#issuecomment-1478835534

   @HyukjinKwon the tests are passing now; this is ready to merge whenever you 
are ready :)





[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


mridulm commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478833979

   I could not cherry-pick this into 3.4 and 3.3 - we should fix those 
branches as well, IMO.
   Can you create PRs against those two branches as well, @Stove-hust? Thanks





[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


mridulm commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478833621

   Merged to master.
   Thanks for working on this @Stove-hust !
   Thanks for the review @otterc :-)





[GitHub] [spark] mridulm closed pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


mridulm closed pull request #40393: [SPARK-40082] Schedule mergeFinalize when 
push merge shuffleMapStage retry but no running tasks
URL: https://github.com/apache/spark/pull/40393





[GitHub] [spark] xinrong-meng opened a new pull request, #40513: Block Arrow-optimized Python UDFs

2023-03-21 Thread via GitHub


xinrong-meng opened a new pull request, #40513:
URL: https://github.com/apache/spark/pull/40513

   ### What changes were proposed in this pull request?
   Block the usage of Arrow-optimized Python UDFs in Apache Spark 3.4.0. 
   
   ### Why are the changes needed?
   Considering the upcoming improvements to resolve the result inconsistencies 
between traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd 
better block the feature for now; otherwise, users who try it out will face 
behavior changes in the next release.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Arrow-optimized Python UDFs are blocked.
   
   ### How was this patch tested?
   Existing tests.
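The blocking described above amounts to a feature gate. A minimal sketch of the pattern follows (illustrative only: the config key, error type, and messages here are hypothetical, not Spark's actual implementation):

```python
class FeatureBlockedError(RuntimeError):
    """Raised when a deliberately disabled experimental feature is requested."""


def create_udf(func, conf):
    # Illustrative gate: refuse to build the UDF when the experimental
    # flag is enabled, telling the user why.
    if conf.get("example.arrow.udf.enabled", "false").lower() == "true":
        raise FeatureBlockedError(
            "Arrow-optimized Python UDFs are blocked in this release "
            "pending fixes for result inconsistencies."
        )
    return func


add_one = create_udf(lambda x: x + 1, {"example.arrow.udf.enabled": "false"})
print(add_one(1))  # 2
```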





[GitHub] [spark] hvanhovell commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-21 Thread via GitHub


hvanhovell commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1144160497


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: 
proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   I am not sure what happened here; my comment seems to have gone missing... 
Anyway, please don't send an RPC for this. Just use the local expression tree 
and create something that is a little bit easier to read.
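Rendering the local expression tree, as suggested, might look roughly like this (a hypothetical Python sketch for illustration; the actual client is Scala and walks protobuf expressions):

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    """Toy stand-in for a client-side expression node."""
    name: str
    children: list = field(default_factory=list)


def explain(expr, indent=0):
    """Render an expression tree as indented text, entirely locally,
    with no server round-trip."""
    lines = ["  " * indent + expr.name]
    for child in expr.children:
        lines.extend(explain(child, indent + 1))
    return lines


tree = Expr("add", [Expr("col(a)"), Expr("lit(1)")])
print("\n".join(explain(tree)))
# add
#   col(a)
#   lit(1)
```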






[GitHub] [spark] LuciferYang commented on pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7

2023-03-21 Thread via GitHub


LuciferYang commented on PR #40489:
URL: https://github.com/apache/spark/pull/40489#issuecomment-1478824499

   Thanks @srowen @HyukjinKwon 





[GitHub] [spark] LuciferYang commented on pull request #40490: [SPARK-42536][BUILD] Upgrade log4j2 to 2.20.0

2023-03-21 Thread via GitHub


LuciferYang commented on PR #40490:
URL: https://github.com/apache/spark/pull/40490#issuecomment-1478821629

   Thanks @dongjoon-hyun @viirya ~





[GitHub] [spark] frankliee commented on a diff in pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax

2023-03-21 Thread via GitHub


frankliee commented on code in PR #40504:
URL: https://github.com/apache/spark/pull/40504#discussion_r1144154895


##
docs/running-on-yarn.md:
##
@@ -137,7 +137,7 @@ Note that for the first option, both executors and the 
application master will s
 log4j configuration, which may cause issues when they run on the same node 
(e.g. trying to write
 to the same log file).
 
-If you need a reference to the proper location to put log files in the YARN so 
that YARN can properly display and aggregate them, use 
`spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, 
`log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`.
 For streaming applications, configuring `RollingFileAppender` and setting file 
location to YARN's log directory will avoid disk overflow caused by large log 
files, and logs can be accessed using YARN's log utility.
+If you need a reference to the proper location to put log files in the YARN so 
that YARN can properly display and aggregate them, use 
`spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, 
`appender.spark.fileName=${sys:spark.yarn.app.container.log.dir}/spark.log`. 
For streaming applications, configuring `RollingFileAppender` and setting file 
location to YARN's log directory will avoid disk overflow caused by large log 
files, and logs can be accessed using YARN's log utility.

Review Comment:
   I have changed back to `file_appender`.
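For reference, a fuller log4j2 properties fragment along these lines might look as follows (an illustrative sketch; the appender name, pattern, and rolling policy are examples to adapt, not text from the PR):

```properties
# Illustrative log4j2 appender writing into YARN's container log directory
appender.file_appender.type = RollingFile
appender.file_appender.name = file_appender
appender.file_appender.fileName = ${sys:spark.yarn.app.container.log.dir}/spark.log
appender.file_appender.filePattern = ${sys:spark.yarn.app.container.log.dir}/spark.log.%i
appender.file_appender.layout.type = PatternLayout
appender.file_appender.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
appender.file_appender.policies.type = Policies
appender.file_appender.policies.size.type = SizeBasedTriggeringPolicy
appender.file_appender.policies.size.size = 100MB

rootLogger.level = info
rootLogger.appenderRef.file.ref = file_appender
```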






[GitHub] [spark] frankliee commented on a diff in pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax

2023-03-21 Thread via GitHub


frankliee commented on code in PR #40504:
URL: https://github.com/apache/spark/pull/40504#discussion_r1144154348


##
docs/running-on-yarn.md:
##
@@ -137,7 +137,7 @@ Note that for the first option, both executors and the 
application master will s
 log4j configuration, which may cause issues when they run on the same node 
(e.g. trying to write
 to the same log file).
 
-If you need a reference to the proper location to put log files in the YARN so 
that YARN can properly display and aggregate them, use 
`spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, 
`log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`.
 For streaming applications, configuring `RollingFileAppender` and setting file 
location to YARN's log directory will avoid disk overflow caused by large log 
files, and logs can be accessed using YARN's log utility.
+If you need a reference to the proper location to put log files in the YARN so 
that YARN can properly display and aggregate them, use 
`spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, 
`appender.spark.fileName=${sys:spark.yarn.app.container.log.dir}/spark.log`. 
For streaming applications, configuring `RollingFileAppender` and setting file 
location to YARN's log directory will avoid disk overflow caused by large log 
files, and logs can be accessed using YARN's log utility.

Review Comment:
   updated






[GitHub] [spark] amaliujia commented on pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType

2023-03-21 Thread via GitHub


amaliujia commented on PR #40512:
URL: https://github.com/apache/spark/pull/40512#issuecomment-1478801856

   @hvanhovell @cloud-fan 





[GitHub] [spark] amaliujia opened a new pull request, #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType

2023-03-21 Thread via GitHub


amaliujia opened a new pull request, #40512:
URL: https://github.com/apache/spark/pull/40512

   
   
   ### What changes were proposed in this pull request?
   
   This PR moves the following methods from `DataType`:
   1. equalsIgnoreNullability
   2. sameType
   3. equalsIgnoreCaseAndNullability
   
   The moved methods are grouped together in a util class.
   
   ### Why are the changes needed?
   
   To make `DataType` a simpler interface, non-public methods can be moved 
outside of the `DataType` class.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, as the moved methods are private within Spark.
   
   ### How was this patch tested?
   
   Existing UT.
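The `sameType` semantics being relocated — structural equality that ignores nullability — can live outside the type class itself, as this sketch shows (a hypothetical Python analogue of the idea, not the Scala code in this PR):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerType:
    """Toy leaf type with no nullability of its own."""


@dataclass(frozen=True)
class ArrayType:
    element_type: object
    contains_null: bool


def as_nullable(dt):
    """Normalize nullability flags so comparison can ignore them."""
    if isinstance(dt, ArrayType):
        return ArrayType(as_nullable(dt.element_type), True)
    return dt


def same_type(a, b):
    # sameType semantics: types are equal up to nullability
    return as_nullable(a) == as_nullable(b)


print(same_type(ArrayType(IntegerType(), False),
                ArrayType(IntegerType(), True)))  # True
```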





[GitHub] [spark] cloud-fan closed pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-21 Thread via GitHub


cloud-fan closed pull request #40385: [SPARK-42753] ReusedExchange refers to 
non-existent nodes
URL: https://github.com/apache/spark/pull/40385





[GitHub] [spark] cloud-fan commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-21 Thread via GitHub


cloud-fan commented on PR #40385:
URL: https://github.com/apache/spark/pull/40385#issuecomment-1478798727

   thanks, merging to master/3.4!





[GitHub] [spark] HyukjinKwon commented on pull request #40034: [SPARK-42447][INFRA] Remove Hadoop 2 GitHub Action job

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40034:
URL: https://github.com/apache/spark/pull/40034#issuecomment-1478796828

   I remember the Guava upgrade is also blocked by Hive, IIRC.





[GitHub] [spark] HyukjinKwon closed pull request #40494: [MINOR][DOCS] Fix typos

2023-03-21 Thread via GitHub


HyukjinKwon closed pull request #40494: [MINOR][DOCS] Fix typos 
URL: https://github.com/apache/spark/pull/40494





[GitHub] [spark] HyukjinKwon commented on pull request #40494: [MINOR][DOCS] Fix typos

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40494:
URL: https://github.com/apache/spark/pull/40494#issuecomment-1478792526

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel

2023-03-21 Thread via GitHub


HyukjinKwon closed pull request #40510: [SPARK-42889][CONNECT][PYTHON] 
Implement cache, persist, unpersist, and storageLevel
URL: https://github.com/apache/spark/pull/40510





[GitHub] [spark] HyukjinKwon commented on pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40510:
URL: https://github.com/apache/spark/pull/40510#issuecomment-1478790850

   Merged to master and branch-3.4.





[GitHub] [spark] dongjoon-hyun commented on pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB

2023-03-21 Thread via GitHub


dongjoon-hyun commented on PR #40447:
URL: https://github.com/apache/spark/pull/40447#issuecomment-1478787974

   Merged to master/branch-3.4. Thank you, @grundprinzip and all!





[GitHub] [spark] dongjoon-hyun closed pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB

2023-03-21 Thread via GitHub


dongjoon-hyun closed pull request #40447: [SPARK-42816][CONNECT] Support Max 
Message size up to 128MB
URL: https://github.com/apache/spark/pull/40447





[GitHub] [spark] sunchao commented on pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11

2023-03-21 Thread via GitHub


sunchao commented on PR #40511:
URL: https://github.com/apache/spark/pull/40511#issuecomment-1478787651

   LGTM too, thanks @cnauroth @dongjoon-hyun !





[GitHub] [spark] sudoliyang commented on pull request #40494: [MINOR][DOCS] Fix typos

2023-03-21 Thread via GitHub


sudoliyang commented on PR #40494:
URL: https://github.com/apache/spark/pull/40494#issuecomment-1478786416

   @HyukjinKwon Thanks. I ran the tests at 
https://github.com/sudoliyang/spark/pull/2.
   I would like to rebase and force-push again to run all tests here.





[GitHub] [spark] dongjoon-hyun commented on pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11

2023-03-21 Thread via GitHub


dongjoon-hyun commented on PR #40511:
URL: https://github.com/apache/spark/pull/40511#issuecomment-1478780794

   Merged to master/3.4 for Apache Spark 3.4.0. This will be part of the next 
Apache Spark 3.4.0 RC.
   I added you to the Apache Spark contributor group and assigned SPARK-42888 
to you.
   Welcome to the Apache Spark community, @cnauroth .
   
   Also, cc @sunchao 





[GitHub] [spark] dongjoon-hyun closed pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11

2023-03-21 Thread via GitHub


dongjoon-hyun closed pull request #40511: [SPARK-42888][BUILD] Upgrade 
`gcs-connector` to 2.2.11
URL: https://github.com/apache/spark/pull/40511





[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-21 Thread via GitHub


zhengruifeng commented on PR #40263:
URL: https://github.com/apache/spark/pull/40263#issuecomment-1478764552

   @srowen sounds reasonable





[GitHub] [spark] HyukjinKwon closed pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7

2023-03-21 Thread via GitHub


HyukjinKwon closed pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 
2.0.7
URL: https://github.com/apache/spark/pull/40489





[GitHub] [spark] HyukjinKwon commented on pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40489:
URL: https://github.com/apache/spark/pull/40489#issuecomment-1478757952

   Merged to master.





[GitHub] [spark] HyukjinKwon commented on pull request #40468: [SPARK-42838][SQL] changed error class name _LEGACY_ERROR_TEMP_2000

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40468:
URL: https://github.com/apache/spark/pull/40468#issuecomment-1478757584

   Mind filling the PR description as well?





[GitHub] [spark] HyukjinKwon closed pull request #40507: [SPARK-42662][CONNECT][PS] Add proto message for pandas API on Spark default index

2023-03-21 Thread via GitHub


HyukjinKwon closed pull request #40507: [SPARK-42662][CONNECT][PS] Add proto 
message for pandas API on Spark default index
URL: https://github.com/apache/spark/pull/40507





[GitHub] [spark] HyukjinKwon closed pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example

2023-03-21 Thread via GitHub


HyukjinKwon closed pull request #40505: [MINOR][DOCS] Remove SparkSession 
constructor invocation in the example
URL: https://github.com/apache/spark/pull/40505





[GitHub] [spark] HyukjinKwon commented on pull request #40507: [SPARK-42662][CONNECT][PS] Add proto message for pandas API on Spark default index

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40507:
URL: https://github.com/apache/spark/pull/40507#issuecomment-1478755845

   Merged to master.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example

2023-03-21 Thread via GitHub


HyukjinKwon commented on code in PR #40505:
URL: https://github.com/apache/spark/pull/40505#discussion_r1144108734


##
python/pyspark/sql/session.py:
##
@@ -179,10 +179,15 @@ class SparkSession(SparkConversionMixin):
 ... .getOrCreate()
 ... )
 
-Create a Spark session from a Spark context.
+Create a Spark session with Spark Connect.
 
->>> sc = spark.sparkContext
->>> spark = SparkSession(sc)
+>>> spark = (
+... SparkSession.builder
+... .remote("sc://localhost")

Review Comment:
   That's fine without a port - it uses the default port 15002.
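Defaulting the port when it is omitted from the connection string can be sketched like this (an illustrative sketch, not the actual Spark Connect client code; 15002 is the default port mentioned above):

```python
from urllib.parse import urlparse

DEFAULT_PORT = 15002  # Spark Connect default port


def endpoint(remote):
    """Resolve 'sc://host[:port]' to (host, port), applying the default
    when no port is given."""
    parsed = urlparse(remote)
    return parsed.hostname, parsed.port or DEFAULT_PORT


print(endpoint("sc://localhost"))      # ('localhost', 15002)
print(endpoint("sc://localhost:443"))  # ('localhost', 443)
```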






[GitHub] [spark] HyukjinKwon commented on pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example

2023-03-21 Thread via GitHub


HyukjinKwon commented on PR #40505:
URL: https://github.com/apache/spark/pull/40505#issuecomment-1478755239

   Merged to master and branch-3.4.





[GitHub] [spark] github-actions[bot] closed pull request #38608: [SPARK-41080][SQL] Support Bit manipulation function SETBIT

2023-03-21 Thread via GitHub


github-actions[bot] closed pull request #38608: [SPARK-41080][SQL] Support Bit 
manipulation function SETBIT
URL: https://github.com/apache/spark/pull/38608





[GitHub] [spark] github-actions[bot] closed pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

2023-03-21 Thread via GitHub


github-actions[bot] closed pull request #38534: [SPARK-38505][SQL] Make partial 
aggregation adaptive
URL: https://github.com/apache/spark/pull/38534





[GitHub] [spark] zhouyejoe commented on pull request #40412: [SPARK-42784] should still create subDir when the number of subDir in merge dir is less than conf

2023-03-21 Thread via GitHub


zhouyejoe commented on PR #40412:
URL: https://github.com/apache/spark/pull/40412#issuecomment-1478751685

   @Stove-hust I think the changes will help resolve the issue described in the 
ticket. I am checking further into what could be causing the race condition: 
two Executor containers start at almost the same time on the same node, and 
both enter this code section and try to create subdirectories at the same time.





[GitHub] [spark] cnauroth commented on pull request #40511: [SPARK-42888][BUILD] Upgrade GCS connector from 2.2.7 to 2.2.11.

2023-03-21 Thread via GitHub


cnauroth commented on PR #40511:
URL: https://github.com/apache/spark/pull/40511#issuecomment-1478711103

   @dongjoon-hyun , may I ask for your review, since you did the original 
import of the GCS connector in 
[SPARK-33605](https://issues.apache.org/jira/browse/SPARK-33605)/#37745? Thank 
you!





[GitHub] [spark] cnauroth opened a new pull request, #40511: [SPARK-42888][BUILD] Upgrade GCS connector from 2.2.7 to 2.2.11.

2023-03-21 Thread via GitHub


cnauroth opened a new pull request, #40511:
URL: https://github.com/apache/spark/pull/40511

   ### What changes were proposed in this pull request?
   
   Upgrade the [GCS 
Connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/v2.2.11/gcs)
 bundled in the Spark distro from version 2.2.7 to 2.2.11.
   
   ### Why are the changes needed?
   
   The new release contains multiple bug fixes and enhancements discussed in 
the [Release 
Notes](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.11/gcs/CHANGES.md).
 Notable changes include:
   * Improved socket timeout handling.
   * Trace logging capabilities.
   * Fix bug that prevented usage of GCS as a [Hadoop Credential 
Provider](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
   * Dependency upgrades.
   * Support OAuth2 based client authentication.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Distributions built with `-Phadoop-cloud` now include GCS connector 2.2.11 
instead of 2.2.7.
   
   ```
   cnauroth@cnauroth-2-1-m:~/spark-3.5.0-SNAPSHOT-bin-custom-spark$ ls -lrt 
jars/gcs*
   -rw-r--r-- 1 cnauroth cnauroth 36497606 Mar 21 00:42 
jars/gcs-connector-hadoop3-2.2.11-shaded.jar
   ```
   
   ### How was this patch tested?
   
   **Build**
   
   I built a custom distro with `-Phadoop-cloud`:
   
   ```
   ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-3 
-Phadoop-cloud -Pscala-2.12
   ```
   
   **Run**
   
   I ran a PySpark job that successfully reads and writes using GCS:
   
   ```
   from pyspark.sql import SparkSession

   def main() -> None:
       # Create SparkSession.
       spark = (SparkSession.builder
                .appName('copy-shakespeare')
                .getOrCreate())

       # Read.
       df = spark.read.text('gs://dataproc-datasets-us-central1/shakespeare')

       # Write.
       df.write.text('gs://cnauroth-hive-metastore-proxy-dist/output/copy-shakespeare')

       spark.stop()

   if __name__ == '__main__':
       main()
   ```
   
   Authored-by: Chris Nauroth 
   





[GitHub] [spark] ueshin commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-21 Thread via GitHub


ueshin commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1144066905


##
python/pyspark/sql/connect/plan.py:
##
@@ -302,13 +302,16 @@ def plan(self, session: "SparkConnectClient") -> 
proto.Relation:
 
 
 class Read(LogicalPlan):
-def __init__(self, table_name: str) -> None:
+def __init__(self, table_name: str, options: Dict[str, str] = {}) -> None:

Review Comment:
   Please avoid `{}` as a default argument, which could cause unexpected 
failures.
   
   ```py
   def __init__(self, table_name: str, options: Optional[Dict[str, str]] = 
None) -> None:
   ...
   self.options = options or {}
   ```
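   To make the hazard concrete, here is a small self-contained sketch (a hypothetical class, not the actual `Read` plan) showing why a mutable default argument leaks state between calls:

   ```python
   # The default dict is created once, at function definition time, and is
   # shared by every call that omits `options`.
   class Read:
       def __init__(self, table_name, options={}):  # anti-pattern
           self.table_name = table_name
           self.options = options

   a = Read("t1")
   a.options["header"] = "true"  # mutates the shared default dict
   b = Read("t2")
   print(b.options)  # {'header': 'true'} -- leaked from `a`

   # The suggested fix: default to None and materialize a fresh dict.
   class SafeRead:
       def __init__(self, table_name, options=None):
           self.table_name = table_name
           self.options = options or {}

   c = SafeRead("t1")
   c.options["header"] = "true"
   d = SafeRead("t2")
   print(d.options)  # {} -- each instance gets its own dict
   ```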







[GitHub] [spark] ueshin commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-21 Thread via GitHub


ueshin commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1144066373


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -122,6 +122,9 @@ message Read {
   message NamedTable {
 // (Required) Unparsed identifier for the table.
 string unparsed_identifier = 1;
+
+// Options for the named table. The map key is case insensitive.
+map<string, string> options = 3;

Review Comment:
   nit: Why is this id `3`? seems like `2` is still available?



##
python/pyspark/sql/connect/plan.py:
##
@@ -302,13 +302,16 @@ def plan(self, session: "SparkConnectClient") -> 
proto.Relation:
 
 
 class Read(LogicalPlan):
-def __init__(self, table_name: str) -> None:
+def __init__(self, table_name: str, options: Dict[str, str] = {}) -> None:

Review Comment:
   Please avoid `{}` as a default argument, which could cause unexpected 
behaviors.






[GitHub] [spark] ueshin opened a new pull request, #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel

2023-03-21 Thread via GitHub


ueshin opened a new pull request, #40510:
URL: https://github.com/apache/spark/pull/40510

   ### What changes were proposed in this pull request?
   
   Implements `DataFrame.cache`, `persist`, `unpersist`, and `storageLevel`.
   
   ### Why are the changes needed?
   
   Missing APIs.
   
   ### Does this PR introduce _any_ user-facing change?
   
   `DataFrame.cache`, `persist`, `unpersist`, and `storageLevel` will be 
available.
   
   ### How was this patch tested?
   
   Added/enabled the related tests.
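   For context, a storage level is essentially a small value object over five fields. The sketch below is a hedged, dependency-free illustration of that shape; the field names follow pyspark's `StorageLevel`, but the predefined levels here are examples, not guaranteed to match pyspark's constants exactly:

   ```python
   from dataclasses import dataclass

   # Illustrative value object for a storage level, as used by
   # DataFrame.persist(...) and reported by DataFrame.storageLevel.
   @dataclass(frozen=True)
   class StorageLevel:
       use_disk: bool
       use_memory: bool
       use_off_heap: bool
       deserialized: bool
       replication: int = 1

   # Example levels (assumed for illustration).
   DISK_ONLY = StorageLevel(True, False, False, False)
   MEMORY_AND_DISK = StorageLevel(True, True, False, True)

   print(DISK_ONLY)
   print(MEMORY_AND_DISK)
   ```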





[GitHub] [spark] amaliujia commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


amaliujia commented on PR #40496:
URL: https://github.com/apache/spark/pull/40496#issuecomment-1478658342

   LGTM





[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


dtenedor commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1144040124


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:
##
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
*/
   protected def getNormalizedQueryAnalysisResult(
   session: SparkSession, sql: String): (String, Seq[String]) = {
+// Note that creating the following DataFrame includes eager execution for 
commands that create
+// objects such as views. Therefore any following queries that reference 
these objects should
+// find them in the catalog.
 val df = session.sql(sql)
 val schema = df.schema.catalogString
-// Get the answer, but also get rid of the #1234 expression IDs that show 
up in analyzer plans.
-(schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+val analyzed = df.queryExecution.analyzed
+// Determine if the analyzed plan contains any nondeterministic 
expressions.
+var deterministic = true
+analyzed.transformAllExpressionsWithSubqueries {
+  case expr: CurrentDate =>
+deterministic = false
+expr
+  case expr: CurrentTimestampLike =>
+deterministic = false
+expr
+  case expr: CurrentUser =>
+deterministic = false
+expr
+  case expr: Literal if expr.dataType == DateType || expr.dataType == 
TimestampType =>
+deterministic = false
+expr
+  case expr if !expr.deterministic =>
+deterministic = false
+expr
+}
+if (deterministic) {
+  // Perform query analysis, but also get rid of the #1234 expression IDs 
that show up in the
+  // resolved plans.
+  (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+} else {
+  // The analyzed plan is nondeterministic so elide it from the result to 
keep tests reliable.
+  (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+}

Review Comment:
   Oh yeah it's only used in the SQL golden test runner. It's in 
`SQLQueryTestHelper.scala` :)






[GitHub] [spark] amaliujia commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


amaliujia commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1144038811


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:
##
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
*/
   protected def getNormalizedQueryAnalysisResult(
   session: SparkSession, sql: String): (String, Seq[String]) = {
+// Note that creating the following DataFrame includes eager execution for 
commands that create
+// objects such as views. Therefore any following queries that reference 
these objects should
+// find them in the catalog.
 val df = session.sql(sql)
 val schema = df.schema.catalogString
-// Get the answer, but also get rid of the #1234 expression IDs that show 
up in analyzer plans.
-(schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+val analyzed = df.queryExecution.analyzed
+// Determine if the analyzed plan contains any nondeterministic 
expressions.
+var deterministic = true
+analyzed.transformAllExpressionsWithSubqueries {
+  case expr: CurrentDate =>
+deterministic = false
+expr
+  case expr: CurrentTimestampLike =>
+deterministic = false
+expr
+  case expr: CurrentUser =>
+deterministic = false
+expr
+  case expr: Literal if expr.dataType == DateType || expr.dataType == 
TimestampType =>
+deterministic = false
+expr
+  case expr if !expr.deterministic =>
+deterministic = false
+expr
+}
+if (deterministic) {
+  // Perform query analysis, but also get rid of the #1234 expression IDs 
that show up in the
+  // resolved plans.
+  (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+} else {
+  // The analyzed plan is nondeterministic so elide it from the result to 
keep tests reliable.
+  (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+}

Review Comment:
   I was wondering whether this code is used outside of test environments, so I 
tried a code search; it seems to be used only by test suites.
   
   Basically, I was not sure whether, if this part were also used in 
production, the `if else` would actually change the output there.
   
   At least the code search shows this is not a concern.






[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


dtenedor commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1144023834


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:
##
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
*/
   protected def getNormalizedQueryAnalysisResult(
   session: SparkSession, sql: String): (String, Seq[String]) = {
+// Note that creating the following DataFrame includes eager execution for 
commands that create
+// objects such as views. Therefore any following queries that reference 
these objects should
+// find them in the catalog.
 val df = session.sql(sql)
 val schema = df.schema.catalogString
-// Get the answer, but also get rid of the #1234 expression IDs that show 
up in analyzer plans.
-(schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+val analyzed = df.queryExecution.analyzed
+// Determine if the analyzed plan contains any nondeterministic 
expressions.
+var deterministic = true
+analyzed.transformAllExpressionsWithSubqueries {
+  case expr: CurrentDate =>
+deterministic = false
+expr
+  case expr: CurrentTimestampLike =>
+deterministic = false
+expr
+  case expr: CurrentUser =>
+deterministic = false
+expr
+  case expr: Literal if expr.dataType == DateType || expr.dataType == 
TimestampType =>
+deterministic = false
+expr
+  case expr if !expr.deterministic =>
+deterministic = false
+expr
+}
+if (deterministic) {
+  // Perform query analysis, but also get rid of the #1234 expression IDs 
that show up in the
+  // resolved plans.
+  (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+} else {
+  // The analyzed plan is nondeterministic so elide it from the result to 
keep tests reliable.
+  (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+}

Review Comment:
   Hi @amaliujia I am a bit confused by your question, would you mind 
clarifying? This redaction applies only to the generated analyzer test result 
files, in order to prevent the tests from becoming flaky from things like 
`current_date()` changing each day. 
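   The redaction pattern itself is easy to sketch in plain Python. This is illustrative only; the real check in `SQLQueryTestHelper.scala` walks the analyzed plan's expressions rather than matching strings:

   ```python
   # Replace the whole analyzer output with a fixed placeholder whenever the
   # plan contains a nondeterministic expression, so golden files stay stable.
   NONDETERMINISTIC_MARKERS = (
       "current_date", "current_timestamp", "current_user", "rand(",
   )

   REDACTED = "[Analyzer test output redacted due to nondeterminism]"

   def normalize_analysis_output(plan: str) -> str:
       lowered = plan.lower()
       if any(marker in lowered for marker in NONDETERMINISTIC_MARKERS):
           return REDACTED
       return plan

   print(normalize_analysis_output("Project [a#x]\n+- OneRowRelation"))
   print(normalize_analysis_output("Project [current_date() AS d#x]\n+- OneRowRelation"))
   ```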






[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


dtenedor commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1144021973


##
sql/core/src/test/resources/sql-tests/analyzer-results/ansi/cast.sql.out:
##
@@ -0,0 +1,881 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+SELECT CAST('1.23' AS int)
+-- !query analysis
+Project [cast(1.23 as int) AS CAST(1.23 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1.23' AS long)
+-- !query analysis
+Project [cast(1.23 as bigint) AS CAST(1.23 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS int)
+-- !query analysis
+Project [cast(-4.56 as int) AS CAST(-4.56 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS long)
+-- !query analysis
+Project [cast(-4.56 as bigint) AS CAST(-4.56 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS int)
+-- !query analysis
+Project [cast(abc as int) AS CAST(abc AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS long)
+-- !query analysis
+Project [cast(abc as bigint) AS CAST(abc AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS float)
+-- !query analysis
+Project [cast(abc as float) AS CAST(abc AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS double)
+-- !query analysis
+Project [cast(abc as double) AS CAST(abc AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1234567890123' AS int)
+-- !query analysis
+Project [cast(1234567890123 as int) AS CAST(1234567890123 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('12345678901234567890123' AS long)
+-- !query analysis
+Project [cast(12345678901234567890123 as bigint) AS 
CAST(12345678901234567890123 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS int)
+-- !query analysis
+Project [cast( as int) AS CAST( AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS long)
+-- !query analysis
+Project [cast( as bigint) AS CAST( AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS float)
+-- !query analysis
+Project [cast( as float) AS CAST( AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS double)
+-- !query analysis
+Project [cast( as double) AS CAST( AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS int)
+-- !query analysis
+Project [cast(null as int) AS CAST(NULL AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS long)
+-- !query analysis
+Project [cast(null as bigint) AS CAST(NULL AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS int)
+-- !query analysis
+Project [cast(123.a as int) AS CAST(123.a AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS long)
+-- !query analysis
+Project [cast(123.a as bigint) AS CAST(123.a AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS float)
+-- !query analysis
+Project [cast(123.a as float) AS CAST(123.a AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS double)
+-- !query analysis
+Project [cast(123.a as double) AS CAST(123.a AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483648' AS int)
+-- !query analysis
+Project [cast(-2147483648 as int) AS CAST(-2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483649' AS int)
+-- !query analysis
+Project [cast(-2147483649 as int) AS CAST(-2147483649 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483647' AS int)
+-- !query analysis
+Project [cast(2147483647 as int) AS CAST(2147483647 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483648' AS int)
+-- !query analysis
+Project [cast(2147483648 as int) AS CAST(2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775808' AS long)
+-- !query analysis
+Project [cast(-9223372036854775808 as bigint) AS CAST(-9223372036854775808 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775809' AS long)
+-- !query analysis
+Project [cast(-9223372036854775809 as bigint) AS CAST(-9223372036854775809 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775807' AS long)
+-- !query analysis
+Project [cast(9223372036854775807 as bigint) AS CAST(9223372036854775807 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775808' AS long)
+-- !query analysis
+Project [cast(9223372036854775808 as bigint) AS CAST(9223372036854775808 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST('abc' AS binary))
+-- !query analysis
+Project [hex(cast(abc as binary)) AS hex(CAST(abc AS BINARY))#x]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST(CAST(123 AS byte) AS binary))
+-- !query analysis
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "DATATYPE_MISMATCH.CAST_WITH_CONF_SUGGESTION",
+  "sqlState" : "42K09",
+  "messageParameters" : {
+"config" : "\"spark.sql.ansi.enabled\"",
+"configVal" : "'false'",
+"sqlExpr" : "\"CAST(CAST(123 AS TINYINT) AS 

[GitHub] [spark] amaliujia commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-21 Thread via GitHub


amaliujia commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1144020709


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:
##
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
*/
   protected def getNormalizedQueryAnalysisResult(
   session: SparkSession, sql: String): (String, Seq[String]) = {
+// Note that creating the following DataFrame includes eager execution for 
commands that create
+// objects such as views. Therefore any following queries that reference 
these objects should
+// find them in the catalog.
 val df = session.sql(sql)
 val schema = df.schema.catalogString
-// Get the answer, but also get rid of the #1234 expression IDs that show 
up in analyzer plans.
-(schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+val analyzed = df.queryExecution.analyzed
+// Determine if the analyzed plan contains any nondeterministic 
expressions.
+var deterministic = true
+analyzed.transformAllExpressionsWithSubqueries {
+  case expr: CurrentDate =>
+deterministic = false
+expr
+  case expr: CurrentTimestampLike =>
+deterministic = false
+expr
+  case expr: CurrentUser =>
+deterministic = false
+expr
+  case expr: Literal if expr.dataType == DateType || expr.dataType == 
TimestampType =>
+deterministic = false
+expr
+  case expr if !expr.deterministic =>
+deterministic = false
+expr
+}
+if (deterministic) {
+  // Perform query analysis, but also get rid of the #1234 expression IDs 
that show up in the
+  // resolved plans.
+  (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+} else {
+  // The analyzed plan is nondeterministic so elide it from the result to 
keep tests reliable.
+  (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+}

Review Comment:
   Should this be configurable? Otherwise, every caller of this function will 
start to use the nondeterministic code path.






[GitHub] [spark] grundprinzip commented on pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto

2023-03-21 Thread via GitHub


grundprinzip commented on PR #39294:
URL: https://github.com/apache/spark/pull/39294#issuecomment-1478626808

   The current level of proto compatibility required is the strictest one, 
which requires full compatibility, even in the naming of the fields. Strictly 
speaking, this is not required for wire compatibility, but before we relax the 
check, I would like to enforce the strictest level.





[GitHub] [spark] amaliujia commented on pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto

2023-03-21 Thread via GitHub


amaliujia commented on PR #39294:
URL: https://github.com/apache/spark/pull/39294#issuecomment-1478618853

   > 
![SCR-20230321-v7f](https://user-images.githubusercontent.com/3421/226745165-60b1611a-2ec5-4ddf-98c0-61c3a180ed9c.png)
   > 
   > Verifying that the broken build is reported.
   
   cc @zhenlineo: as I recall, you thought about field renaming? It seems that 
is not allowed per the current design.





[GitHub] [spark] grundprinzip commented on a diff in pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto

2023-03-21 Thread via GitHub


grundprinzip commented on code in PR #39294:
URL: https://github.com/apache/spark/pull/39294#discussion_r1144010201


##
.github/workflows/build_and_test.yml:
##
@@ -493,6 +494,80 @@ jobs:
 name: test-results-sparkr--8-${{ inputs.hadoop }}-hive2.3
 path: "**/target/test-reports/*.xml"
 
+  proto-breaking-changes-check-master:
+needs: [precondition]
+if: always() && 
fromJson(needs.precondition.outputs.required).proto-breaking-changes-check == 
'true'
+name: Spark Connect proto backwards compatibility check (master)
+continue-on-error: true
+runs-on: ubuntu-22.04
+env:
+  LC_ALL: C.UTF-8
+  LANG: C.UTF-8
+  PYSPARK_DRIVER_PYTHON: python3.9
+  PYSPARK_PYTHON: python3.9
+steps:
+- name: Checkout Spark repository
+  uses: actions/checkout@v3
+  with:
+fetch-depth: 0
+repository: apache/spark
+ref: ${{ inputs.branch }}
+- name: Add GITHUB_WORKSPACE to git trust safe.directory
+  run: |
+git config --global --add safe.directory ${GITHUB_WORKSPACE}
+- name: Sync the current branch with the latest in Apache Spark
+  if: github.repository != 'apache/spark'
+  run: |
+git fetch https://github.com/$GITHUB_REPOSITORY.git 
${GITHUB_REF#refs/heads/}
+git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' merge --no-commit --progress --squash 
FETCH_HEAD
+git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
+# Install the `buf` CLI
+- uses: bufbuild/buf-setup-action@v1
+  with:
+github_token: ${{ secrets.GITHUB_TOKEN }}
+# Run buf breaking.
+- uses: bufbuild/buf-breaking-action@v1
+  with:
+input: connector/connect/common/src/main
+against: 
'https://github.com/apache/spark.git#branch=master,subdir=connector/connect/common/src/main'
+
+  proto-breaking-changes-check-release:
+needs: [precondition]
+if: always() && 
fromJson(needs.precondition.outputs.required).proto-breaking-changes-check == 
'true'
+name: Spark Connect proto backwards compatibility check (branch-3.4)
+runs-on: ubuntu-22.04
+env:
+  LC_ALL: C.UTF-8
+  LANG: C.UTF-8
+  PYSPARK_DRIVER_PYTHON: python3.9
+  PYSPARK_PYTHON: python3.9
+steps:
+- name: Checkout Spark repository
+  uses: actions/checkout@v3
+  with:
+fetch-depth: 0
+repository: apache/spark
+ref: ${{ inputs.branch }}
+- name: Add GITHUB_WORKSPACE to git trust safe.directory
+  run: |
+git config --global --add safe.directory ${GITHUB_WORKSPACE}
+- name: Sync the current branch with the latest in Apache Spark
+  if: github.repository != 'apache/spark'
+  run: |
+git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
+git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
+# Install the `buf` CLI
+- uses: bufbuild/buf-setup-action@v1
+  with:
+github_token: ${{ secrets.GITHUB_TOKEN }}
+# Run buf breaking.
+- uses: bufbuild/buf-breaking-action@v1
+  with:
+input: connector/connect/common/src/main
+against: 'https://github.com/apache/spark.git#branch=branch-3.4,subdir=connector/connect/common/src/main'

Review Comment:
   This job runs against the 3.4 release branch; it has to be updated to point at the previous release branch on every release.
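
   For reference, the same compatibility check can be reproduced locally with the `buf` CLI (assuming it is installed and the machine can reach github.com); the input and `--against` arguments mirror the workflow step above:

   ```shell
   # Compare the local Spark Connect protos against the branch-3.4 release branch.
   # Requires the buf CLI and network access to github.com.
   buf breaking connector/connect/common/src/main \
     --against 'https://github.com/apache/spark.git#branch=branch-3.4,subdir=connector/connect/common/src/main'
   ```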



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip commented on a diff in pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto

2023-03-21 Thread via GitHub


grundprinzip commented on code in PR #39294:
URL: https://github.com/apache/spark/pull/39294#discussion_r1144009897


##
.github/workflows/build_and_test.yml:
##
@@ -493,6 +494,80 @@ jobs:
 name: test-results-sparkr--8-${{ inputs.hadoop }}-hive2.3
 path: "**/target/test-reports/*.xml"
 
+  proto-breaking-changes-check-master:
+needs: [precondition]
+if: always() && fromJson(needs.precondition.outputs.required).proto-breaking-changes-check == 'true'
+name: Spark Connect proto backwards compatibility check (master)
+continue-on-error: true

Review Comment:
   `continue-on-error` will let the workflow continue even if the job fails.
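
   As a minimal (hypothetical) illustration of that semantics, a job marked `continue-on-error: true` can fail without failing the overall workflow run:

   ```yaml
   jobs:
     may-fail:
       runs-on: ubuntu-22.04
       continue-on-error: true   # the job is still reported as failed, but the run is not
       steps:
         - run: exit 1           # this step fails
     unaffected:
       runs-on: ubuntu-22.04
       steps:
         - run: echo "runs regardless; the overall workflow status stays green"
   ```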






[GitHub] [spark] grundprinzip commented on pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto

2023-03-21 Thread via GitHub


grundprinzip commented on PR #39294:
URL: https://github.com/apache/spark/pull/39294#issuecomment-1478606975

   
![SCR-20230321-v7f](https://user-images.githubusercontent.com/3421/226745165-60b1611a-2ec5-4ddf-98c0-61c3a180ed9c.png)
   
   Verifying that the broken build is reported.





[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-21 Thread via GitHub


StevenChenDatabricks commented on PR #40385:
URL: https://github.com/apache/spark/pull/40385#issuecomment-1478591539

   @cloud-fan I've addressed your comments. Thanks for the review! 





[GitHub] [spark] viirya commented on pull request #40490: [SPARK-42536][BUILD] Upgrade log4j2 to 2.20.0

2023-03-21 Thread via GitHub


viirya commented on PR #40490:
URL: https://github.com/apache/spark/pull/40490#issuecomment-1478569986

   Looks good to me. +1





[GitHub] [spark] dongjoon-hyun commented on pull request #40490: [SPARK-42536][BUILD] Upgrade log4j2 to 2.20.0

2023-03-21 Thread via GitHub


dongjoon-hyun commented on PR #40490:
URL: https://github.com/apache/spark/pull/40490#issuecomment-1478567572

   cc @viirya , too.





[GitHub] [spark] dongjoon-hyun closed pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1

2023-03-21 Thread via GitHub


dongjoon-hyun closed pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1
URL: https://github.com/apache/spark/pull/40509





[GitHub] [spark] dongjoon-hyun commented on pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1

2023-03-21 Thread via GitHub


dongjoon-hyun commented on PR #40509:
URL: https://github.com/apache/spark/pull/40509#issuecomment-1478564576

   Thank you, @viirya . Merged to master for Apache Spark 3.5.





[GitHub] [spark] dongjoon-hyun commented on pull request #40440: [SPARK-42808][CORE] Avoid getting availableProcessors every time in `MapOutputTrackerMaster#getStatistics`

2023-03-21 Thread via GitHub


dongjoon-hyun commented on PR #40440:
URL: https://github.com/apache/spark/pull/40440#issuecomment-1478522804

   Thank you all!





[GitHub] [spark] dongjoon-hyun commented on pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1

2023-03-21 Thread via GitHub


dongjoon-hyun commented on PR #40509:
URL: https://github.com/apache/spark/pull/40509#issuecomment-1478454379

   Thank you, @amaliujia . 
   
   Also, could you review this PR, @viirya ?





[GitHub] [spark] peter-toth commented on pull request #40268: [SPARK-42500][SQL] ConstantPropagation support more cases

2023-03-21 Thread via GitHub


peter-toth commented on PR #40268:
URL: https://github.com/apache/spark/pull/40268#issuecomment-1478292434

   @cloud-fan, @wangyum please let me know if this PR needs further improvements.





[GitHub] [spark] simonvanderveldt commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

2023-03-21 Thread via GitHub


simonvanderveldt commented on PR #38518:
URL: https://github.com/apache/spark/pull/38518#issuecomment-1478291017

   > To be safe, could you revise this PR by adding a new internal configuration like `KUBERNETES_EXECUTOR_ENABLE_API_WATCHER` and use it with the following?
   
   Not sure this makes sense. AFAIK the current behavior is a bug, so it should just be fixed.
   If there's uncertainty about the implementation of this fix, then IMHO that uncertainty should be addressed.
   For reference, the patch as it was before the config option was added has been working fine for us; we run about 800 Spark apps a day.





[GitHub] [spark] amaliujia commented on pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1

2023-03-21 Thread via GitHub


amaliujia commented on PR #40509:
URL: https://github.com/apache/spark/pull/40509#issuecomment-1478279790

   LGTM





[GitHub] [spark] amaliujia commented on pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB

2023-03-21 Thread via GitHub


amaliujia commented on PR #40447:
URL: https://github.com/apache/spark/pull/40447#issuecomment-1478275510

   LGTM





[GitHub] [spark] amaliujia commented on pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-21 Thread via GitHub


amaliujia commented on PR #40498:
URL: https://github.com/apache/spark/pull/40498#issuecomment-1478228229

   @hvanhovell 
   
   oh are you thinking about golden file based tests?





[GitHub] [spark] amaliujia commented on pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-21 Thread via GitHub


amaliujia commented on PR #40498:
URL: https://github.com/apache/spark/pull/40498#issuecomment-1478220179

   @hvanhovell The existing codebase uses this test to verify the options:

   ```scala
   test("SPARK-32844: DataFrameReader.table take the specified options for V1 relation") {
     withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
       withTable("t") {
         sql("CREATE TABLE t(i int, d double) USING parquet OPTIONS ('p1'='v1', 'p2'='v2')")

         val df = spark.read.option("P2", "v2").option("p3", "v3").table("t")
         val options = df.queryExecution.analyzed.collectFirst {
           case r: LogicalRelation => r.relation.asInstanceOf[HadoopFsRelation].options
         }.get
         assert(options("p2") == "v2")
         assert(options("p3") == "v3")
       }
     }
   }
   ```

   However, this won't work for Spark Connect. Do you know if we have another way to achieve it?





[GitHub] [spark] srielau commented on a diff in pull request #40508: [MINOR][SQL][CONNECT][PYTHON] Clarify the comment of parameterized SQL args

2023-03-21 Thread via GitHub


srielau commented on code in PR #40508:
URL: https://github.com/apache/spark/pull/40508#discussion_r1143689845


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -213,7 +213,8 @@ class SparkSession private[sql] (
* @param sqlText
*   A SQL statement with named parameters to execute.
* @param args
-   *   A map of parameter names to literal values.
+   *   A map of parameter names to string values that are parsed as SQL literal expressions. For
+   *   example, map keys: "abc", "def_1", map values: "1", "'abc'", "DATE'2023-03-21'".

Review Comment:
   How about keys that clarify the type, at least implicitly? Also, the arity didn't match up:
   ```suggestion
   *   example, map keys: "rank", "name", "birthdate"; map values: "1", "'Steven'", "DATE'2023-03-21'".
   ```
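
   For illustration, a hypothetical call following the suggested wording (the table, column, and parameter names are invented; each value string is parsed as a SQL literal expression):

   ```scala
   // Sketch only: assumes a running SparkSession `spark` and a `people` table.
   val df = spark.sql(
     "SELECT * FROM people WHERE rank = :rank AND name = :name AND birthdate = :birthdate",
     args = Map(
       "rank" -> "1",                     // parsed as an integer literal
       "name" -> "'Steven'",              // note the embedded single quotes
       "birthdate" -> "DATE'2023-03-21'"  // a typed date literal
     ))
   ```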






[GitHub] [spark] attilapiros commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

2023-03-21 Thread via GitHub


attilapiros commented on code in PR #38518:
URL: https://github.com/apache/spark/pull/38518#discussion_r1143681244


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:
##
@@ -45,7 +46,9 @@ class ExecutorPodsWatchSnapshotSource(
 conf: SparkConf) extends Logging {
 
   private var watchConnection: Closeable = _

Review Comment:
   The `Watcher` runs on a separate thread, right? And `watchConnection` is sometimes accessed from one thread via `ExecutorPodsWatchSnapshotSource#stop()` and sometimes from the watcher thread via `ExecutorPodsWatchSnapshotSource#reset()`. So the methods where `watchConnection` is accessed (`start()`, `stop()`, and `reset()`) must be synchronized.
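
   A standalone sketch of that fix (hypothetical class, not the actual Spark code): every access to the shared `Closeable` goes through the object's monitor, so a `stop()` from the driver thread and a `reset()` from the watcher thread can never interleave mid-update.

   ```scala
   import java.io.Closeable

   class WatchSource {
     private var watchConnection: Closeable = _

     def start(conn: Closeable): Unit = synchronized {
       watchConnection = conn
     }

     def stop(): Unit = synchronized {
       if (watchConnection != null) {
         watchConnection.close()
         watchConnection = null
       }
     }

     // Called from the watcher thread on a resource-version change: close the
     // stale connection and install a fresh one under the same lock. The JVM
     // monitor is re-entrant, so calling stop() from here is safe.
     def reset(newConn: Closeable): Unit = synchronized {
       stop()
       watchConnection = newConn
     }
   }
   ```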






[GitHub] [spark] unical1988 commented on pull request #40468: "[SPARK-42838][SQL] changed error class name _LEGACY_ERROR_TEMP_2000"

2023-03-21 Thread via GitHub


unical1988 commented on PR #40468:
URL: https://github.com/apache/spark/pull/40468#issuecomment-1478174533

   > Found JIRA: [SPARK-42838](https://issues.apache.org/jira/browse/SPARK-42838).
   > 
   > Can you attach the JIRA number to the title?
   > 
   > e.g. "[[SPARK-42838](https://issues.apache.org/jira/browse/SPARK-42838)][SQL] changed error class name _LEGACY_ERROR_TEMP_2000"
   
   @itholic done!





[GitHub] [spark] pan3793 commented on pull request #40444: [SPARK-42813][K8S] Print application info when waitAppCompletion is false

2023-03-21 Thread via GitHub


pan3793 commented on PR #40444:
URL: https://github.com/apache/spark/pull/40444#issuecomment-1478154866

   Thank you, @dongjoon-hyun and @yaooqinn 




