[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1129079073


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala:
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml.param.{ParamMap, Params}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.connect.planner.SparkConnectPlanner
+import org.apache.spark.sql.connect.service.SessionHolder
+
+object MLUtils {
+
+  def setInstanceParams(instance: Params, paramsProto: proto.Params): Unit = {
+    import scala.collection.JavaConverters._
+    paramsProto.getParamsMap.asScala.foreach { case (paramName, paramValueProto) =>
+      val paramDef = instance.getParam(paramName)
+      val paramValue = parseParamValue(paramDef.paramValueClassTag.runtimeClass, paramValueProto)
+      instance.set(paramDef, paramValue)
+    }
+    paramsProto.getDefaultParamsMap.asScala.foreach { case (paramName, paramValueProto) =>
+      val paramDef = instance.getParam(paramName)
+      val paramValue = parseParamValue(paramDef.paramValueClassTag.runtimeClass, paramValueProto)
+      instance._setDefault(paramDef -> paramValue)
+    }
+  }
+
+  def parseParamValue(paramType: Class[_], paramValueProto: proto.Expression.Literal): Any = {

Review Comment:
   I see, we don't have `IntParam` in PySpark






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1127834899


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/Serializer.scala:
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml.linalg.{Matrix, Vector}
+
+object Serializer {

Review Comment:
   ~~Good idea.~~
   We need to address this issue first: 
   https://github.com/apache/spark/pull/40297#discussion_r1129072923






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1129073881


##
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "spark/connect/expressions.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message Evaluator {
+  string name = 1;
+  Params params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+    Fit fit = 1;
+    ModelAttr model_attr = 2;
+    ModelSummaryAttr model_summary_attr = 3;
+    LoadModel load_model = 4;
+    SaveModel save_model = 5;
+    Evaluate evaluate = 6;
+  }
+
+  message Fit {
+    Stage estimator = 1;
+    Relation dataset = 2;
+  }
+
+  message Evaluate {
+    Evaluator evaluator = 1;
+  }
+
+  message LoadModel {

Review Comment:
   For estimators, we need to define a new LoadEstimator command.






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1129073675


##
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "spark/connect/expressions.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message Evaluator {
+  string name = 1;
+  Params params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+    Fit fit = 1;
+    ModelAttr model_attr = 2;
+    ModelSummaryAttr model_summary_attr = 3;
+    LoadModel load_model = 4;
+    SaveModel save_model = 5;
+    Evaluate evaluate = 6;
+  }
+
+  message Fit {
+    Stage estimator = 1;
+    Relation dataset = 2;
+  }
+
+  message Evaluate {
+    Evaluator evaluator = 1;
+  }
+
+  message LoadModel {

Review Comment:
   I will rename it to `LoadStage` to support estimators too.
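   
   A hypothetical sketch of the renamed message (only the rename is stated here; 
the field layout is assumed for illustration):
   
   ```protobuf
   // Hypothetical: replaces LoadModel so estimators/transformers can be loaded too.
   message LoadStage {
     // Registered stage name, e.g. "LogisticRegression".
     string name = 1;
     // Path the stage was previously saved to.
     string path = 2;
   }
   ```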






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1129073121


##
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "spark/connect/expressions.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message Evaluator {
+  string name = 1;
+  Params params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+    Fit fit = 1;
+    ModelAttr model_attr = 2;
+    ModelSummaryAttr model_summary_attr = 3;
+    LoadModel load_model = 4;
+    SaveModel save_model = 5;
+    Evaluate evaluate = 6;
+  }
+
+  message Fit {
+    Stage estimator = 1;
+    Relation dataset = 2;
+  }
+
+  message Evaluate {
+    Evaluator evaluator = 1;
+  }
+
+  message LoadModel {

Review Comment:
   Sure.






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1129072923


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala:
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml.param.{ParamMap, Params}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.connect.planner.SparkConnectPlanner
+import org.apache.spark.sql.connect.service.SessionHolder
+
+object MLUtils {
+
+  def setInstanceParams(instance: Params, paramsProto: proto.Params): Unit = {
+    import scala.collection.JavaConverters._
+    paramsProto.getParamsMap.asScala.foreach { case (paramName, paramValueProto) =>
+      val paramDef = instance.getParam(paramName)
+      val paramValue = parseParamValue(paramDef.paramValueClassTag.runtimeClass, paramValueProto)
+      instance.set(paramDef, paramValue)
+    }
+    paramsProto.getDefaultParamsMap.asScala.foreach { case (paramName, paramValueProto) =>
+      val paramDef = instance.getParam(paramName)
+      val paramValue = parseParamValue(paramDef.paramValueClassTag.runtimeClass, paramValueProto)
+      instance._setDefault(paramDef -> paramValue)
+    }
+  }
+
+  def parseParamValue(paramType: Class[_], paramValueProto: proto.Expression.Literal): Any = {

Review Comment:
   > for int/long case, we can specify the DataType in Python Client
   
   This does not help, because PySpark estimator `Params` do not record a type 
attribute, so we cannot distinguish int/long or float/double for Python-side 
param values.
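   
   For illustration, a minimal PySpark sketch of the problem (`maxIter` is just 
an arbitrary example param):
   
   ```python
   from pyspark.ml.classification import LogisticRegression

   lr = LogisticRegression(maxIter=10)
   param = lr.getParam("maxIter")
   # A Python Param carries a parent, a name, a doc and a typeConverter, but no
   # declared value type; on the JVM side the same param is a typed IntParam.
   # Python 3 also has a single int type, so the client cannot tell whether 10
   # should be encoded as an int or a long literal.
   print(type(lr.getOrDefault("maxIter")))  # <class 'int'>
   ```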






[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1459668923

   > You first defined a case-sensitive data set, then queried it in a 
case-insensitive way, so I guess the error is expected.
   
   In the physical plan, both the id and ID columns are projected from the same 
column in the dataframe, _1#6:
   _1#6 AS id#17,  _1#6 AS ID#17
   So there is no ambiguity.
   
   Also, the matched attributes are the same: attributes: Vector(id#17, id#17)
   It is considered ambiguous only because there are duplicates in the matched 
result.
   
   If the matched attribute result were Vector(id#17, ID#17), then it would 
have been a valid error.
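   
   A minimal repro sketch of the scenario (hypothetical; the attribute ids follow 
the plan above, and `spark.implicits._` is assumed in scope):
   
   ```scala
   // With the default spark.sql.caseSensitive=false:
   val df = Seq(1).toDF("_1")
     .select($"_1".as("id"), $"_1".as("ID"))  // plan: _1#6 AS id#17, _1#6 AS ID#17

   // Both names resolve to the same underlying attribute, yet this currently
   // fails with an ambiguous column reference error:
   df.select("id")
   ```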
   





[GitHub] [spark] yaooqinn commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub


yaooqinn commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1459654900

   You first defined a case-sensitive data set, then queried it in a 
case-insensitive way, so I guess the error is expected.





[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1459652942

   > Can you try `set spark.sql.caseSensitive=true`?
   
   Yes, I have tried it. With caseSensitive set to true, it works, since id and 
ID are then treated as separate columns.
   The issue is when column names are supposed to be considered case-insensitive.





[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1459652457

   Sure! Please go ahead :-)





[GitHub] [spark] yaooqinn commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub


yaooqinn commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1459648316

   Can you try `set spark.sql.caseSensitive=true`?





[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

2023-03-07 Thread via GitHub


MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1129047990


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -465,7 +465,20 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
   }
 
   /**
-   * Replaces [[UnresolvedAlias]]s with concrete aliases.
+   * Replaces [[UnresolvedAlias]]s with concrete aliases by applying the following rules:
+   *   1. Use the specified name of named expressions;
+   *   2. Derive a stable alias from the original normalized SQL text except of:
+   *     2.1. A multipart identifier (column, field, mapkey, ...) -> the right most identifier.
+   *          For example: a.b.c => c
+   *     2.2. A CAST, or try_cast -> the argument of the cast.
+   *          For example: CAST(c1 AS INT) => c1
+   *     2.3. A map lookup with a literal -> the map key.
+   *          For example: map[5] => 5
+   *     2.4. Squeezing out unwanted whitespace or comments.
+   *          For example: T.c1 + /* test */ foo( 5 ) => T.c1+foo(5)
+   *     2.5. Normalize SQL string literals when ANSI mode is enabled and the SQL config
+   *          `spark.sql.ansi.doubleQuotedIdentifiers` is set to `true`.
+   *          For example: "abc" => 'abc'

Review Comment:
   Updated the comment.






[GitHub] [spark] LuciferYang commented on pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub


LuciferYang commented on PR #40317:
URL: https://github.com/apache/spark/pull/40317#issuecomment-1459646333

   Thanks @HyukjinKwon @hvanhovell @dongjoon-hyun @beliefer 





[GitHub] [spark] zhengruifeng opened a new pull request, #40331: [SPARK-42713][PYTHON][DOCS] Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference

2023-03-07 Thread via GitHub


zhengruifeng opened a new pull request, #40331:
URL: https://github.com/apache/spark/pull/40331

   ### What changes were proposed in this pull request?
   Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference
   
   
   ### Why are the changes needed?
'__getattr__' and '__getitem__' are widely used, but we did not document 
them.
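   
   For illustration, the two access patterns being documented (a minimal sketch, 
assuming an active `spark` session):
   
   ```python
   df = spark.range(3).withColumnRenamed("id", "value")
   df.value      # DataFrame.__getattr__
   df["value"]   # DataFrame.__getitem__
   ```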
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new doc
   
   ### How was this patch tested?
   added doctests
   





[GitHub] [spark] xinrong-meng commented on pull request #40329: [SPARK-42710][CONNECT][PYTHON] Rename FrameMap proto to MapPartitions

2023-03-07 Thread via GitHub


xinrong-meng commented on PR #40329:
URL: https://github.com/apache/spark/pull/40329#issuecomment-1459636445

   CC @HyukjinKwon @hvanhovell 





[GitHub] [spark] xinrong-meng opened a new pull request, #40330: [SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

2023-03-07 Thread via GitHub


xinrong-meng opened a new pull request, #40330:
URL: https://github.com/apache/spark/pull/40330

   ### What changes were proposed in this pull request?
   Improve docstring of mapInPandas and mapInArrow 
   
   ### Why are the changes needed?
   For readability. We call out that they are not scalar: the input and output of 
the function might be of different sizes.
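   
   For illustration, a minimal sketch of the non-scalar behavior (assumes an 
active `spark` session):
   
   ```python
   def keep_positive(iterator):
       # Each element is a pandas DataFrame; the yielded frames may have a
       # different number of rows than the input frames.
       for pdf in iterator:
           yield pdf[pdf.id > 0]

   df = spark.range(-2, 3)
   df.mapInPandas(keep_positive, schema="id long").show()  # only 1 and 2 remain
   ```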
   
   ### Does this PR introduce _any_ user-facing change?
   No. Doc change only.
   
   
   ### How was this patch tested?
   Existing tests.





[GitHub] [spark] xinrong-meng commented on pull request #40330: [SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

2023-03-07 Thread via GitHub


xinrong-meng commented on PR #40330:
URL: https://github.com/apache/spark/pull/40330#issuecomment-1459635902

   CC @HyukjinKwon @hvanhovell 





[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


jerqi commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1459630906

   > @jerqi the basic issue here is that `getPreferredLocations` in 
`ShuffledRowRDD` should return `Nil` from the very beginning when 
`spark.shuffle.reduceLocality.enabled = false` (conceptually).
   > 
   > This logic is pushed into MapOutputTracker though - and 
`getPreferredLocationsForShuffle` honors `spark.shuffle.reduceLocality.enabled` 
- but `getMapLocation` does not.
   > 
   > So the fix would be to fix `getMapLocation` to honor the parameter.
   > 
   > We should fix this in a different PR though.
   
   Could I raise another PR to fix this issue?





[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


jerqi commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1459630519

   Could I raise another PR to fix this issue?





[GitHub] [spark] zhengruifeng commented on pull request #40322: [SPARK-41775][PYTHON][FOLLOW-UP] Updating error message for training using PyTorch functions

2023-03-07 Thread via GitHub


zhengruifeng commented on PR #40322:
URL: https://github.com/apache/spark/pull/40322#issuecomment-1459625070

   merged into master/branch-3.4





[GitHub] [spark] zhengruifeng closed pull request #40322: [SPARK-41775][PYTHON][FOLLOW-UP] Updating error message for training using PyTorch functions

2023-03-07 Thread via GitHub


zhengruifeng closed pull request #40322: [SPARK-41775][PYTHON][FOLLOW-UP] 
Updating error message for training using PyTorch functions
URL: https://github.com/apache/spark/pull/40322





[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1459618430

   @jerqi the basic issue here is that `getPreferredLocations` in `ShuffledRowRDD` 
should return `Nil` from the very beginning when 
`spark.shuffle.reduceLocality.enabled = false`.
   
   We should fix this in a different PR though.





[GitHub] [spark] grundprinzip commented on a diff in pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


grundprinzip commented on code in PR #40323:
URL: https://github.com/apache/spark/pull/40323#discussion_r1129021709


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1508,8 +1508,10 @@ class SparkConnectPlanner(val session: SparkSession) {
           maxRecordsPerBatch,
           maxBatchSize,
           timeZoneId)
-      assert(batches.size == 1)
-      batches.next()
+      assert(batches.hasNext)

Review Comment:
   I see, thank you. My understanding of iterators was that non-rewindable 
iterators simply do not have a size method.
   
   But I learned something new.






[GitHub] [spark] ueshin commented on a diff in pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


ueshin commented on code in PR #40323:
URL: https://github.com/apache/spark/pull/40323#discussion_r1129020413


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1508,8 +1508,10 @@ class SparkConnectPlanner(val session: SparkSession) {
           maxRecordsPerBatch,
           maxBatchSize,
           timeZoneId)
-      assert(batches.size == 1)
-      batches.next()
+      assert(batches.hasNext)

Review Comment:
   `batches` is an iterator, so `batches.size` consumes all the data in the 
iterator to calculate the size.
   Then `batches.next()` would return empty data? Usually it should throw an 
exception, though.
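   
   A quick Scala illustration of the point (a sketch, not code from the PR):
   
   ```scala
   val batches = Iterator("b1".getBytes, "b2".getBytes)
   assert(batches.size == 2)   // size traverses and exhausts the iterator
   assert(!batches.hasNext)    // nothing is left afterwards
   // batches.next()           // would now throw NoSuchElementException
   ```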






[GitHub] [spark] xinrong-meng commented on pull request #40244: [SPARK-42643][CONNECT][PYTHON] Register Java (aggregate) user-defined functions

2023-03-07 Thread via GitHub


xinrong-meng commented on PR #40244:
URL: https://github.com/apache/spark/pull/40244#issuecomment-1459608526

   Merged to master and branch-3.4, thanks all!





[GitHub] [spark] xinrong-meng closed pull request #40244: [SPARK-42643][CONNECT][PYTHON] Register Java (aggregate) user-defined functions

2023-03-07 Thread via GitHub


xinrong-meng closed pull request #40244: [SPARK-42643][CONNECT][PYTHON] 
Register Java (aggregate) user-defined functions
URL: https://github.com/apache/spark/pull/40244





[GitHub] [spark] xinrong-meng opened a new pull request, #40329: Rename FrameMap proto to MapPartitions

2023-03-07 Thread via GitHub


xinrong-meng opened a new pull request, #40329:
URL: https://github.com/apache/spark/pull/40329

   ### What changes were proposed in this pull request?
   Rename FrameMap proto to MapPartitions.
   
   
   ### Why are the changes needed?
   For readability.
   
   Frame Map API refers to mapInPandas and mapInArrow, which are equivalent to 
MapPartitions.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing tests.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1129010791


##
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "spark/connect/expressions.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message Evaluator {
+  string name = 1;
+  Params params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+    Fit fit = 1;
+    ModelAttr model_attr = 2;
+    ModelSummaryAttr model_summary_attr = 3;
+    LoadModel load_model = 4;
+    SaveModel save_model = 5;
+    Evaluate evaluate = 6;
+  }
+
+  message Fit {
+    Stage estimator = 1;
+    Relation dataset = 2;
+  }
+
+  message Evaluate {
+    Evaluator evaluator = 1;
+  }
+
+  message LoadModel {

Review Comment:
   will `LoadModel` and `SaveModel` also support loading/saving 
estimators/transformers?






[GitHub] [spark] zhengruifeng commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-07 Thread via GitHub


zhengruifeng commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1459594072

   If the nested lambda issue also exists in the Scala Client, do we need to 
fix it in the same way?





[GitHub] [spark] grundprinzip commented on a diff in pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


grundprinzip commented on code in PR #40323:
URL: https://github.com/apache/spark/pull/40323#discussion_r1129008040


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1508,8 +1508,10 @@ class SparkConnectPlanner(val session: SparkSession) {
           maxRecordsPerBatch,
           maxBatchSize,
           timeZoneId)
-      assert(batches.size == 1)
-      batches.next()
+      assert(batches.hasNext)

Review Comment:
   Sorry for the late reply, but how is this code different from the existing one?
   
       val bytes = batches.next()
       bytes
   
   is the same as
   
       batches.next()
   
   The asserts in between don't count, as they don't have side effects.






[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1128994906


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.classification.TrainingSummary
+import org.apache.spark.sql.DataFrame
+
+
+object AlgorithmRegistry {
+
+  def get(name: String): Algorithm = {
+    name match {
+      case "LogisticRegression" => new LogisticRegressionAlgorithm
+      case _ =>
+        throw new IllegalArgumentException()
+    }
+  }
+
+}
+
+
+abstract class Algorithm {
+
+  def initiateEstimator(uid: String): Estimator[_]
+
+  def getModelAttr(model: Model[_], name: String): Either[proto.MlCommandResponse, DataFrame]

Review Comment:
   ok. I am neutral on it.






[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1459535564

   Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please 
review this PR?





[GitHub] [spark] HyukjinKwon opened a new pull request, #40328: [SPARK-42709][PYTHON] Remove the assumption of `__file__` being available

2023-03-07 Thread via GitHub


HyukjinKwon opened a new pull request, #40328:
URL: https://github.com/apache/spark/pull/40328

   ### What changes were proposed in this pull request?
   
   This PR proposes to add a check for the `__file__` attribute.
   
   ### Why are the changes needed?
   
   `__file__` might not be available everywhere. See also 
https://github.com/scikit-learn/scikit-learn/issues/20081
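   
   A minimal sketch of such a guard (assumed shape; the actual patch may differ):
   
   ```python
   import os

   # __file__ can be absent, e.g. in frozen binaries or embedded interpreters,
   # so look it up defensively instead of assuming it exists.
   module_file = globals().get("__file__")
   if module_file is not None:
       module_dir = os.path.dirname(os.path.abspath(module_file))
   else:
       module_dir = None  # fall back to a path-independent code path
   ```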
   
   ### Does this PR introduce _any_ user-facing change?
   
   If a user's Python environment does not provide `__file__`, they can now use 
PySpark in that environment too.
   
   ### How was this patch tested?
   
   Manually tested.





[GitHub] [spark] shrprasa commented on a diff in pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-07 Thread via GitHub


shrprasa commented on code in PR #40128:
URL: https://github.com/apache/spark/pull/40128#discussion_r1128983339


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala:
##
@@ -143,6 +144,9 @@ private[spark] class Client(
 logError("Please check \"kubectl auth can-i create [resource]\" 
first." +
   " It should be yes. And please also check your feature step 
implementation.")
 kubernetesClient.resourceList(preKubernetesResources: _*).delete()
+// register shutdownhook for cleaning up the upload dir only

Review Comment:
   @holdenk do you have any suggestion to handle this in a better way?






[GitHub] [spark] zhengruifeng commented on pull request #40233: [WIP][SPARK-42630][CONNECT][PYTHON] Make `parse_data_type` use new proto message `DDLParse`

2023-03-07 Thread via GitHub


zhengruifeng commented on PR #40233:
URL: https://github.com/apache/spark/pull/40233#issuecomment-1459503928

   close in favor of https://github.com/apache/spark/pull/40260





[GitHub] [spark] zhengruifeng closed pull request #40233: [WIP][SPARK-42630][CONNECT][PYTHON] Make `parse_data_type` use new proto message `DDLParse`

2023-03-07 Thread via GitHub


zhengruifeng closed pull request #40233: [WIP][SPARK-42630][CONNECT][PYTHON] 
Make `parse_data_type` use new proto message `DDLParse`
URL: https://github.com/apache/spark/pull/40233





[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-03-07 Thread via GitHub


shrprasa commented on PR #37880:
URL: https://github.com/apache/spark/pull/37880#issuecomment-1459499404

   Thanks @yaooqinn for merging the PR.





[GitHub] [spark] dtenedor commented on a diff in pull request #40299: [SPARK-42684][SQL] v2 catalog should not allow column default value by default

2023-03-07 Thread via GitHub


dtenedor commented on code in PR #40299:
URL: https://github.com/apache/spark/pull/40299#discussion_r1128979585


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##
@@ -137,6 +113,49 @@ object ResolveDefaultColumns {
 }
   }
 
+  // Fails if the given catalog does not support column default value.
+  def validateCatalogForDefaultValue(
+      schema: StructType,
+      catalog: TableCatalog,
+      ident: Identifier): Unit = {
+    if (SQLConf.get.enableDefaultColumns &&
+        schema.exists(_.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY))) {
+      if (!catalog.capabilities().contains(

Review Comment:
   We can combine this with the previous if statement using `&&`?
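   
   A sketch of the suggested flattening (the capability constant is truncated in 
the quoted diff and assumed here):
   
   ```scala
   if (SQLConf.get.enableDefaultColumns &&
       schema.exists(_.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) &&
       !catalog.capabilities().contains(
         TableCatalogCapability.SUPPORT_COLUMN_DEFAULT_VALUE)) {  // assumed constant
     // fail: the catalog does not declare support for column default values
   }
   ```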






[GitHub] [spark] zhengruifeng commented on pull request #40325: [SPARK-42707][CONNECT][DOCS] Update developer documentation about API stability warning

2023-03-07 Thread via GitHub


zhengruifeng commented on PR #40325:
URL: https://github.com/apache/spark/pull/40325#issuecomment-1459464317

   merged into master/3.4





[GitHub] [spark] zhengruifeng closed pull request #40325: [SPARK-42707][CONNECT][DOCS] Update developer documentation about API stability warning

2023-03-07 Thread via GitHub


zhengruifeng closed pull request #40325: [SPARK-42707][CONNECT][DOCS] Update 
developer documentation about API stability warning
URL: https://github.com/apache/spark/pull/40325





[GitHub] [spark] HyukjinKwon closed pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub


HyukjinKwon closed pull request #40317: [SPARK-42700][BUILD] Add `h2` as test 
dependency of connect-server module
URL: https://github.com/apache/spark/pull/40317





[GitHub] [spark] HyukjinKwon commented on pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub


HyukjinKwon commented on PR #40317:
URL: https://github.com/apache/spark/pull/40317#issuecomment-1459462492

   Merged to master and branch-3.4.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40325: [SPARK-42707][CONNECT][DOCS] Update developer documentation about API stability warning

2023-03-07 Thread via GitHub


HyukjinKwon commented on code in PR #40325:
URL: https://github.com/apache/spark/pull/40325#discussion_r1128968097


##
python/pyspark/sql/connect/__init__.py:
##
@@ -15,5 +15,4 @@
 # limitations under the License.
 #
 
-"""Currently Spark Connect is very experimental and the APIs to interact with
-Spark through this API are can be changed at any time without warning."""
+"""Spark Connect cleint"""

Review Comment:
   ```suggestion
   """Spark Connect client"""
   ```






[GitHub] [spark] HyukjinKwon commented on pull request #40327: [SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used

2023-03-07 Thread via GitHub


HyukjinKwon commented on PR #40327:
URL: https://github.com/apache/spark/pull/40327#issuecomment-1459455421

   cc @zhengruifeng @ueshin @grundprinzip FYI





[GitHub] [spark] HyukjinKwon opened a new pull request, #40327: [SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used

2023-03-07 Thread via GitHub


HyukjinKwon opened a new pull request, #40327:
URL: https://github.com/apache/spark/pull/40327

   ### What changes were proposed in this pull request?
   
   This PR proposes to remove the parent directory in `shell.py` execution when 
IPython is used. 
   
   This is a general issue for the PySpark shell specifically with IPython: 
IPython temporarily adds the parent directory of the script to the Python 
path (`sys.path`), which results in searching for packages under the 
`pyspark` directory. For example, `import pandas` attempts to import 
`pyspark.pandas`.
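   
   A minimal sketch of the kind of guard this implies for `shell.py` (assumed 
shape, not necessarily the exact patch):
   
   ```python
   import os
   import sys

   # IPython temporarily puts the directory containing the executed script
   # (python/pyspark for shell.py) on sys.path, so "import pandas" can resolve
   # to pyspark/pandas. Dropping that entry restores normal package resolution.
   if "__file__" in globals():
       shell_dir = os.path.dirname(os.path.abspath(__file__))
       sys.path = [p for p in sys.path if p != shell_dir]
   ```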
   
   So far, we haven't had such a case within PySpark's own import code path, 
but Spark Connect now hits it via its dependency check (which attempts to 
import pandas), which exposes the problem.
   
   Running it with IPython can easily reproduce the error:
   
   ```bash
   PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
   ```
   
   ### Why are the changes needed?
   
   To make PySpark shell properly import other packages even when the names 
conflict with subpackages (e.g., `pyspark.pandas` vs `pandas`)
   
   ### Does this PR introduce _any_ user-facing change?
   
   No to the end users:
   - This path is only inserted for `shell.py` execution, and thankfully we 
didn't have such a relative-import case so far.
   - It fixes the issue in the still-unreleased Spark Connect.
   
   ### How was this patch tested?
   
   Manually tested.
   
   ```bash
   PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
   ```
   
   **Before:**
   
   ```
   Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
   Type 'copyright', 'credits' or 'license' for more information
   IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
   /.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
     warnings.warn("Failed to initialize Spark session.")
   Traceback (most recent call last):
     File "/.../spark/python/pyspark/shell.py", line 40, in <module>
       spark = SparkSession.builder.getOrCreate()
     File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
       from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
     File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
       check_dependencies(__name__, __file__)
     File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
       require_minimum_pandas_version()
     File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
       import pandas
     File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
       from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
     File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
       require_minimum_pandas_version()
     File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
       if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
   ...
   ```
   
   
   **After:**
   
   
   ```
   Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
   Type 'copyright', 'credits' or 'license' for more information
   IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
         /_/
   
   Using Python version 3.9.16 (main, Feb  1 2023 21:42:20)
   Client connected to the Spark Connect server at localhost
   SparkSession available as 'spark'.
   
   In [1]:
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40299: [SPARK-42684][SQL] v2 catalog should not allow column default value by default

2023-03-07 Thread via GitHub


cloud-fan commented on PR #40299:
URL: https://github.com/apache/spark/pull/40299#issuecomment-1459426907

   cc @gengliangwang @dtenedor 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub


zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1128955568


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala:
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml.param.{ParamMap, Params}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.connect.planner.SparkConnectPlanner
+import org.apache.spark.sql.connect.service.SessionHolder
+
+object MLUtils {
+
+  def setInstanceParams(instance: Params, paramsProto: proto.Params): Unit = {
+import scala.collection.JavaConverters._
+paramsProto.getParamsMap.asScala.foreach { case (paramName, 
paramValueProto) =>
+  val paramDef = instance.getParam(paramName)
+  val paramValue = 
parseParamValue(paramDef.paramValueClassTag.runtimeClass, paramValueProto)
+  instance.set(paramDef, paramValue)
+}
+paramsProto.getDefaultParamsMap.asScala.foreach { case (paramName, 
paramValueProto) =>
+  val paramDef = instance.getParam(paramName)
+  val paramValue = 
parseParamValue(paramDef.paramValueClassTag.runtimeClass, paramValueProto)
+  instance._setDefault(paramDef -> paramValue)
+}
+  }
+
+  def parseParamValue(paramType: Class[_], paramValueProto: 
proto.Expression.Literal): Any = {

Review Comment:
   For the int/long case, we can specify the DataType in the Python client:
   
   
https://github.com/apache/spark/blob/c99a632fea74136964b27b28563115fe2d7667b3/python/pyspark/sql/connect/expressions.py#L165-L171
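   
   For example, on the Python client one can pin the literal type explicitly. A 
   hedged sketch (the exact `LiteralExpression` constructor signature at the 
   linked revision is an assumption):
   
   ```python
   # Passing an explicit DataType lets an int value be encoded as either a
   # 32-bit or a 64-bit literal instead of being inferred from the value.
   from pyspark.sql.connect.expressions import LiteralExpression
   from pyspark.sql.types import IntegerType, LongType
   
   int_lit = LiteralExpression(1, IntegerType())   # encoded as an int32 literal
   long_lit = LiteralExpression(1, LongType())     # encoded as an int64 literal
   ```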



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Hisoka-X opened a new pull request, #40326: [SPARK-42708] [Docs] Improve doc about protobuf java file can't be indexed.

2023-03-07 Thread via GitHub


Hisoka-X opened a new pull request, #40326:
URL: https://github.com/apache/spark/pull/40326

   
   
   ### What changes were proposed in this pull request?
   Improve the README doc for developers about protobuf Java files that can't be indexed.
   
   
   
   ### Why are the changes needed?
   To make it easier for developers to get started with Spark.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   
   
   ### How was this patch tested?
   No
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #40295: [SPARK-42681][SQL] Relax ordering constraint for ALTER TABLE ADD|REPLACE column descriptor

2023-03-07 Thread via GitHub


gengliangwang closed pull request #40295: [SPARK-42681][SQL] Relax ordering 
constraint for ALTER TABLE ADD|REPLACE column descriptor
URL: https://github.com/apache/spark/pull/40295


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #40295: [SPARK-42681][SQL] Relax ordering constraint for ALTER TABLE ADD|REPLACE column descriptor

2023-03-07 Thread via GitHub


gengliangwang commented on PR #40295:
URL: https://github.com/apache/spark/pull/40295#issuecomment-1459387912

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub


beliefer commented on PR #40317:
URL: https://github.com/apache/spark/pull/40317#issuecomment-1459375825

   @LuciferYang Thank you for the work. 
   https://github.com/apache/spark/pull/40291 needs this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1128943046


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   One notable difference is not using `AssertNotNull` and relying on the same 
framework we have for V2 tables. In particular, we check the assignment mode 
and whether in and out attributes are compatible. As a consequence, we get 
analysis errors and not runtime errors.
   
   However, the main contribution of this PR is `AssignmentUtils` that aligns 
assignments with table attributes. It is a prerequisite for rewriting UPDATEs. 
Right now, we only rewrite DELETEs. The new utility also handles casts and 
char/varchar types, which we already did here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1128945996


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignRowLevelCommandAssignments.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UpdateTable}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+
+/**
+ * A rule that aligns assignments in row-level operations.
+ *
+ * Note that this rule must be run after resolving default values but before 
rewriting row-level
+ * commands into executable plans. This rule does not apply to tables that 
accept any schema.
+ * Such tables must inject their own rules to align assignments.
+ */
+object AlignRowLevelCommandAssignments extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+case u: UpdateTable if !u.skipSchemaResolution && u.resolved && !u.aligned 
=>

Review Comment:
   We can apply this rule only if `table` implements 
`SupportsRowLevelOperations` if that feels safer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1128944608


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala:
##
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeReference, CreateNamedStruct, Expression, ExtractValue, 
GetStructField, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.Assignment
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object AssignmentUtils extends SQLConfHelper with CastSupport {
+
+  private case class ColumnUpdate(ref: Seq[String], expr: Expression)
+
+  /**

Review Comment:
   @cloud-fan, this doc gives a bit more details about why this PR is a 
prerequisite for rewriting UPDATEs. Let me know if this makes sense!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1128943046


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   One notable difference is not using `AssertNotNull` and relying on the same 
framework we have for V2 tables. In particular, we check the assignment mode 
and whether in and out attributes are compatible. As a consequence, we get 
analysis errors and not runtime errors.
   
   However, the main contribution of this PR is `AssignmentUtils` that aligns 
assignments with table attributes. It is a prerequisite for rewriting UPDATEs. 
Right now, we only rewrite DELETEs. The new utility also handles casts and 
char/varchar types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1128943386


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala:
##
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeReference, CreateNamedStruct, Expression, ExtractValue, 
GetStructField, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.Assignment
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object AssignmentUtils extends SQLConfHelper with CastSupport {
+
+  private case class ColumnUpdate(ref: Seq[String], expr: Expression)
+
+  /**
+   * Aligns assignments to match table columns.
+   * 
+   * This method processes and reorders given assignments so that each target 
column gets
+   * an expression it should be set to. If a column does not have a matching 
assignment,
+   * it will be set to its current value. For example, if one passes table 
attributes c1, c2
+   * and an assignment c2 = 1, this method will return c1 = c1, c2 = 1.
+   * 
+   * This method also handles updates to nested columns. If there is an 
assignment to a particular
+   * nested field, this method will construct a new struct with one field 
updated preserving other
+   * fields that have not been modified. For example, if one passes table 
attributes c1, c2
+   * where c2 is a struct with fields n1 and n2 and an assignment c2.n2 = 1, 
this method will
+   * return c1 = c1, c2 = struct(c2.n1, 1).
+   *
+   * @param attrs table attributes
+   * @param assignments assignments to align
+   * @return aligned assignments that match table columns
+   */
+  def alignAssignments(
+  attrs: Seq[Attribute],
+  assignments: Seq[Assignment]): Seq[Assignment] = {
+
+val errors = new mutable.ArrayBuffer[String]()
+
+val output = applyUpdates(
+  updates = assignments.map(toColumnUpdate),
+  cols = attrs.map(restoreActualType),
+  colExprs = attrs,
+  addError = err => errors += err)
+
+if (errors.nonEmpty) {
+  throw 
QueryCompilationErrors.invalidRowLevelOperationAssignments(assignments, 
errors.toSeq)
+}
+
+attrs.zip(output).map { case (attr, expr) => Assignment(attr, expr) }
+  }
+
+  private def toColumnUpdate(assignment: Assignment): ColumnUpdate = {
+ColumnUpdate(toRef(assignment.key), assignment.value)
+  }
+
+  private def restoreActualType(attr: Attribute): Attribute = {
+
attr.withDataType(CharVarcharUtils.getRawType(attr.metadata).getOrElse(attr.dataType))
+  }
+
+  private def applyUpdates(
+  updates: Seq[ColumnUpdate],
+  cols: Seq[Attribute],
+  colExprs: Seq[Expression],
+  addError: String => Unit,
+  colPath: Seq[String] = Nil): Seq[Expression] = {
+
+// iterate through columns at the current level and find matching updates
+cols.zip(colExprs).map { case (col, colExpr) =>
+  // find matches for this column or any of its children
+  val prefixMatchedUpdates = updates.filter(update => 
conf.resolver(update.ref.head, col.name))
+  prefixMatchedUpdates match {
+// if there is no exact match and no match for children, return the 
column expr as is
+case matchedUpdates if matchedUpdates.isEmpty =>
+  colExpr

Review Comment:
   Note: I am skipping char/varchar length checks when a field is assigned to 
itself (i.e. did not change).
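   
   As a quick illustration of the alignment behavior described in the scaladoc 
   above (toy Python pseudologic, not the actual Scala utility):
   
   ```python
   # Toy model of assignment alignment: every top-level column receives an
   # expression; unmatched columns keep their current value, and a nested
   # assignment rebuilds the struct around the updated field.
   def align(columns, assignments):
       out = {}
       for col, fields in columns.items():
           if fields is None:  # scalar column keeps its value unless assigned
               out[col] = assignments.get(col, col)
           else:  # struct column: rebuild with updated fields
               parts = [str(assignments.get(col + "." + n, col + "." + n)) for n in fields]
               out[col] = "struct(" + ", ".join(parts) + ")"
       return out
   
   # Table: c1 INT, c2 STRUCT<n1, n2>; statement: UPDATE t SET c2.n2 = 1
   print(align({"c1": None, "c2": ["n1", "n2"]}, {"c2.n2": 1}))
   # -> {'c1': 'c1', 'c2': 'struct(c2.n1, 1)'}
   ```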



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact 

[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1128943046


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   One notable difference is not using `AssertNotNull` and relying on the same 
framework we have for V2 tables. In particular, we check the assignment mode 
and whether in and out attributes are compatible.
   
   The main contribution of this PR is `AssignmentUtils` that aligns 
assignments with table attributes. It is a prerequisite for rewriting UPDATEs. 
Right now, we only rewrite DELETEs. It also handles casts and char/varchar 
types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #40325: [SPARK-42707][CONNECT][DOCS] Update developer documentation about API stability warning

2023-03-07 Thread via GitHub


yaooqinn commented on code in PR #40325:
URL: https://github.com/apache/spark/pull/40325#discussion_r1128941874


##
python/pyspark/sql/connect/__init__.py:
##
@@ -15,5 +15,4 @@
 # limitations under the License.
 #
 
-"""Currently Spark Connect is very experimental and the APIs to interact with
-Spark through this API are can be changed at any time without warning."""
+"""Spark Connect cleint"""

Review Comment:
   nit: client



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-03-07 Thread via GitHub


yaooqinn commented on PR #37880:
URL: https://github.com/apache/spark/pull/37880#issuecomment-1459335005

   thanks @shrprasa @holdenk, merged to master and branch-3.4/3.3/3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-07 Thread via GitHub


beliefer commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1459329942

   @hvanhovell Do we still need this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn closed pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-03-07 Thread via GitHub


yaooqinn closed pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user 
authentication for Spark on k8s in cluster deploy mode
URL: https://github.com/apache/spark/pull/37880


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon opened a new pull request, #40325: [SPARK-42707][CONNECT][DOCS] Update developer documentation about API stability warning

2023-03-07 Thread via GitHub


HyukjinKwon opened a new pull request, #40325:
URL: https://github.com/apache/spark/pull/40325

   ### What changes were proposed in this pull request?
   
   This PR updates the developer documentation by removing the warnings about 
API compatibility.
   
   ### Why are the changes needed?
   
   We are actually going to keep compatibility there. Such warnings create the 
   misconception that we can just break compatibility there.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's developer-only docs.
   
   ### How was this patch tested?
   
   Linters in CI should test it out.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


zhengruifeng commented on PR #40323:
URL: https://github.com/apache/spark/pull/40323#issuecomment-1459254286

   thank you all, merged into master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


LuciferYang commented on PR #40323:
URL: https://github.com/apache/spark/pull/40323#issuecomment-1459252795

   > @LuciferYang This PR fixes it in the connect planner, so it should also 
   work for the Scala client.
   
   OK, got it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


zhengruifeng closed pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql 
to return values from the command
URL: https://github.com/apache/spark/pull/40323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


zhengruifeng commented on PR #40323:
URL: https://github.com/apache/spark/pull/40323#issuecomment-1459251014

   @LuciferYang This PR fixes it in the connect planner, so it should also work 
   for the Scala client.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


LuciferYang commented on PR #40323:
URL: https://github.com/apache/spark/pull/40323#issuecomment-1459249689

   Is there a chance to add a similar case in `ClientE2ETestSuite`?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1459223471

   > Hi, @AngersZh .
   > This PR seems to have insufficient information. Could you provide more 
details about how to validate this in what environment?
   
   We ran a client-mode SparkSubmit job and it threw the exception below:
   ```
   23/03/07 18:34:50 INFO YarnClientSchedulerBackend: Shutting down all executors
   23/03/07 18:34:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   23/03/07 18:34:50 INFO YarnClientSchedulerBackend: YARN client scheduler backend Stopped
   23/03/07 18:34:50 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   23/03/07 18:34:50 INFO BlockManager: BlockManager stopped
   23/03/07 18:34:50 INFO BlockManagerMaster: BlockManagerMaster stopped
   23/03/07 18:34:50 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
   23/03/07 18:34:50 INFO SparkContext: Successfully stopped SparkContext
   Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: xxx.xxx; line 1 pos 14;
   'GlobalLimit 1
   +- 'LocalLimit 1
      +- 'Project [*]
         +- 'UnresolvedRelation [xxx, xxx], [], false
   
      at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:115)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:95)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:184)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:183)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:183)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:183)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:183)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:183)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:183)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:95)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:92)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:155)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:178)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:175)
      at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
      at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
      at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:778)
      at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
      at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
      at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
      at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
      at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:778)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
      at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:621)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:778)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:616)
      at org.apache.spark.sql.auth.QueryAuthChecker$.main(QueryAuthChecker.scala:33)
      at org.apache.spark.sql.auth.QueryAuthChecker.main(QueryAuthChecker.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   

[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

2023-03-07 Thread via GitHub


itholic commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1128906598


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -2876,6 +2876,13 @@ def test_unsupported_io_functions(self):
 with self.assertRaises(NotImplementedError):
 getattr(df.write, f)()
 
+def test_inferred_schema(self):

Review Comment:
   Can we enrich the test by also covering `schema_true` and a pandas DataFrame?
   
   such as:
   ```python
   schema_true = StructType([StructField("id", IntegerType(), True)])
   cdf2 = self.connect.createDataFrame([[1]], schema=schema_true)
   sdf2 = self.spark.createDataFrame([[1]], schema=schema_true)
   self.assertEqual(cdf2.schema, sdf2.schema)
   self.assertEqual(cdf2.collect(), sdf2.collect())
   
   pdf1 = cdf1.toPandas()
   cdf3 = self.connect.createDataFrame(pdf1, cdf1.schema)
   sdf3 = self.spark.createDataFrame(pdf1, cdf1.schema)
   self.assertEqual(cdf3.schema, sdf3.schema)
   self.assertEqual(cdf3.collect(), sdf3.collect())
   
   pdf2 = cdf2.toPandas()
   cdf4 = self.connect.createDataFrame(pdf2, cdf2.schema)
   sdf4 = self.spark.createDataFrame(pdf2, cdf2.schema)
   self.assertEqual(cdf4.schema, sdf4.schema)
   self.assertEqual(cdf4.collect(), sdf4.collect())
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allanf-db opened a new pull request, #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-07 Thread via GitHub


allanf-db opened a new pull request, #40324:
URL: https://github.com/apache/spark/pull/40324

   
   
   ### What changes were proposed in this pull request?
   Adding a Spark Connect overview page to the Spark 3.4 documentation.
   
   
   
   ### Why are the changes needed?
   The first version of Spark Connect is released as part of Spark 3.4.0 and 
this adds an overview for it to the documentation.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the user facing documentation is updated.
   
   
   
   ### How was this patch tested?
   Built the doc website locally and tested the pages.
   SKIP_SCALADOC=1 SKIP_RDOC=1 bundle exec jekyll build
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub


AngersZh commented on code in PR #40314:
URL: https://github.com/apache/spark/pull/40314#discussion_r1128902772


##
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##
@@ -1005,17 +1005,20 @@ private[spark] class SparkSubmit extends Logging {
 e
 }
 
+var exitCode = 0
+
 try {
   app.start(childArgs.toArray, sparkConf)
 } catch {
   case t: Throwable =>
+exitCode = -1
 throw findCause(t)
 } finally {
   if (args.master.startsWith("k8s") && !isShell(args.primaryResource) &&
   !isSqlShell(args.mainClass) && !isThriftServer(args.mainClass) &&
   !isConnectServer(args.mainClass)) {
 try {
-  SparkContext.getActive.foreach(_.stop())
+  SparkContext.getActive.foreach(_.stop(exitCode))

Review Comment:
   I don't know too much about the K8S scheduler, but for YARN client mode we 
   also need to keep the same exit code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub


AngersZh commented on code in PR #40314:
URL: https://github.com/apache/spark/pull/40314#discussion_r1128899260


##
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##
@@ -1005,17 +1005,20 @@ private[spark] class SparkSubmit extends Logging {
 e
 }
 
+var exitCode = 0
+
 try {
   app.start(childArgs.toArray, sparkConf)
 } catch {
   case t: Throwable =>
+exitCode = -1
 throw findCause(t)
 } finally {
   if (args.master.startsWith("k8s") && !isShell(args.primaryResource) &&
   !isSqlShell(args.mainClass) && !isThriftServer(args.mainClass) &&
   !isConnectServer(args.mainClass)) {
 try {
-  SparkContext.getActive.foreach(_.stop())
+  SparkContext.getActive.foreach(_.stop(exitCode))

Review Comment:
   The code in spark-3.1.2 is:
   ```
 if (!isShell(args.primaryResource) && !isSqlShell(args.mainClass) &&
   !isThriftServer(args.mainClass)) {
   ```
   
   It seems https://github.com/apache/spark/pull/33403 changed the behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


ueshin commented on PR #40323:
URL: https://github.com/apache/spark/pull/40323#issuecomment-1459184767

   > Is there a similar case in the Scala connect client?
   
   I haven't tried the Scala client, but yes, it would happen, and this will 
   fix both.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

2023-03-07 Thread via GitHub


panbingkun commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1459182974

   cc @itholic 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub


AngersZh commented on code in PR #40315:
URL: https://github.com/apache/spark/pull/40315#discussion_r112640


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -736,13 +736,15 @@ class SparkSession private(
   }
   // scalastyle:on
 
+  def stop(): Unit = stop(0)
+
   /**
* Stop the underlying `SparkContext`.
*
* @since 2.0.0

Review Comment:
   How about the current version?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-07 Thread via GitHub


yaooqinn commented on code in PR #40313:
URL: https://github.com/apache/spark/pull/40313#discussion_r1128885696


##
core/src/main/scala/org/apache/spark/ui/SparkUI.scala:
##
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,

Review Comment:
   This is SparkUI only, not used by HistoryServer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


jerqi commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1459167441

   > @jerqi Agree that we should have a way to specify locality preference for 
disaggregated shuffle implementations to spark scheduler - so that shuffle 
tasks are closer to the data.
   > 
   > @otterc is relooking at 
[SPARK-25299](https://issues.apache.org/jira/browse/SPARK-25299), given the 
platform changes that have gone in since it was proposed - and propose changes 
in the upcoming months. We can add this to that discussion/jira.
   > 
   > Hope that sounds fine. For now, as @pan3793 suggested, turning off reduce 
locality for Uniffle should remove the need for the code patch (once this PR is 
merged) - if there are gaps, we can address those specifically to enforce the 
behavior of `spark.shuffle.reduceLocality.enabled`.
   
   There are some gaps. If we store data in DFS or in a disaggregated shuffle 
   cluster, we should return `Nil` from `ShuffledRowRDD#getPreferredLocations`. 
   For `CoalescedPartitionSpec`, `spark.shuffle.reduceLocality.enabled` will help 
   solve this problem. For `PartialReducerPartitionSpec` and 
   `PartialMapperPartitionSpec`, `spark.shuffle.reduceLocality.enabled` can't 
   help us, because we call the method `MapOutputTracker#getMapLocation`.
   Could we modify this method in this PR like the following?
   ```
     def getMapLocation(
         dep: ShuffleDependency[_, _, _],
         startMapIndex: Int,
         endMapIndex: Int): Seq[String] = {
       val shuffleStatus = shuffleStatuses.get(dep.shuffleId).orNull
       if (shuffleStatus != null && shuffleLocalityEnabled) {
         shuffleStatus.withMapStatuses { statuses =>
           if (startMapIndex < endMapIndex &&
               (startMapIndex >= 0 && endMapIndex <= statuses.length)) {
             val statusesPicked = statuses.slice(startMapIndex, endMapIndex).filter(_ != null)
             statusesPicked.map(_.location.host).toSeq
           } else {
             Nil
           }
         }
       } else {
         Nil
       }
     }
   ```
   Or should we raise another PR to fix this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


LuciferYang commented on PR #40323:
URL: https://github.com/apache/spark/pull/40323#issuecomment-1459158853

   Is there a similar case in the Scala connect client?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub


LuciferYang commented on PR #40317:
URL: https://github.com/apache/spark/pull/40317#issuecomment-1459157408

   re-triggered GA


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


jerqi commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1459156913

   > > @jerqi locality may still have benefits when RSS works in hybrid 
deployments, besides, there is a dedicated configuration for that 
`spark.shuffle.reduceLocality.enabled`
   > 
   > `ShuffledRdd#getPreferredLocations` will call the method 
`MapOutputTracker#getPreferredLocationsForShuffle` and 
`MapOutputTracker#getMapLocation`. For `MapOutpuTracker#getMapLocation`, 
`spark.shuffle.reduceLocality.enabled` is useless.
   
   My mistake. I meant `ShuffledRowRDD#getPreferredLocations` rather than 
   `ShuffledRDD#getPreferredLocations`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub


HyukjinKwon commented on code in PR #40315:
URL: https://github.com/apache/spark/pull/40315#discussion_r1128823249


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -736,13 +736,15 @@ class SparkSession private(
   }
   // scalastyle:on
 
+  def stop(): Unit = stop(0)
+
   /**
    * Stop the underlying `SparkContext`.
    *
    * @since 2.0.0

Review Comment:
   You should add new docs for the `exitCode` variant, with a new `@since` tag.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub


HyukjinKwon commented on code in PR #40315:
URL: https://github.com/apache/spark/pull/40315#discussion_r1128822955


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -736,13 +736,15 @@ class SparkSession private(
   }
   // scalastyle:on
 
+  def stop(): Unit = stop(0)
+
   /**
    * Stop the underlying `SparkContext`.
    *
    * @since 2.0.0
    */
-  def stop(): Unit = {
-    sparkContext.stop()
+  def stop(exitCode: Int = 0): Unit = {

Review Comment:
   Can you remove the default value here?
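   
   For illustration, a minimal sketch of the requested shape: two explicit overloads instead of a default argument. The second `@since` tag is a placeholder, and the `SparkContext.stop(exitCode)` call assumes the matching change from the sibling PR, so treat this as a sketch rather than the final code:
   
   ```scala
   /**
    * Stop the underlying `SparkContext`.
    * @since 2.0.0
    */
   def stop(): Unit = stop(0)
   
   /**
    * Stop the underlying `SparkContext`, propagating the given exit code.
    * @since 3.5.0  // placeholder version tag, per the review request
    */
   def stop(exitCode: Int): Unit = sparkContext.stop(exitCode)
   ```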






[GitHub] [spark] itholic commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-07 Thread via GitHub


itholic commented on PR #40282:
URL: https://github.com/apache/spark/pull/40282#issuecomment-1459096241

   Just created a ticket for the SQL side: https://issues.apache.org/jira/browse/SPARK-42706.





[GitHub] [spark] mridulm commented on a diff in pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub


mridulm commented on code in PR #40307:
URL: https://github.com/apache/spark/pull/40307#discussion_r1128807716


##
core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala:
##
@@ -456,7 +456,9 @@ class SparkListenerSuite extends SparkFunSuite with LocalSparkContext with Match
     val conf = new SparkConf().setMaster("local").setAppName("test")
       .set(EXTRA_LISTENERS, listeners.map(_.getName))
     sc = new SparkContext(conf)
-    sc.listenerBus.listeners.asScala.count(_.isInstanceOf[BasicJobCounter]) should be (1)
+    // sc.listenerBus.listeners.asScala.count(_.isInstanceOf[BasicJobCounter]) should be (1)
+    assert(sc.listenerBus.listeners.asScala.count(_.isInstanceOf[BasicJobCounter]) == 1,
+      s"listeners = ${sc.listenerBus.listeners.asScala.mkString(", ")}")

Review Comment:
   This change, and the change to `LogUrlsStandaloneSuite`, was to debug why these assertions failed.
   But it looks like the test is probably just flaky (on my local box), so I'm removing them.






[GitHub] [spark] ivoson commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-07 Thread via GitHub


ivoson commented on code in PR #40286:
URL: https://github.com/apache/spark/pull/40286#discussion_r1128798692


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4572,6 +4572,48 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     }
   }
 
+  test("SPARK-42577: fail the job if a shuffle map stage attempts beyond the limitation") {
+    setupStageAbortTest(sc)
+    doAnswer(_ => 1).when(scheduler).maxStageAttempts

Review Comment:
   Thanks, will make the change.






[GitHub] [spark] HyukjinKwon closed pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-07 Thread via GitHub


HyukjinKwon closed pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix 
createDataFrame to autogenerate missing column names
URL: https://github.com/apache/spark/pull/40310





[GitHub] [spark] HyukjinKwon commented on pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-07 Thread via GitHub


HyukjinKwon commented on PR #40310:
URL: https://github.com/apache/spark/pull/40310#issuecomment-1459084648

   Merged to master and branch-3.4.





[GitHub] [spark] github-actions[bot] commented on pull request #38740: [SQL] Add product encoders for local classes

2023-03-07 Thread via GitHub


github-actions[bot] commented on PR #38740:
URL: https://github.com/apache/spark/pull/38740#issuecomment-1459074618

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





[GitHub] [spark] ueshin opened a new pull request, #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub


ueshin opened a new pull request, #40323:
URL: https://github.com/apache/spark/pull/40323

   ### What changes were proposed in this pull request?
   
   Fixes `spark.sql` to return values from the command.
   
   ### Why are the changes needed?
   
   Currently `spark.sql` doesn't return results from commands:
   
   ```py
   >>> spark.sql("show functions").show()
   +--------+
   |function|
   +--------+
   +--------+
   ```
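   
   For illustration, after the fix the same call should surface the actual rows. The entries and ordering below are illustrative and depend on the session; they are not taken from the patch:
   
   ```py
   >>> spark.sql("show functions").show(3)
   +--------+
   |function|
   +--------+
   |       !|
   |      !=|
   |       %|
   +--------+
   only showing top 3 rows
   ```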
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes: `spark.sql` with commands will now return their results.
   
   ### How was this patch tested?
   
   Added a test.





[GitHub] [spark] rithwik-db opened a new pull request, #40322: Added small fix

2023-03-07 Thread via GitHub


rithwik-db opened a new pull request, #40322:
URL: https://github.com/apache/spark/pull/40322

   
   ### What changes were proposed in this pull request?
   
   I improved how the error is presented, so it is clear to the reader instead of confusing.
   
   ### Why are the changes needed?
   
   
   User experience.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Only the error message shown to the user changes.
   
   ### How was this patch tested?
   
   Tested it out locally.





[GitHub] [spark] itholic closed pull request #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.

2023-03-07 Thread via GitHub


itholic closed pull request #40288: [SPARK-42496][CONNECT][DOCS] Introduction 
Spark Connect at main page.
URL: https://github.com/apache/spark/pull/40288





[GitHub] [spark] itholic commented on pull request #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.

2023-03-07 Thread via GitHub


itholic commented on PR #40288:
URL: https://github.com/apache/spark/pull/40288#issuecomment-1459047783

   Let me close this for now, since the contents of this PR will be included in the upcoming Spark Connect documentation.
   cc @allanf-db  FYI





[GitHub] [spark] ryan-johnson-databricks commented on pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-07 Thread via GitHub


ryan-johnson-databricks commented on PR #40321:
URL: https://github.com/apache/spark/pull/40321#issuecomment-1459045889

   Something went wrong with [Run spark on kubernetes integration 
test](https://github.com/ryan-johnson-databricks/spark/actions/runs/4358877500/jobs/7620022040):
   ```
   [info] *** Test still running after 2 minutes, 57 seconds: suite name: KubernetesSuite, test name: Test decommissioning with dynamic allocation & shuffle cleanups.
   [info] - Test decommissioning with dynamic allocation & shuffle cleanups (3 minutes, 3 seconds)
   [info] - Test decommissioning timeouts (1 minute)
   [info] - SPARK-37576: Rolling decommissioning (1 minute, 11 seconds)
   [info] org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED *** (25 minutes, 32 seconds)
   [info]   io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.49.2:8443/api/v1/namespaces. Message: object is being deleted: namespaces "spark-6bff7607e9884740a4bac53b1fb655ae" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=namespaces, name=spark-6bff7607e9884740a4bac53b1fb655ae, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=object is being deleted: namespaces "spark-6bff7607e9884740a4bac53b1fb655ae" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
   [info]   at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
   [info]   at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
   ...
   [info]   at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
   [info]   at org.apache.spark.deploy.k8s.integrationtest.KubernetesTestComponents.createNamespace(KubernetesTestComponents.scala:51)
   [info]   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.setUpTest(KubernetesSuite.scala:202)
   ...
   [info]   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.runTest(KubernetesSuite.scala:45)
   ...
   [info]   Cause: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.49.2:8443/api/v1/namespaces. Message: object is being deleted: namespaces "spark-6bff7607e9884740a4bac53b1fb655ae" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=namespaces, name=spark-6bff7607e9884740a4bac53b1fb655ae, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=object is being deleted: namespaces "spark-6bff7607e9884740a4bac53b1fb655ae" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
   ```
   (not sure how that could be related to this PR?)





[GitHub] [spark] eric-maynard closed pull request #39519: [SPARK-41995][SQL] Accept non-foldable expressions in schema_of_json

2023-03-07 Thread via GitHub


eric-maynard closed pull request #39519: [SPARK-41995][SQL] Accept non-foldable 
expressions in schema_of_json
URL: https://github.com/apache/spark/pull/39519





[GitHub] [spark] eric-maynard commented on pull request #39519: [SPARK-41995][SQL] Accept non-foldable expressions in schema_of_json

2023-03-07 Thread via GitHub


eric-maynard commented on PR #39519:
URL: https://github.com/apache/spark/pull/39519#issuecomment-1458985753

   Closing for inactivity





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub


dongjoon-hyun commented on code in PR #40314:
URL: https://github.com/apache/spark/pull/40314#discussion_r1128691347


##
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:
##
@@ -1005,17 +1005,20 @@ private[spark] class SparkSubmit extends Logging {
         e
     }
 
+    var exitCode = 0
+
     try {
       app.start(childArgs.toArray, sparkConf)
     } catch {
       case t: Throwable =>
+        exitCode = -1
         throw findCause(t)
     } finally {
       if (args.master.startsWith("k8s") && !isShell(args.primaryResource) &&
           !isSqlShell(args.mainClass) && !isThriftServer(args.mainClass) &&
           !isConnectServer(args.mainClass)) {
         try {
-          SparkContext.getActive.foreach(_.stop())
+          SparkContext.getActive.foreach(_.stop(exitCode))

Review Comment:
   I don't think this is related to the YARN AM, because this is guarded by `if (args.master.startsWith("k8s")`. Is this a K8s patch rather than a YARN AM one?






[GitHub] [spark] vitaliili-db commented on pull request #40295: [SPARK-42681] Relax ordering constraint for ALTER TABLE ADD|REPLACE column options

2023-03-07 Thread via GitHub


vitaliili-db commented on PR #40295:
URL: https://github.com/apache/spark/pull/40295#issuecomment-1458880038

   @gengliangwang great catch; yes, we should follow the standard. Renamed.





[GitHub] [spark] HeartSaVioR closed pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-07 Thread via GitHub


HeartSaVioR closed pull request #40215: [SPARK-42591][SS][DOCS] Add examples of 
unblocked workloads after SPARK-42376
URL: https://github.com/apache/spark/pull/40215





[GitHub] [spark] HeartSaVioR commented on pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-07 Thread via GitHub


HeartSaVioR commented on PR #40215:
URL: https://github.com/apache/spark/pull/40215#issuecomment-1458815185

   Thanks for reviewing! Merging to master.




