Re: [PR] [SPARK-47821][SQL] Implement is_variant_null expression [spark]

2024-04-15 Thread via GitHub


harshmotw-db commented on PR #46011:
URL: https://github.com/apache/spark/pull/46011#issuecomment-2058281936

   @cloud-fan Resolved





Re: [PR] [SPARK-47420][SQL] Fix test output [spark]

2024-04-15 Thread via GitHub


cloud-fan closed pull request #46058: [SPARK-47420][SQL] Fix test output
URL: https://github.com/apache/spark/pull/46058





Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1566741860


##
python/pyspark/sql/functions/builtin.py:
##
@@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
     >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
     [Row(s=['one', 'two', 'three', ''])]
     """
-    return _invoke_function("split", _to_java_column(str), pattern, limit)
+    pattern = pattern if isinstance(pattern, Column) else lit(pattern)
+    limit = lit(limit) if isinstance(limit, int) else limit

Review Comment:
   yea, we can use it for `pattern`






Re: [PR] [SPARK-47420][SQL] Fix test output [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #46058:
URL: https://github.com/apache/spark/pull/46058#issuecomment-2058279111

   the docker test failure is unrelated, merging to master, thanks!





Re: [PR] [SPARK-47821][SQL] Implement is_variant_null expression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #46011:
URL: https://github.com/apache/spark/pull/46011#issuecomment-2058278099

   there are code conflicts again...





Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]

2024-04-15 Thread via GitHub


wForget commented on PR #45589:
URL: https://github.com/apache/spark/pull/45589#issuecomment-2058275927

   > @wForget can you help to create a 3.5 backport PR? thanks!
   
   Sure, I will create it as soon as possible, and thanks for your review.





Re: [PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]

2024-04-15 Thread via GitHub


cloud-fan closed pull request #45934: [SPARK-47769][SQL] Add 
schema_of_variant_agg expression.
URL: https://github.com/apache/spark/pull/45934





Re: [PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #45934:
URL: https://github.com/apache/spark/pull/45934#issuecomment-2058274239

   thanks, merging to master!





Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1566736409


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR}
+
+/**
+ * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` with `t1.id <=> t2.id`
+ * in a join condition for better performance.
+ */
+object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
+    _.containsPattern(JOIN), ruleId) {
+    case Join(left, right, joinType, condition, hint) if condition.nonEmpty =>
+      val newCondition = condition.map(_.transformWithPruning(_.containsPattern(OR), ruleId) {
+        case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))
+            if (l.semanticEquals(c1) && r.semanticEquals(c2))
+              || (l.semanticEquals(c2) && r.semanticEquals(c1)) =>
+          EqualNullSafe(l, r)
+        case Or(And(IsNull(c1), IsNull(c2)), EqualTo(l, r))
+            if (l.semanticEquals(c1) && r.semanticEquals(c2))
+              || (l.semanticEquals(c2) && r.semanticEquals(c1)) =>
+          EqualNullSafe(l, r)
+      })
+      Join(left, right, joinType, newCondition, hint)

Review Comment:
   nit: `j.copy(condition = newCondition)`
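
For reference, the rewrite this rule performs hinges on the semantics of `<=>` (EqualNullSafe): it evaluates to true when both sides are NULL, which is exactly what the OR-of-IsNulls pattern adds on top of `=`. A minimal sketch of the two equivalent conditions, assuming table aliases `t1`/`t2`:

```scala
import org.apache.spark.sql.functions.expr

// The verbose null-safe pattern the optimizer looks for...
val verbose = expr("t1.id = t2.id OR (t1.id IS NULL AND t2.id IS NULL)")
// ...and the single comparison it is rewritten to.
val nullSafe = expr("t1.id <=> t2.id")
```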






Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #45589:
URL: https://github.com/apache/spark/pull/45589#issuecomment-2058271672

   @wForget can you help to create a 3.5 backport PR? thanks!





Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]

2024-04-15 Thread via GitHub


cloud-fan closed pull request #45589: [SPARK-47463][SQL] Use V2Predicate to 
wrap expression with return type of boolean
URL: https://github.com/apache/spark/pull/45589





Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #45589:
URL: https://github.com/apache/spark/pull/45589#issuecomment-2058270805

   thanks, merging to master!





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566734238


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##
@@ -160,6 +160,8 @@ case class DataFrameQueryContext(
   val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("")
   val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("")
 
+  PySparkCurrentOrigin.clear()

Review Comment:
   can we look into it? `DataFrameQueryContext` has nothing to do with 
`PySparkCurrentOrigin`, as all the values are materialized. It seems we rely on 
weird timing of when `DataFrameQueryContext` is instantiated, which is fragile.






Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1565663205


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/origin.scala:
##
@@ -76,6 +76,12 @@ object CurrentOrigin {
   value.get.copy(line = Some(line), startPosition = Some(start)))
   }
 
+  def setPySparkErrorContext(pysparkFragment: String, pysparkCallSite: String): Unit = {
+    val tupleInfo = (pysparkFragment, pysparkCallSite)
+    value.set(
+      value.get.copy(pysparkErrorContext = Some(tupleInfo)))
+  }
+

Review Comment:
   cc @HyukjinKwon @cloud-fan 
   
   Try reusing the existing `CurrentOrigin` here so we don't need to add a separate ThreadLocal, and also don't need to add helper functions for each existing method.
   
   Some test cases still need to be addressed, but I believe the overall structure is ready for review.






Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]

2024-04-15 Thread via GitHub


rangadi commented on code in PR #44323:
URL: https://github.com/apache/spark/pull/44323#discussion_r1566712758


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
##
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
           attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
           condition,
           eventTimeWatermarkForEviction)
-        val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-        val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-        expr.map(JoinStateValueWatermarkPredicate.apply _)
+
+        // For example, if the condition is of the form:
+        //   left_time > right_time + INTERVAL 30 MINUTES
+        // Then this extracts left_time and right_time.
+        val attributesInCondition = AttributeSet(
+          condition.get.collect { case a: AttributeReference => a }
+        )
+
+        // Construct an AttributeSet so that we can perform equality between attributes,
+        // which we do in the filter below.
+        val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+        // oneSideInputAttributes could be [left_value, left_time], and we just
+        // want the attribute _in_ the time-interval condition.
+        val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+          oneSideInputAttributeSet.contains(a)

Review Comment:
   What ensures this is the event-time attribute? Is this assured to be 
`left_time` mentioned in the comment?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
##
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
           attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
           condition,
           eventTimeWatermarkForEviction)
-        val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
-        val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
-        expr.map(JoinStateValueWatermarkPredicate.apply _)
+
+        // For example, if the condition is of the form:
+        //   left_time > right_time + INTERVAL 30 MINUTES
+        // Then this extracts left_time and right_time.
+        val attributesInCondition = AttributeSet(
+          condition.get.collect { case a: AttributeReference => a }
+        )
+
+        // Construct an AttributeSet so that we can perform equality between attributes,
+        // which we do in the filter below.
+        val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+        // oneSideInputAttributes could be [left_value, left_time], and we just
+        // want the attribute _in_ the time-interval condition.
+        val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+          oneSideInputAttributeSet.contains(a)
+        }
+
+        // There should be a single attribute per side in the time-interval condition, so,

Review Comment:
   Where is this part ensured? ('single attribute per side').



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
##
@@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging {
           attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
           condition,
           eventTimeWatermarkForEviction)
-        val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))

Review Comment:
   Where is the equivalent of this statement? This looks like the event-time 
attribute from one side (left side in examples).
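
For context, a minimal sketch of the shape of query this code path handles — one watermarked side plus a time-interval join condition, matching the PR title. The stream and column names are assumed for illustration, not taken from the PR:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical streaming DataFrames `left` and `right` with event-time
// columns left_time/right_time; only one side carries a watermark.
val joined = left
  .withWatermark("left_time", "1 hour")
  .join(right, expr("left_id = right_id AND left_time > right_time + INTERVAL 30 MINUTES"))
```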




Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566715537


##
python/pyspark/errors/utils.py:
##
@@ -119,3 +124,59 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)

Review Comment:
   I wanted to mention the JVM side. Let me clarify it.






Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566715097


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -1319,6 +1289,9 @@ def test_dataframe_error_context(self):
 
 
 class DataFrameTests(DataFrameTestsMixin, ReusedSQLTestCase):
+    def test_dataframe_error_context(self):

Review Comment:
   Will remove. It was added for local testing purposes.






Re: [PR] [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on code in PR #46012:
URL: https://github.com/apache/spark/pull/46012#discussion_r1566710260


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala:
##
@@ -381,6 +405,53 @@ case class SessionHolder(userId: String, sessionId: String, session: SparkSession
    */
   private[connect] val pythonAccumulator: Option[PythonAccumulator] =
     Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption
+
+  /**
+   * Transform a relation into a logical plan, using the plan cache if enabled.
+   * The plan cache is enabled only if `spark.connect.session.planCache.maxSize` is greater than zero
+   * AND `spark.connect.session.planCache.enabled` is true.
+   * @param rel The relation to transform.
+   * @param cachePlan Whether to cache the result logical plan.
+   * @param transform Function to transform the relation into a logical plan.
+   * @return The logical plan.
+   */
+  private[connect] def usePlanCache(rel: proto.Relation, cachePlan: Boolean)(

Review Comment:
   yea, I forgot the new `plan_id`. Then it's fine
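
For readers following the thread, a hedged sketch of how a caller might go through this helper, given the curried signature in the diff above (`planner.transformRelation` is an assumption, not shown in this hunk):

```scala
// The second argument list supplies the transform that runs on a cache miss;
// `cachePlan` controls whether the resulting logical plan is stored.
val plan = sessionHolder.usePlanCache(rel, cachePlan = true) { r =>
  planner.transformRelation(r)
}
```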






Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566707884


##
python/pyspark/errors/utils.py:
##
@@ -119,3 +124,59 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)

Review Comment:
   > server side
   
   sorry, probably I don't have enough context. Does this implementation also support Spark Connect?






Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566706967


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -1319,6 +1289,9 @@ def test_dataframe_error_context(self):
 
 
 class DataFrameTests(DataFrameTestsMixin, ReusedSQLTestCase):
+    def test_dataframe_error_context(self):

Review Comment:
   why we need this test?






Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]

2024-04-15 Thread via GitHub


viirya commented on PR #3:
URL: https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2058226232

   The vote passed today. I created `kubernetes-operator-0.1.0` in the Spark JIRA. All Spark k8s operator related JIRA tickets can use this version now.





Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566705061


##
python/pyspark/errors/utils.py:
##
@@ -119,3 +124,59 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+    This function updates the thread-local storage from server side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    func_name : str

Review Comment:
   ```suggestion
   fragment : str
   ```






Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]

2024-04-15 Thread via GitHub


wForget commented on code in PR #45589:
URL: https://github.com/apache/spark/pull/45589#discussion_r1566701178


##
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##
@@ -966,6 +966,22 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveSparkPlanHelper {
       )
     }
   }
+
+  test("SPARK-47463: Pushed down v2 filter with if expression") {
+    withTempView("t1") {
+      spark.read.format(classOf[AdvancedDataSourceV2WithV2Filter].getName).load()
+        .createTempView("t1")
+      val df1 = sql(
+        s"""
+           |select * from
+           |(select if(i = 1, i, 0) as c from t1) t
+           |where t.c > 0

Review Comment:
   > I think `where if(i = 1, i, 0) > 0` is valid SQL? BTW, please upper-case the SQL keywords in the SQL statement.
   
   Indeed, thank you. I have changed it.






[PR] [DO-NOT-REVIEW] Speed up test_parity_listener [spark]

2024-04-15 Thread via GitHub


WweiL opened a new pull request, #46072:
URL: https://github.com/apache/spark/pull/46072

   This PR makes test_parity_listener run faster.
   
   The test was slow because `TestListenerSparkV1` and `TestListenerSparkV2` make server calls and have long wait times, and the test runs on both listeners. They were created to verify the listener function with and without the new `onQueryIdle` callback.
   
   This PR fixes the slowness by removing the V1 and V2 variants of that listener (now only a `TestListenerSpark`) and creating lightweight `TestListenerLocalV1` and `TestListenerLocalV2` for the `onQueryIdle` verification.
   
   
   





Re: [PR] [SPARK-47862][PYTHON][CONNECT] Fix generation of proto files [spark]

2024-04-15 Thread via GitHub


HyukjinKwon closed pull request #46068: [SPARK-47862][PYTHON][CONNECT] Fix
generation of proto files
URL: https://github.com/apache/spark/pull/46068





Re: [PR] [SPARK-47862][PYTHON][CONNECT] Fix generation of proto files [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #46068:
URL: https://github.com/apache/spark/pull/46068#issuecomment-2058191383

   Merged to master.





Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566678165


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##
@@ -160,6 +160,8 @@ case class DataFrameQueryContext(
   val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("")
   val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("")
 
+  PySparkCurrentOrigin.clear()

Review Comment:
   I tried get-and-clear at first, but it does not seem to work as expected:
   
   ```scala
   def getAndClear(): Option[(String, String)] = {
     val context = pysparkErrorContext.get()
     // Clear the context after retrieving it
     pysparkErrorContext.remove()
     context
   }
   ```
   The error context goes missing somewhere, so it raises an exception:
   
   ```
   Expected PySpark fragment was 'divide', got ''
   '' != 'divide'
   
   Expected :'divide'
   Actual   :''
   ```
   So I put the clear here, right after getting the `pysparkFragment` and `pysparkCallSite`, and it works as expected.
   
   I guess the item is removed when clearing the ThreadLocal?
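
For readers without the PR open, a minimal sketch of the kind of thread-local holder under discussion. The object name and exact shape here are assumptions for illustration, not Spark's actual `PySparkCurrentOrigin`:

```scala
object PySparkErrorContextHolder {
  // Holds (fragment, callSite) for the current thread; None when unset.
  private val ctx = new ThreadLocal[Option[(String, String)]] {
    override def initialValue(): Option[(String, String)] = None
  }
  def set(fragment: String, callSite: String): Unit = ctx.set(Some((fragment, callSite)))
  def get(): Option[(String, String)] = ctx.get()
  def clear(): Unit = ctx.remove()
}
```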









Re: [PR] [SPARK-47862][PYTHON][CONNECT] Fix generation of proto files [spark]

2024-04-15 Thread via GitHub


grundprinzip commented on code in PR #46068:
URL: https://github.com/apache/spark/pull/46068#discussion_r158212


##
dev/connect-gen-protos.sh:
##
@@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do
 done
 
 # Clean up everything.
-rm -Rf gen
+# rm -Rf gen

Review Comment:
   done






Re: [PR] [SPARK-47857][SQL] Utilize `java.sql.RowId.getBytes` API directly for UTF8String [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on code in PR #46062:
URL: https://github.com/apache/spark/pull/46062#discussion_r157372


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -467,12 +467,8 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
     case StringType if metadata.contains("rowid") =>
       (rs: ResultSet, row: InternalRow, pos: Int) =>
-        val rawRowId = rs.getRowId(pos + 1)
-        if (rawRowId == null) {
-          row.update(pos, null)
-        } else {
-          row.update(pos, UTF8String.fromString(rawRowId.toString))
-        }
+        val id = nullSafeConvert[RowId](rs.getRowId(pos + 1), r => UTF8String.fromBytes(r.getBytes))

Review Comment:
   Got it. Thank you for the further investigation to confirm this.






Re: [PR] [SPARK-47081][CONNECT][FOLLOW] Unflake Progress Execution [spark]

2024-04-15 Thread via GitHub


grundprinzip commented on PR #46060:
URL: https://github.com/apache/spark/pull/46060#issuecomment-2058171185

   > LGTM - pending tests.
   
   In my manual testing, the issue arises simply when running our existing tests: they fail in the Python client code in `reattach.py` because the `result_complete` message is not the last one in the stream.
   
   This behavior is fixed by
   
   ```
   /**
    * Atomically submits a response and marks the stream as completed.
    */
   def onNextComplete(r: T): Unit = responseLock.synchronized {
     if (!tryOnNext(r)) {
       throw new IllegalStateException("Stream onNext can't be called after stream completed")
     }
     onCompleted()
   }
   ```
   
   Previously, these were two calls that were individually guarded; now they are guarded together.





Re: [PR] [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun closed pull request #46069: [SPARK-46335][BUILD][3.5] Upgrade 
Maven to 3.9.6
URL: https://github.com/apache/spark/pull/46069





Re: [PR] [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46069:
URL: https://github.com/apache/spark/pull/46069#issuecomment-2058170454

   Thank you, @yaooqinn !  
   Merged to branch-3.5 for Apache Spark 3.5.2+





Re: [PR] [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun closed pull request #46067: [SPARK-47861][BUILD] Upgrade `slf4j` 
to 2.0.13
URL: https://github.com/apache/spark/pull/46067





Re: [PR] [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46067:
URL: https://github.com/apache/spark/pull/46067#issuecomment-2058169845

   Thank you, @yaooqinn !
   Merged to master for Apache Spark 4.0.0.





Re: [PR] [SPARK-47862][PYTHON][CONNECT] Fix generation of proto files [spark]

2024-04-15 Thread via GitHub


grundprinzip commented on code in PR #46068:
URL: https://github.com/apache/spark/pull/46068#discussion_r154745


##
dev/connect-gen-protos.sh:
##
@@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do
 done
 
 # Clean up everything.
-rm -Rf gen
+# rm -Rf gen

Review Comment:
   ```suggestion
   rm -Rf gen
   ```






Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-15 Thread via GitHub


HeartSaVioR closed pull request #46035: [SPARK-47840][SS] Disable foldable 
propagation across Streaming Aggregate/Join nodes
URL: https://github.com/apache/spark/pull/46035





Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-15 Thread via GitHub


HeartSaVioR commented on PR #46035:
URL: https://github.com/apache/spark/pull/46035#issuecomment-2058168299

   Thanks! Merging to master/3.5.





Re: [PR] [SPARK-47857][SQL] Utilize `java.sql.RowId.getBytes` API directly for UTF8String [spark]

2024-04-15 Thread via GitHub


yaooqinn commented on code in PR #46062:
URL: https://github.com/apache/spark/pull/46062#discussion_r150692


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -467,12 +467,8 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
     case StringType if metadata.contains("rowid") =>
      (rs: ResultSet, row: InternalRow, pos: Int) =>
-        val rawRowId = rs.getRowId(pos + 1)
-        if (rawRowId == null) {
-          row.update(pos, null)
-        } else {
-          row.update(pos, UTF8String.fromString(rawRowId.toString))
-        }
+        val id = nullSafeConvert[RowId](rs.getRowId(pos + 1), r => UTF8String.fromBytes(r.getBytes))

Review Comment:
   When I say 'practical', I mean RowIds are alphanumeric.






Re: [PR] [SPARK-47862][PYTHON][CONNECT] Fix generation of proto files [spark]

2024-04-15 Thread via GitHub


grundprinzip commented on code in PR #46068:
URL: https://github.com/apache/spark/pull/46068#discussion_r150150


##
dev/connect-gen-protos.sh:
##
@@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do
 done
 
 # Clean up everything.
-rm -Rf gen
+# rm -Rf gen

Review Comment:
   yes, my bad. 






Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


panbingkun commented on code in PR #46022:
URL: https://github.com/apache/spark/pull/46022#discussion_r150002


##
connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala:
##
@@ -325,7 +327,8 @@ private[spark] class DirectKafkaInputDStream[K, V](
 
     override def restore(): Unit = {
       batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
-        logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
+        logInfo(log"Restoring KafkaRDD for time ${MDC(TIME, t)} " +

Review Comment:
   Yeah, here `t` is an instance of `Time`, and `Time` defaults to rendering itself with a time unit of `ms`, as follows:
   
https://github.com/apache/spark/blob/e815012f26ab305030b170eb2f0aa28d2de843b6/streaming/src/main/scala/org/apache/spark/streaming/Time.scala#L87






Re: [PR] [SPARK-47856][SQL] Document Mapping Spark SQL Data Types from Oracle and add tests [spark]

2024-04-15 Thread via GitHub


yaooqinn commented on PR #46059:
URL: https://github.com/apache/spark/pull/46059#issuecomment-2058160869

   Thank you @dongjoon-hyun 





Re: [PR] [SPARK-47233][CONNECT][SS][2/2] Client & Server logic for Client side streaming query listener [spark]

2024-04-15 Thread via GitHub


HyukjinKwon closed pull request #46037: [SPARK-47233][CONNECT][SS][2/2] Client 
& Server logic for Client side streaming query listener
URL: https://github.com/apache/spark/pull/46037





Re: [PR] [SPARK-47233][CONNECT][SS][2/2] Client & Server logic for Client side streaming query listener [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #46037:
URL: https://github.com/apache/spark/pull/46037#issuecomment-2058156554

   Merged to master.





Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-15 Thread via GitHub


HeartSaVioR commented on PR #46035:
URL: https://github.com/apache/spark/pull/46035#issuecomment-2058156425

   @cloud-fan Could you please have a quick look at the change? I reviewed the 
test suite.





Re: [PR] [SPARK-47371] [SQL] XML: Ignore row tags found in CDATA [spark]

2024-04-15 Thread via GitHub


HyukjinKwon closed pull request #45487: [SPARK-47371] [SQL] XML: Ignore row 
tags found in CDATA
URL: https://github.com/apache/spark/pull/45487





Re: [PR] [SPARK-47371] [SQL] XML: Ignore row tags found in CDATA [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #45487:
URL: https://github.com/apache/spark/pull/45487#issuecomment-2058155599

   Merged to master.





Re: [PR] [SPARK-47857][SQL] Utilize `java.sql.RowId.getBytes` API directly for UTF8String [spark]

2024-04-15 Thread via GitHub


yaooqinn commented on code in PR #46062:
URL: https://github.com/apache/spark/pull/46062#discussion_r1566653317


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -467,12 +467,8 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
     case StringType if metadata.contains("rowid") =>
       (rs: ResultSet, row: InternalRow, pos: Int) =>
-        val rawRowId = rs.getRowId(pos + 1)
-        if (rawRowId == null) {
-          row.update(pos, null)
-        } else {
-          row.update(pos, UTF8String.fromString(rawRowId.toString))
-        }
+        val id = nullSafeConvert[RowId](rs.getRowId(pos + 1), r => UTF8String.fromBytes(r.getBytes))

Review Comment:
   The assumption might be practical, as JDBC clients use the platform's default charset. However, I guess we cannot make such an assumption. I might close this first to see if we can apply the charset suitably for the client-server encoding conversion.
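
To illustrate why the charset matters here, a small self-contained example (not tied to any particular JDBC driver) showing the same RowId bytes decoding differently under different charsets:

```scala
import java.nio.charset.StandardCharsets

// "ABé" encoded in UTF-8; the trailing byte pair decodes differently under
// Latin-1, which is the risk with a raw bytes-to-string RowId conversion.
val bytes = Array[Byte](0x41, 0x42, 0xC3.toByte, 0xA9.toByte)
val utf8 = new String(bytes, StandardCharsets.UTF_8)        // "ABé"
val latin1 = new String(bytes, StandardCharsets.ISO_8859_1) // "ABÃ©"
```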






Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46070:
URL: https://github.com/apache/spark/pull/46070#issuecomment-2058151957

   Thank you, @HyukjinKwon !





Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]

2024-04-15 Thread via GitHub


HyukjinKwon closed pull request #46070: [SPARK-47866][SQL][TESTS] Use explicit 
GC in `PythonForeachWriterSuite`
URL: https://github.com/apache/spark/pull/46070





Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #46070:
URL: https://github.com/apache/spark/pull/46070#issuecomment-2058150429

   Merged to master.





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2058148979

   Because I called `PySparkCurrentOrigin` directly in `DataFrameQueryContext` without utilizing `withOrigin` in the initial implementation. I realized it from a recent review on the refactoring PR, so I'm currently trying to reintroduce `PySparkCurrentOrigin` there.





Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46070:
URL: https://github.com/apache/spark/pull/46070#issuecomment-2058148483

   Could you review this PR, @HyukjinKwon? This is a best-effort approach to mitigate the Apple Silicon CI flakiness issue.





Re: [PR] [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46067:
URL: https://github.com/apache/spark/pull/46067#issuecomment-2058142527

   Could you review this `slf4j` dependency PR when you have some time, @viirya?





Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-15 Thread via GitHub


CTCC1 commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1566639687


##
python/pyspark/sql/functions/builtin.py:
##
@@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int 
= -1) -> Column:
 >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
 [Row(s=['one', 'two', 'three', ''])]
 """
-return _invoke_function("split", _to_java_column(str), pattern, limit)
+pattern = pattern if isinstance(pattern, Column) else lit(pattern)
+limit = lit(limit) if isinstance(limit, int) else limit

Review Comment:
   `limit` can also be of type `str` referring to a column, hence the check to 
avoid making the column name a string literal.
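
   For context, a minimal Scala sketch of the end state this PR targets, where 
both `pattern` and `limit` come from columns (the Column overloads are this 
PR's proposal, so treat the signatures as assumptions):

   ```
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{col, split}

   val spark = SparkSession.builder().getOrCreate()
   import spark.implicits._

   // pattern and limit are read from columns instead of being literals
   val df = Seq(("oneAtwoBthree", "[AB]", 2)).toDF("s", "p", "l")
   df.select(split(col("s"), col("p"), col("l"))).show(false)
   ```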



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47833][SQL][CORE] Supply caller stacktrace for checkAndGlobPathIfNecessary AnalysisException [spark]

2024-04-15 Thread via GitHub


pan3793 commented on PR #46028:
URL: https://github.com/apache/spark/pull/46028#issuecomment-2058132732

   cc @gengliangwang @LuciferYang @mridulm WDYT of this approach for stacktrace 
enhancement? Or do you have other suggestions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46935][DOCS] Consolidate error documentation [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #44971:
URL: https://github.com/apache/spark/pull/44971#issuecomment-2058128551

   let's fix conflicts and move forward, thanks for the work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47767][SQL] Show offset value in TakeOrderedAndProjectExec [spark]

2024-04-15 Thread via GitHub


guixiaowen commented on code in PR #45931:
URL: https://github.com/apache/spark/pull/45931#discussion_r1566636044


##
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala:
##
@@ -358,7 +358,9 @@ case class TakeOrderedAndProjectExec(
 val orderByString = truncatedString(sortOrder, "[", ",", "]", maxFields)
 val outputString = truncatedString(output, "[", ",", "]", maxFields)
 
-s"TakeOrderedAndProject(limit=$limit, orderBy=$orderByString, 
output=$outputString)"
+val offsetStr = if (offset > 0) { s" offset=$offset, " } else { "" }

Review Comment:
   > I assume we only need to skip when offset = 0?
   
   @amaliujia Thank you for your review. I made some modifications. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46810][DOCS] Align error class terminology with SQL standard [spark]

2024-04-15 Thread via GitHub


cloud-fan closed pull request #44902: [SPARK-46810][DOCS] Align error class 
terminology with SQL standard
URL: https://github.com/apache/spark/pull/44902


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46810][DOCS] Align error class terminology with SQL standard [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #44902:
URL: https://github.com/apache/spark/pull/44902#issuecomment-2058126794

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1566629222


##
python/pyspark/sql/functions/builtin.py:
##
@@ -10972,6 +10976,11 @@ def split(str: "ColumnOrName", pattern: str, limit: 
int = -1) -> Column:
 .. versionchanged:: 3.0
`split` now takes an optional `limit` field. If not provided, 
default limit value is -1.
 
+.. versionchanged:: 4.0.0
+ `pattern` now accepts column. Does not accept column name since 
string type remain
+ accepted as a regular expression representation, for backwards 
compatibility.
+ In addition to int, `limit` now accepts column and column name.
+

Review Comment:
   please add more doctests in the `Examples` section to test the newly supported 
types.
   
   those doctests will automatically be reused in the Spark Connect Python Client.



##
python/pyspark/sql/functions/builtin.py:
##
@@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int 
= -1) -> Column:
 >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
 [Row(s=['one', 'two', 'three', ''])]
 """
-return _invoke_function("split", _to_java_column(str), pattern, limit)
+pattern = pattern if isinstance(pattern, Column) else lit(pattern)
+limit = lit(limit) if isinstance(limit, int) else limit

Review Comment:
   I think the `lit` function accepts both Column and int



##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -2476,8 +2476,26 @@ def repeat(col: "ColumnOrName", n: Union["ColumnOrName", 
int]) -> Column:
 repeat.__doc__ = pysparkfuncs.repeat.__doc__
 
 
-def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
-return _invoke_function("split", _to_col(str), lit(pattern), lit(limit))
+def split(
+str: "ColumnOrName",
+pattern: Union[Column, str],
+limit: Union["ColumnOrName", int] = -1,
+) -> Column:
+# work around shadowing of str in the input variable name
+from builtins import str as py_str
+
+if isinstance(pattern, py_str):
+_pattern = lit(pattern)
+elif isinstance(pattern, Column):
+_pattern = pattern
+else:
+raise PySparkTypeError(
+error_class="NOT_COLUMN_OR_STR",
+message_parameters={"arg_name": "pattern", "arg_type": 
type(pattern).__name__},
+)
+
+limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
+return _invoke_function("split", _to_col(str), _pattern, limit)

Review Comment:
   ```suggestion
   limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
   return _invoke_function("split", _to_col(str), lit(pattern), limit)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on PR #46045:
URL: https://github.com/apache/spark/pull/46045#issuecomment-2058120153

   also cc @HyukjinKwon and @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]

2024-04-15 Thread via GitHub


chenhao-db commented on PR #45934:
URL: https://github.com/apache/spark/pull/45934#issuecomment-2058116518

   @cloud-fan could you help merge it? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47867][SQL] Support variant in JSON scan. [spark]

2024-04-15 Thread via GitHub


chenhao-db opened a new pull request, #46071:
URL: https://github.com/apache/spark/pull/46071

   ### What changes were proposed in this pull request?
   
   This PR adds support for the variant type in the JSON scan.
   
   As part of this PR we introduce one new JSON option: 
`spark.read.format("json").option("singleVariantColumn", "colName")`. Setting 
this option specifies that each JSON document should be ingested into a single 
variant column called `colName`. When this option is used, the user must not 
specify a schema, and the schema is inferred as `colName variant`.
   
   ### Example 1 (multiple variant fields)
   
   JSON files can be ingested into variant fields, e.g.
   ```
   spark.read.format("json").schema("i int, var variant, arr 
ARRAY").load("a.json").show(false)
   ```
   for a file with the following data:
   ```
   {"i": 1, "var": {"d": "+94875-04-12", 
"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}, "arr": 
[{"a": 1}, {"b": 2}, {"c": 3, "d": [1, 2, 3]}]}
   {"i": 2, "var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": 
"value2"}}}
   {}
   {"i": 3}
   ```
   
   ### Example 2 (one variant field)
   
   Here's another example with a single variant field:
   ```
   spark.read.format("json").schema("var variant").load("a.json").show(false)
   ```
   for a file with the following data:
   ```
   {"var": {"d": "+94875-04-12", 
"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}}
   {"var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": 
"value2"}}}
   {}
   ```
   
   ### Example 3 (singleVariantColumn option)
   Each JSON document can also be ingested into a single variant column, e.g.
   ```
   spark.read.format("json").option("singleVariantColumn", 
"var").load("a.json").show(false)
   ```
   for a file with the following data:
   ```
   {"i": 1, "var": {"d": "+94875-04-12", 
"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}, "arr": 
[{"a": 1}, {"b": 2}, {"c": 3, "d": [1, 2, 3]}]}
   {"i": 2, "var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": 
"value2"}}}
   {}
   {"i": 3}
   ```
   
   ### Why are the changes needed?
   
   It allows Spark to ingest variant values directly from the JSON data source. 
Previously, the `parse_json` expression could only operate on a string column 
already present in an existing table.
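   
   As a hedged sketch of what this enables downstream (assumes an active `spark` 
session; `variant_get` comes from the related variant expression PRs, and the 
path/type arguments shown are assumptions, not something introduced here):
   
   ```
   val df = spark.read.format("json")
     .option("singleVariantColumn", "var")
     .load("a.json")
   
   // Extract a typed value back out of the ingested variant column.
   df.selectExpr("variant_get(var, '$.i', 'int') AS i").show(false)
   ```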
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   Unit tests that verify the result and error reporting in JSON scan.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2058107235

   > perfectly sync the data between two separately operating ThreadLocals, 
CurrentOrigin and PySparkCurrentOrigin.
   
   Why is that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


ueshin commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1566619646


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##
@@ -160,6 +160,8 @@ case class DataFrameQueryContext(
   val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("")
   val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("")
 
+  PySparkCurrentOrigin.clear()

Review Comment:
   I guess we should clear this right after it's retrieved?
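   
   Something like this, as a sketch (assuming `get()` is the accessor paired 
with `clear()`; the names are this comment's assumption):
   
   ```
   // Read the thread-local context first, then clear it immediately so stale
   // context cannot leak into the next expression.
   val pysparkErrorContext = PySparkCurrentOrigin.get()
   PySparkCurrentOrigin.clear()
   ```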



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-15 Thread via GitHub


HeartSaVioR closed pull request #45932: [SPARK-47673][SS] Implementing TTL for 
ListState
URL: https://github.com/apache/spark/pull/45932


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]

2024-04-15 Thread via GitHub


HeartSaVioR commented on PR #45932:
URL: https://github.com/apache/spark/pull/45932#issuecomment-2058102492

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


panbingkun commented on PR #46022:
URL: https://github.com/apache/spark/pull/46022#issuecomment-2058093392

   > @panbingkun Thanks for the work. LGTM except for some minor comments.
   
   Updated. Thank you for your review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47847][CORE] Deprecate spark.network.remoteReadNioBufferConversion [spark]

2024-04-15 Thread via GitHub


pan3793 commented on code in PR #46047:
URL: https://github.com/apache/spark/pull/46047#discussion_r1566609420


##
core/src/main/scala/org/apache/spark/SparkConf.scala:
##
@@ -640,7 +640,8 @@ private[spark] object SparkConf extends Logging {
   DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0",
 "Please use spark.excludeOnFailure.killExcludedExecutors"),
   
DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", 
"3.1.0",
-"Please use spark.yarn.executor.launch.excludeOnFailure.enabled")
+"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"),
+  DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2", 
"")

Review Comment:
   @dongjoon-hyun how should we fill in the message in this case? BTW, according 
to the comment, I think this is an internal configuration; should it go through 
the same deprecation procedure as user-facing configurations?
   
   > // SPARK-24307 undocumented "escape-hatch" in case there are any issues in 
converting to
   > // ChunkedByteBuffer, to go back to old code-path.  Can be removed post 
Spark 2.4 if
   > // new path is stable.
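   
   One possibility, purely as an illustration (the message wording below is 
hypothetical, not from this PR):
   
   ```
   // Hypothetical deprecation message (illustration only):
   DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2",
     "The SPARK-24307 escape-hatch is no longer needed; the ChunkedByteBuffer " +
       "code path has been the default and stable since Spark 2.4.")
   ```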



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-15 Thread via GitHub


gengliangwang closed pull request #45990: [SPARK-47804] Add Dataframe cache 
debug log
URL: https://github.com/apache/spark/pull/45990


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on PR #45990:
URL: https://github.com/apache/spark/pull/45990#issuecomment-2058080259

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on code in PR #45990:
URL: https://github.com/apache/spark/pull/45990#discussion_r1566605060


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1609,6 +1609,19 @@ object SQLConf {
 .checkValues(StorageLevelMapper.values.map(_.name()).toSet)
 .createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name())
 
+  val DATAFRAME_CACHE_LOG_LEVEL = 
buildConf("spark.sql.dataframeCache.logLevel")

Review Comment:
   RDD is a Spark core concept. Anyway I respect your choice here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47866][SQL][TESTS] Deflaky `PythonForeachWriterSuite` [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #46070:
URL: https://github.com/apache/spark/pull/46070

   ### What changes were proposed in this pull request?
   
   This PR aims to reduce the flakiness of `PythonForeachWriterSuite` in CIs by 
invoking `System.gc` explicitly before each test.
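   
   The change amounts to something like the following sketch (assuming a 
ScalaTest suite mixing in `BeforeAndAfterEach`; the actual wiring may differ):
   
   ```
   // Best-effort GC hint before each test to reduce memory-pressure flakiness.
   override def beforeEach(): Unit = {
     super.beforeEach()
     System.gc()
   }
   ```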
   
   ### Why are the changes needed?
   
   Currently, the flakiness happens in the Apple Silicon CI environment.
   - 
https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the CIs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-15 Thread via GitHub


CTCC1 commented on PR #46045:
URL: https://github.com/apache/spark/pull/46045#issuecomment-2058059890

   Actually my first PR here :)
   @zhengruifeng based on git blame you did something very similar before. Do 
you want to take a look? Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47862][PYTHON][CONNECT] Fix generation of proto files [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on code in PR #46068:
URL: https://github.com/apache/spark/pull/46068#discussion_r1566577876


##
dev/connect-gen-protos.sh:
##
@@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do
 done
 
 # Clean up everything.
-rm -Rf gen
+# rm -Rf gen

Review Comment:
   should we put this back?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-43394][BUILD] Upgrade maven to 3.8.8 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #41073:
URL: https://github.com/apache/spark/pull/41073#issuecomment-2058030002

   During the Apache Spark 3.4.2 RC1 release, I found that `build/mvn versions:set` 
could be flaky. Let me backport this to branch-3.4 because this is the last 
stable bug-fix version of the Apache Maven 3.8.x line.
   - 
https://github.com/apache/spark/commit/025af02862cd045a112feb0da9c7519b846593f6


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #46069:
URL: https://github.com/apache/spark/pull/46069

   ### What changes were proposed in this pull request?
   
   This PR aims to upgrade `Apache Maven` to 3.9.6.
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47739][SQL] Register logical avro type [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #45895:
URL: https://github.com/apache/spark/pull/45895#issuecomment-2058009387

   To @milastdbx , why did you request a review from me when you didn't address 
my comment?
   
   ![Screenshot 2024-04-15 at 17 03 
41](https://github.com/apache/spark/assets/9700541/518299f7-c6b9-48d5-bed0-e627a4341bd0)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47828][CONNECT][PYTHON][3.5] DataFrameWriterV2.overwrite fails with invalid plan [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on PR #46050:
URL: https://github.com/apache/spark/pull/46050#issuecomment-2058007822

   thank you @dongjoon-hyun and @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list [spark]

2024-04-15 Thread via GitHub


zhengruifeng commented on PR #46056:
URL: https://github.com/apache/spark/pull/46056#issuecomment-2058007355

   thank you @HyukjinKwon and @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47852][PYTHON] Support `DataFrameQueryContext` for reverse operations [spark]

2024-04-15 Thread via GitHub


itholic commented on PR #46053:
URL: https://github.com/apache/spark/pull/46053#issuecomment-2058007102

   This can be covered by https://github.com/apache/spark/pull/46063


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47828][CONNECT][PYTHON][3.5] DataFrameWriterV2.overwrite fails with invalid plan [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun closed pull request #46050: [SPARK-47828][CONNECT][PYTHON][3.5] 
DataFrameWriterV2.overwrite fails with invalid plan
URL: https://github.com/apache/spark/pull/46050


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47852][PYTHON] Support `DataFrameQueryContext` for reverse operations [spark]

2024-04-15 Thread via GitHub


itholic closed pull request #46053: [SPARK-47852][PYTHON] Support 
`DataFrameQueryContext` for reverse operations
URL: https://github.com/apache/spark/pull/46053


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46056:
URL: https://github.com/apache/spark/pull/46056#issuecomment-2057989862

   Merged to master. Thank you, @zhengruifeng and @HyukjinKwon .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun closed pull request #46056: [SPARK-47855][CONNECT] Add 
`spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list
URL: https://github.com/apache/spark/pull/46056


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-15 Thread via GitHub


sahnib commented on code in PR #46035:
URL: https://github.com/apache/spark/pull/46035#discussion_r1566536030


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -978,7 +985,14 @@ object FoldablePropagation extends Rule[LogicalPlan] {
   // propagating the foldable expressions.
   // TODO(cloud-fan): It seems more reasonable to use new attributes as 
the output attributes
   // of outer join.
-  case j: Join =>
+  // FoldablePropagation rule can produce incorrect optimized plan for 
streaming queries.
+  // This is because the optimizer can replace the grouping expressions, 
or join column
+  // with a literal value if the grouping key is constant for the 
micro-batch. However,
+  // as Streaming queries also read from the StateStore, this optimization 
also
+  // overwrites any keys read from State Store. We need to disable this 
optimization
+  // until we can make optimizer aware of Streaming state store. The State 
Store nodes
+  // are currently added in the Physical plan.
+  case j: Join if !j.left.isStreaming && !j.right.isStreaming =>

Review Comment:
   yeah, sorry about this. Fixed. 
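   
   To make the failure mode concrete, a minimal sketch (assumes a rate source 
and an active `spark` session; simplified):
   
   ```
   import org.apache.spark.sql.functions.{col, lit}
   
   val counts = spark.readStream.format("rate").load()
     .withColumn("key", lit("constant")) // foldable grouping key within a batch
     .groupBy(col("key")).count()
   // Without the isStreaming guard, FoldablePropagation could replace `key` with
   // the literal, clobbering keys restored from the state store across batches.
   ```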



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks [spark]

2024-04-15 Thread via GitHub


viirya commented on PR #45803:
URL: https://github.com/apache/spark/pull/45803#issuecomment-2057938423

   Thank you @hiboyang !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun closed pull request #46066: [SPARK-47860][BUILD][K8S] Upgrade 
`kubernetes-client` to 6.12.0
URL: https://github.com/apache/spark/pull/46066


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46066:
URL: https://github.com/apache/spark/pull/46066#issuecomment-2057913181

   Thank you, @huaxingao . Merged to master for Apache Spark 4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks [spark]

2024-04-15 Thread via GitHub


hiboyang commented on code in PR #45803:
URL: https://github.com/apache/spark/pull/45803#discussion_r1566498567


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -417,7 +418,7 @@ final class ShuffleBlockFetcherIterator(
 numBlocksToFetch += mergedBlockInfos.size
 localBlocks ++= mergedBlockInfos.map(info => (info.blockId, 
info.mapIndex))
 localBlockBytes += mergedBlockInfos.map(_.size).sum
-  } else if (blockManager.hostLocalDirManager.isDefined &&
+  } else if (doLocalDiskRead && blockManager.hostLocalDirManager.isDefined 
&&

Review Comment:
   Did a quick test by disabling `SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED`, 
looks good. Will close this PR.
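   
   For reference, the workaround verified above, i.e. disabling host-local disk 
reads so shuffle blocks are fetched over the network instead (a sketch):
   
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .config("spark.shuffle.readHostLocalDisk", "false")
     .getOrCreate()
   ```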



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks [spark]

2024-04-15 Thread via GitHub


hiboyang closed pull request #45803: [SPARK-47678][CORE]  Check 
`spark.shuffle.readHostLocalDisk` when reading shuffle blocks
URL: https://github.com/apache/spark/pull/45803


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46066:
URL: https://github.com/apache/spark/pull/46066#issuecomment-2057907006

   Could you review this K8s PR, @huaxingao ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]

2024-04-15 Thread via GitHub


dongjoon-hyun commented on PR #46066:
URL: https://github.com/apache/spark/pull/46066#issuecomment-2057906801

   All K8s-related unit tests and integration tests passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on code in PR #45990:
URL: https://github.com/apache/spark/pull/45990#discussion_r1566492674


##
common/utils/src/main/scala/org/apache/spark/internal/Logging.scala:
##
@@ -153,10 +153,42 @@ trait Logging {
 if (log.isDebugEnabled) log.debug(msg)
   }
 
+  protected def logDebug(entry: LogEntry): Unit = {
+if (log.isDebugEnabled) {
+  withLogContext(entry.context) {
+log.debug(entry.message)
+  }
+}
+  }
+
+  protected def logDebug(entry: LogEntry, throwable: Throwable): Unit = {

Review Comment:
   Let's skip this one if it is not needed.



##
common/utils/src/main/scala/org/apache/spark/internal/Logging.scala:
##
@@ -153,10 +153,42 @@ trait Logging {
 if (log.isDebugEnabled) log.debug(msg)
   }
 
+  protected def logDebug(entry: LogEntry): Unit = {
+if (log.isDebugEnabled) {
+  withLogContext(entry.context) {
+log.debug(entry.message)
+  }
+}
+  }
+
+  protected def logDebug(entry: LogEntry, throwable: Throwable): Unit = {
+if (log.isDebugEnabled) {
+  withLogContext(entry.context) {
+log.debug(entry.message, throwable)
+  }
+}
+  }
+
   protected def logTrace(msg: => String): Unit = {
 if (log.isTraceEnabled) log.trace(msg)
   }
 
+  protected def logTrace(entry: LogEntry): Unit = {
+if (log.isTraceEnabled) {
+  withLogContext(entry.context) {
+log.trace(entry.message)
+  }
+}
+  }
+
+  protected def logTrace(entry: LogEntry, throwable: Throwable): Unit = {

Review Comment:
   ditto
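   
   If kept, the throwable variants would be used like the existing `LogEntry` 
overloads, e.g. (a sketch; the `QUERY_PLAN` key and the surrounding variables 
are assumptions, not from this PR):
   
   ```
   logDebug(log"Recaching plan: ${MDC(QUERY_PLAN, plan.treeString)}", exception)
   ```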



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]

2024-04-15 Thread via GitHub


anchovYu commented on code in PR #45990:
URL: https://github.com/apache/spark/pull/45990#discussion_r1566490570


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1609,6 +1609,19 @@ object SQLConf {
 .checkValues(StorageLevelMapper.values.map(_.name()).toSet)
 .createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name())
 
+  val DATAFRAME_CACHE_LOG_LEVEL = 
buildConf("spark.sql.dataframeCache.logLevel")

Review Comment:
   I kept the Dataframe cache naming to differentiate it from the RDD cache.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on PR #46022:
URL: https://github.com/apache/spark/pull/46022#issuecomment-2057848786

   @panbingkun Thanks for the work. LGTM except for some minor comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on code in PR #46022:
URL: https://github.com/apache/spark/pull/46022#discussion_r1566462197


##
connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala:
##
@@ -325,7 +327,8 @@ private[spark] class DirectKafkaInputDStream[K, V](
 
 override def restore(): Unit = {
   batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
- logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", 
"]")}")
+ logInfo(log"Restoring KafkaRDD for time ${MDC(TIME, t)} " +

Review Comment:
   QQ: is the time here using ms?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on code in PR #46022:
URL: https://github.com/apache/spark/pull/46022#discussion_r1566459384


##
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala:
##
@@ -391,10 +392,12 @@ private[kafka010] class KafkaDataConsumer(
   .getOrElse("")
 val walTime = System.nanoTime() - startTimestampNano
 
-logInfo(
-  s"From Kafka $kafkaMeta read $totalRecordsRead records through $numPolls 
polls (polled " +
-  s" out $numRecordsPolled records), taking $totalTimeReadNanos nanos, 
during time span of " +
-  s"$walTime nanos."
+logInfo(log"From Kafka ${MDC(CONSUMER, kafkaMeta)} read " +
+  log"${MDC(TOTAL_RECORDS_READ, totalRecordsRead)} records through " +
+  log"${MDC(COUNT_POLL, numPolls)} polls " +
+  log"(polled out ${MDC(COUNT_RECORDS_POLL, numRecordsPolled)} records), " 
+

Review Comment:
   KAFKA_RECORDS_PULLED_COUNT



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


gengliangwang commented on code in PR #46022:
URL: https://github.com/apache/spark/pull/46022#discussion_r1566459174


##
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala:
##
@@ -391,10 +392,12 @@ private[kafka010] class KafkaDataConsumer(
   .getOrElse("")
 val walTime = System.nanoTime() - startTimestampNano
 
-logInfo(
-  s"From Kafka $kafkaMeta read $totalRecordsRead records through $numPolls 
polls (polled " +
-  s" out $numRecordsPolled records), taking $totalTimeReadNanos nanos, 
during time span of " +
-  s"$walTime nanos."
+logInfo(log"From Kafka ${MDC(CONSUMER, kafkaMeta)} read " +
+  log"${MDC(TOTAL_RECORDS_READ, totalRecordsRead)} records through " +
+  log"${MDC(COUNT_POLL, numPolls)} polls " +

Review Comment:
   KAFKA_PULLS_COUNT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


