Re: [PR] [SPARK-47821][SQL] Implement is_variant_null expression [spark]
harshmotw-db commented on PR #46011: URL: https://github.com/apache/spark/pull/46011#issuecomment-2058281936 @cloud-fan Resolved -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47420][SQL] Fix test output [spark]
cloud-fan closed pull request #46058: [SPARK-47420][SQL] Fix test output URL: https://github.com/apache/spark/pull/46058 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
zhengruifeng commented on code in PR #46045: URL: https://github.com/apache/spark/pull/46045#discussion_r1566741860 ## python/pyspark/sql/functions/builtin.py: ## @@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column: >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect() [Row(s=['one', 'two', 'three', ''])] """ -return _invoke_function("split", _to_java_column(str), pattern, limit) +pattern = pattern if isinstance(pattern, Column) else lit(pattern) +limit = lit(limit) if isinstance(limit, int) else limit Review Comment: yea, we can use it for `pattern` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47420][SQL] Fix test output [spark]
cloud-fan commented on PR #46058: URL: https://github.com/apache/spark/pull/46058#issuecomment-2058279111 the docker test failure is unrelated, merging to master, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47821][SQL] Implement is_variant_null expression [spark]
cloud-fan commented on PR #46011: URL: https://github.com/apache/spark/pull/46011#issuecomment-2058278099 there are code conflicts again... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]
wForget commented on PR #45589: URL: https://github.com/apache/spark/pull/45589#issuecomment-2058275927 > @wForget can you help to create a 3.5 backport PR? thanks! Sure, I will create it as soon as possible, and thanks for your review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]
cloud-fan closed pull request #45934: [SPARK-47769][SQL] Add schema_of_variant_agg expression. URL: https://github.com/apache/spark/pull/45934 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]
cloud-fan commented on PR #45934: URL: https://github.com/apache/spark/pull/45934#issuecomment-2058274239 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
cloud-fan commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1566736409 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR} + +/** + * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id` + * in join condition for better performance. + */ +object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( +_.containsPattern(JOIN), ruleId) { +case Join(left, right, joinType, condition, hint) if condition.nonEmpty => + val newCondition = condition.map(_.transformWithPruning(_.containsPattern(OR), ruleId) { +case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2))) + if (l.semanticEquals(c1) && r.semanticEquals(c2)) +|| (l.semanticEquals(c2) && r.semanticEquals(c1)) => + EqualNullSafe(l, r) +case Or(And(IsNull(c1), IsNull(c2)), EqualTo(l, r)) + if (l.semanticEquals(c1) && r.semanticEquals(c2)) +|| (l.semanticEquals(c2) && r.semanticEquals(c1)) => + EqualNullSafe(l, r) + }) + Join(left, right, joinType, newCondition, hint) Review Comment: nit: `j.copy(condition = newCondition)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]
cloud-fan commented on PR #45589: URL: https://github.com/apache/spark/pull/45589#issuecomment-2058271672 @wForget can you help to create a 3.5 backport PR? thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]
cloud-fan closed pull request #45589: [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean URL: https://github.com/apache/spark/pull/45589 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]
cloud-fan commented on PR #45589: URL: https://github.com/apache/spark/pull/45589#issuecomment-2058270805 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
cloud-fan commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566734238 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: can we look into it? `DataFrameQueryContext` has nothing to do with `PySparkCurrentOrigin`, as all the values are materialized. It seems we rely on weird timing of when `DataFrameQueryContext` is instantiated, which is fragile. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1565663205 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/origin.scala: ## @@ -76,6 +76,12 @@ object CurrentOrigin { value.get.copy(line = Some(line), startPosition = Some(start))) } + def setPySparkErrorContext(pysparkFragment: String, pysparkCallSite: String): Unit = { +val tupleInfo = (pysparkFragment, pysparkCallSite) +value.set( + value.get.copy(pysparkErrorContext = Some(tupleInfo))) + } + Review Comment: cc @HyukjinKwon @cloud-fan Try reusing the existing `CurrentOrigin` here so we don't need to add separate ThreadLocal and also don't need to add helper functions for each existing methods. Some test cases are still need to be addressed, but I believe the overall structure is ready to review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46350][SS] Fix state removal for stream-stream join with one watermark and one time-interval condition [spark]
rangadi commented on code in PR #44323: URL: https://github.com/apache/spark/pull/44323#discussion_r1566712758 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging { attributesWithEventWatermark = AttributeSet(otherSideInputAttributes), condition, eventTimeWatermarkForEviction) -val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey)) -val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark) -expr.map(JoinStateValueWatermarkPredicate.apply _) +// For example, if the condition is of the form: +//left_time > right_time + INTERVAL 30 MINUTES +// Then this extracts left_time and right_time. +val attributesInCondition = AttributeSet( + condition.get.collect { case a: AttributeReference => a } +) + +// Construct an AttributeSet so that we can perform equality between attributes, +// which we do in the filter below. +val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes) + +// oneSideInputAttributes could be [left_value, left_time], and we just +// want the attribute _in_ the time-interval condition. +val oneSideStateWatermarkAttributes = attributesInCondition.filter { a => +oneSideInputAttributeSet.contains(a) Review Comment: What ensures this is the event-time attribute? Is this assured to be `left_time` mentioned in the comment? ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging { attributesWithEventWatermark = AttributeSet(otherSideInputAttributes), condition, eventTimeWatermarkForEviction) -val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey)) -val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark) -expr.map(JoinStateValueWatermarkPredicate.apply _) +// For example, if the condition is of the form: +//left_time > right_time + INTERVAL 30 MINUTES +// Then this extracts left_time and right_time. +val attributesInCondition = AttributeSet( + condition.get.collect { case a: AttributeReference => a } +) + +// Construct an AttributeSet so that we can perform equality between attributes, +// which we do in the filter below. +val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes) + +// oneSideInputAttributes could be [left_value, left_time], and we just +// want the attribute _in_ the time-interval condition. +val oneSideStateWatermarkAttributes = attributesInCondition.filter { a => +oneSideInputAttributeSet.contains(a) +} + +// There should be a single attribute per side in the time-interval condition, so, Review Comment: Where is this part ensured? ('single attribute per side'). ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging { attributesWithEventWatermark = AttributeSet(otherSideInputAttributes), condition, eventTimeWatermarkForEviction) -val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey)) Review Comment: Where is the equivalent of this statement? This looks like the event-time attribute from one side (left side in examples). ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala: ## @@ -219,10 +222,35 @@ object StreamingSymmetricHashJoinHelper extends Logging { attributesWithEventWatermark = AttributeSet(otherSideInputAttributes), condition, eventTimeWatermarkForEviction) -val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey)) -val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark) -expr.map(JoinStateValueWatermarkPredicate.apply _) +// For example, if the condition is of the form: +//left_time > right_time + INTERVAL 30 MINUTES +// Then this extracts left_time and right_time. +val attributesInCondition = AttributeSet( + condition.get.collect { case a: AttributeReference => a } +) + +// Construct an AttributeSet so that we can perform equality between attributes, +// which we do in the filter below. +val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes) + +
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566715537 ## python/pyspark/errors/utils.py: ## @@ -119,3 +124,59 @@ def get_message_template(self, error_class: str) -> str: message_template = main_message_template + " " + sub_message_template return message_template + + +def _capture_call_site(fragment: str) -> None: +""" +Capture the call site information including file name, line number, and function name. +This function updates the thread-local storage from server side (PySparkCurrentOrigin) Review Comment: I wanted to mention JVM side. Let me clarify it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566715097 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -1319,6 +1289,9 @@ def test_dataframe_error_context(self): class DataFrameTests(DataFrameTestsMixin, ReusedSQLTestCase): +def test_dataframe_error_context(self): Review Comment: Will remove. It was added for local testing purpose -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [spark]
zhengruifeng commented on code in PR #46012: URL: https://github.com/apache/spark/pull/46012#discussion_r1566710260 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala: ## @@ -381,6 +405,53 @@ case class SessionHolder(userId: String, sessionId: String, session: SparkSessio */ private[connect] val pythonAccumulator: Option[PythonAccumulator] = Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption + + /** + * Transform a relation into a logical plan, using the plan cache if enabled. + * The plan cache is enable only if `spark.connect.session.planCache.maxSize` is greater than zero + * AND `spark.connect.session.planCache.enabled` is true. + * @param rel The relation to transform. + * @param cachePlan Whether to cache the result logical plan. + * @param transform Function to transform the relation into a logical plan. + * @return The logical plan. + */ + private[connect] def usePlanCache(rel: proto.Relation, cachePlan: Boolean)( Review Comment: yea, I forgot the new `plan_id`. Then it's fine -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
zhengruifeng commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566707884 ## python/pyspark/errors/utils.py: ## @@ -119,3 +124,59 @@ def get_message_template(self, error_class: str) -> str: message_template = main_message_template + " " + sub_message_template return message_template + + +def _capture_call_site(fragment: str) -> None: +""" +Capture the call site information including file name, line number, and function name. +This function updates the thread-local storage from server side (PySparkCurrentOrigin) Review Comment: > server side sorry, probably I don't have enough context. Does this implementation also support spark connect? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
zhengruifeng commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566706967 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -1319,6 +1289,9 @@ def test_dataframe_error_context(self): class DataFrameTests(DataFrameTestsMixin, ReusedSQLTestCase): +def test_dataframe_error_context(self): Review Comment: why we need this test? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]
viirya commented on PR #3: URL: https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2058226232 The vote was passed today. I created `kubernetes-operator-0.1.0` in Spark JIRA. All Spark k8s operator related JIRA tickets can use this version now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
zhengruifeng commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566705061 ## python/pyspark/errors/utils.py: ## @@ -119,3 +124,59 @@ def get_message_template(self, error_class: str) -> str: message_template = main_message_template + " " + sub_message_template return message_template + + +def _capture_call_site(fragment: str) -> None: +""" +Capture the call site information including file name, line number, and function name. +This function updates the thread-local storage from server side (PySparkCurrentOrigin) +with the current call site information when a PySpark API function is called. +Parameters +-- +func_name : str Review Comment: ```suggestion fragment : str ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]
wForget commented on code in PR #45589: URL: https://github.com/apache/spark/pull/45589#discussion_r1566701178 ## sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala: ## @@ -966,6 +966,22 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS ) } } + + test("SPARK-47463: Pushed down v2 filter with if expression") { +withTempView("t1") { + spark.read.format(classOf[AdvancedDataSourceV2WithV2Filter].getName).load() +.createTempView("t1") + val df1 = sql( +s""" + |select * from + |(select if(i = 1, i, 0) as c from t1) t + |where t.c > 0 Review Comment: > I think `where if(i = 1, i, 0) > 0` is a valid SQL? BTW, please upper case the SQL keywords in the SQL statement. Indeed, thank you. I have changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [DO-NOT-REVIEW] Speed up test_parity_listener [spark]
WweiL opened a new pull request, #46072: URL: https://github.com/apache/spark/pull/46072 This PR makes test_parity_listener run faster. The test was slow because of `TestListenerSparkV1` and `TestListenerSparkV2` makes server calls and has long wait time, and the test runs on both listeners. They were created to verify the listener function with and without the new `onQueryIdle` callback. This PR fixes the slowness by removing the V1 and V2 of that listener (now only a `TestListenerSpark`), and create lightweight `TestListenerLocalV1` and `TestListenerLocalV2` for the `onQueryIdle` verification. ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files [spark]
HyukjinKwon closed pull request #46068: [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files URL: https://github.com/apache/spark/pull/46068 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files [spark]
HyukjinKwon commented on PR #46068: URL: https://github.com/apache/spark/pull/46068#issuecomment-2058191383 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566678165 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: I tried get and clear at the first time, but it seems not working expectedly: ```scala def getAndClear(): Option[(String, String)] = { val context = pysparkErrorContext.get() // Clear the context after retrieving it pysparkErrorContext.remove() context } ``` The error context is missing in somewhere, so it raises exception: ``` Expected PySpark fragment was 'divide', got '' '' != 'divide' Expected :'divide' Actual :'' ``` So I put the clear here right after getting the `pysparkFragment` and `pysparkCallSite`, it works as expectedly. I guess the item is removed when clear the ThreadLocal? ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: I tried get and clear at the first time, but it seems not working expectedly: ```scala def getAndClear(): Option[(String, String)] = { val context = pysparkErrorContext.get() // Clear the context after retrieving it pysparkErrorContext.remove() context } ``` The error context is missing in somewhere, so it raises exception: ``` Expected PySpark fragment was 'divide', got '' '' != 'divide' Expected :'divide' Actual :'' ``` So I put the clear here right after getting the `pysparkFragment` and `pysparkCallSite`, it works as expectedly. I guess the item is removed when clearing the ThreadLocal? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files [spark]
grundprinzip commented on code in PR #46068: URL: https://github.com/apache/spark/pull/46068#discussion_r158212 ## dev/connect-gen-protos.sh: ## @@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do done # Clean up everything. -rm -Rf gen +# rm -Rf gen Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47857][SQL] Utilize `java.sql.RowId.getBytes` API directly for UTF8String [spark]
dongjoon-hyun commented on code in PR #46062: URL: https://github.com/apache/spark/pull/46062#discussion_r157372 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -467,12 +467,8 @@ object JdbcUtils extends Logging with SQLConfHelper { case StringType if metadata.contains("rowid") => (rs: ResultSet, row: InternalRow, pos: Int) => -val rawRowId = rs.getRowId(pos + 1) -if (rawRowId == null) { - row.update(pos, null) -} else { - row.update(pos, UTF8String.fromString(rawRowId.toString)) -} +val id = nullSafeConvert[RowId](rs.getRowId(pos + 1), r => UTF8String.fromBytes(r.getBytes)) Review Comment: Got it. Thank you for the further investigation to make this sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47081][CONNECT][FOLLOW] Unflake Progress Execution [spark]
grundprinzip commented on PR #46060: URL: https://github.com/apache/spark/pull/46060#issuecomment-2058171185 > LGTM - pending tests. In my manual testing, the issue arises simply when running our existing tests and the tests will fail in the python client code in `reattach.py` that the `result_complete` method is not the last one in the stream. This behavior is fixed by ``` /** * Atomically submits a response and marks the stream as completed. */ def onNextComplete(r: T): Unit = responseLock.synchronized { if (!tryOnNext(r)) { throw new IllegalStateException("Stream onNext can't be called after stream completed") } onCompleted() } ``` Previously, these were two calls that are individually guarded, but now they're guarded together. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 [spark]
dongjoon-hyun closed pull request #46069: [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 URL: https://github.com/apache/spark/pull/46069 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 [spark]
dongjoon-hyun commented on PR #46069: URL: https://github.com/apache/spark/pull/46069#issuecomment-2058170454 Thank you, @yaooqinn ! Merged to branch-3.5 for Apache Spark 3.5.2+ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 [spark]
dongjoon-hyun closed pull request #46067: [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 URL: https://github.com/apache/spark/pull/46067 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 [spark]
dongjoon-hyun commented on PR #46067: URL: https://github.com/apache/spark/pull/46067#issuecomment-2058169845 Thank you, @yaooqinn ! Merged to master for Apache Spark 4.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files [spark]
grundprinzip commented on code in PR #46068: URL: https://github.com/apache/spark/pull/46068#discussion_r154745 ## dev/connect-gen-protos.sh: ## @@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do done # Clean up everything. -rm -Rf gen +# rm -Rf gen Review Comment: ```suggestion rm -Rf gen ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]
HeartSaVioR closed pull request #46035: [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes URL: https://github.com/apache/spark/pull/46035 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]
HeartSaVioR commented on PR #46035: URL: https://github.com/apache/spark/pull/46035#issuecomment-2058168299 Thanks! Merging to master/3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47857][SQL] Utilize `java.sql.RowId.getBytes` API directly for UTF8String [spark]
yaooqinn commented on code in PR #46062: URL: https://github.com/apache/spark/pull/46062#discussion_r150692 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -467,12 +467,8 @@ object JdbcUtils extends Logging with SQLConfHelper { case StringType if metadata.contains("rowid") => (rs: ResultSet, row: InternalRow, pos: Int) => -val rawRowId = rs.getRowId(pos + 1) -if (rawRowId == null) { - row.update(pos, null) -} else { - row.update(pos, UTF8String.fromString(rawRowId.toString)) -} +val id = nullSafeConvert[RowId](rs.getRowId(pos + 1), r => UTF8String.fromBytes(r.getBytes)) Review Comment: When I say 'practical', it means RowIds are alphanumerics -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files [spark]
grundprinzip commented on code in PR #46068: URL: https://github.com/apache/spark/pull/46068#discussion_r150150 ## dev/connect-gen-protos.sh: ## @@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do done # Clean up everything. -rm -Rf gen +# rm -Rf gen Review Comment: yes, my bad. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on code in PR #46022: URL: https://github.com/apache/spark/pull/46022#discussion_r150002 ## connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala: ## @@ -325,7 +327,8 @@ private[spark] class DirectKafkaInputDStream[K, V]( override def restore(): Unit = { batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) => - logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}") + logInfo(log"Restoring KafkaRDD for time ${MDC(TIME, t)} " + Review Comment: Yeah, Here `t` is an instance of the `Time`, and `Time` defaults to outputting a time unit of `ms`, as follows: https://github.com/apache/spark/blob/e815012f26ab305030b170eb2f0aa28d2de843b6/streaming/src/main/scala/org/apache/spark/streaming/Time.scala#L87 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47856][SQL] Document Mapping Spark SQL Data Types from Oracle and add tests [spark]
yaooqinn commented on PR #46059: URL: https://github.com/apache/spark/pull/46059#issuecomment-2058160869 Thank you @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47233][CONNECT][SS][2/2] Client & Server logic for Client side streaming query listener [spark]
HyukjinKwon closed pull request #46037: [SPARK-47233][CONNECT][SS][2/2] Client & Server logic for Client side streaming query listener URL: https://github.com/apache/spark/pull/46037 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47233][CONNECT][SS][2/2] Client & Server logic for Client side streaming query listener [spark]
HyukjinKwon commented on PR #46037: URL: https://github.com/apache/spark/pull/46037#issuecomment-2058156554 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]
HeartSaVioR commented on PR #46035: URL: https://github.com/apache/spark/pull/46035#issuecomment-2058156425 @cloud-fan Could you please have a quick look at the change? I reviewed the test suite. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47371] [SQL] XML: Ignore row tags found in CDATA [spark]
HyukjinKwon closed pull request #45487: [SPARK-47371] [SQL] XML: Ignore row tags found in CDATA URL: https://github.com/apache/spark/pull/45487 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47371] [SQL] XML: Ignore row tags found in CDATA [spark]
HyukjinKwon commented on PR #45487: URL: https://github.com/apache/spark/pull/45487#issuecomment-2058155599 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47857][SQL] Utilize `java.sql.RowId.getBytes` API directly for UTF8String [spark]
yaooqinn commented on code in PR #46062: URL: https://github.com/apache/spark/pull/46062#discussion_r1566653317 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -467,12 +467,8 @@ object JdbcUtils extends Logging with SQLConfHelper { case StringType if metadata.contains("rowid") => (rs: ResultSet, row: InternalRow, pos: Int) => -val rawRowId = rs.getRowId(pos + 1) -if (rawRowId == null) { - row.update(pos, null) -} else { - row.update(pos, UTF8String.fromString(rawRowId.toString)) -} +val id = nullSafeConvert[RowId](rs.getRowId(pos + 1), r => UTF8String.fromBytes(r.getBytes)) Review Comment: The assumption might be practical as JDBC clients use the platform's default charset. However, I guess we cannot make such an assumption. I might close this first to see if we can apply the charset suitably for the client-server encode conversion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]
dongjoon-hyun commented on PR #46070: URL: https://github.com/apache/spark/pull/46070#issuecomment-2058151957 Thank you, @HyukjinKwon ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]
HyukjinKwon closed pull request #46070: [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` URL: https://github.com/apache/spark/pull/46070 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]
HyukjinKwon commented on PR #46070: URL: https://github.com/apache/spark/pull/46070#issuecomment-2058150429 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]
itholic commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-2058148979 Because I called `PySparkCurrentOrigin` directly on the `DataFrameQueryContext` without utilizing `withOrigin` in the initial implementation. I realized it from recent review from the refactoring PR, so I'm currently trying to reintroduce `PySparkCurrentOrigin` there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47866][SQL][TESTS] Use explicit GC in `PythonForeachWriterSuite` [spark]
dongjoon-hyun commented on PR #46070: URL: https://github.com/apache/spark/pull/46070#issuecomment-2058148483 Could you review this PR, @HyukjinKwon ? This is a best-effort approach to mitigate Apple Silicon CI flakiness issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47861][BUILD] Upgrade `slf4j` to 2.0.13 [spark]
dongjoon-hyun commented on PR #46067: URL: https://github.com/apache/spark/pull/46067#issuecomment-2058142527 Could you review this `slf4j` dependency PR when you have some time, @viirya ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
CTCC1 commented on code in PR #46045: URL: https://github.com/apache/spark/pull/46045#discussion_r1566639687 ## python/pyspark/sql/functions/builtin.py: ## @@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column: >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect() [Row(s=['one', 'two', 'three', ''])] """ -return _invoke_function("split", _to_java_column(str), pattern, limit) +pattern = pattern if isinstance(pattern, Column) else lit(pattern) +limit = lit(limit) if isinstance(limit, int) else limit Review Comment: `limit` can also be of type `str` referring to a column, hence the check to avoid making the column name a string literal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47833][SQL][CORE] Supply caller stackstrace for checkAndGlobPathIfNecessary AnalysisException [spark]
pan3793 commented on PR #46028: URL: https://github.com/apache/spark/pull/46028#issuecomment-2058132732 cc @gengliangwang @LuciferYang @mridulm WDYT of this approach for stacktrace enhancement? Or do you have other suggestions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46935][DOCS] Consolidate error documentation [spark]
cloud-fan commented on PR #44971: URL: https://github.com/apache/spark/pull/44971#issuecomment-2058128551 let's fix conflicts and move forward, thanks for the work! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47767][SQL] Show offset value in TakeOrderedAndProjectExec [spark]
guixiaowen commented on code in PR #45931: URL: https://github.com/apache/spark/pull/45931#discussion_r1566636044 ## sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala: ## @@ -358,7 +358,9 @@ case class TakeOrderedAndProjectExec( val orderByString = truncatedString(sortOrder, "[", ",", "]", maxFields) val outputString = truncatedString(output, "[", ",", "]", maxFields) -s"TakeOrderedAndProject(limit=$limit, orderBy=$orderByString, output=$outputString)" +val offsetStr = if (offset > 0) { s" offset=$offset, " } else { "" } Review Comment: > I assume we only need to skip when offset = 0? @amaliujia Thank you for your review. I made some modifications. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46810][DOCS] Align error class terminology with SQL standard [spark]
cloud-fan closed pull request #44902: [SPARK-46810][DOCS] Align error class terminology with SQL standard URL: https://github.com/apache/spark/pull/44902 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46810][DOCS] Align error class terminology with SQL standard [spark]
cloud-fan commented on PR #44902: URL: https://github.com/apache/spark/pull/44902#issuecomment-2058126794 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
zhengruifeng commented on code in PR #46045: URL: https://github.com/apache/spark/pull/46045#discussion_r1566629222 ## python/pyspark/sql/functions/builtin.py: ## @@ -10972,6 +10976,11 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column: .. versionchanged:: 3.0 `split` now takes an optional `limit` field. If not provided, default limit value is -1. +.. versionchanged:: 4.0.0 + `pattern` now accepts column. Does not accept column name since string type remain + accepted as a regular expression representation, for backwards compatibility. + In addition to int, `limit` now accepts column and column name. + Review Comment: please add more doctest in the `Examples` section to test the new supported types those doctests will automatically be reused in Spark Connect Python Client. ## python/pyspark/sql/functions/builtin.py: ## @@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column: >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect() [Row(s=['one', 'two', 'three', ''])] """ -return _invoke_function("split", _to_java_column(str), pattern, limit) +pattern = pattern if isinstance(pattern, Column) else lit(pattern) +limit = lit(limit) if isinstance(limit, int) else limit Review Comment: I think `lit` function accept both Column and int ## python/pyspark/sql/connect/functions/builtin.py: ## @@ -2476,8 +2476,26 @@ def repeat(col: "ColumnOrName", n: Union["ColumnOrName", int]) -> Column: repeat.__doc__ = pysparkfuncs.repeat.__doc__ -def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column: -return _invoke_function("split", _to_col(str), lit(pattern), lit(limit)) +def split( +str: "ColumnOrName", +pattern: Union[Column, str], +limit: Union["ColumnOrName", int] = -1, +) -> Column: +# work around shadowing of str in the input variable name +from builtins import str as py_str + +if isinstance(pattern, py_str): +_pattern = lit(pattern) +elif isinstance(pattern, Column): +_pattern = pattern +else: +raise PySparkTypeError( +error_class="NOT_COLUMN_OR_STR", +message_parameters={"arg_name": "pattern", "arg_type": type(pattern).__name__}, +) + +limit = lit(limit) if isinstance(limit, int) else _to_col(limit) +return _invoke_function("split", _to_col(str), _pattern, limit) Review Comment: ```suggestion limit = lit(limit) if isinstance(limit, int) else _to_col(limit) return _invoke_function("split", _to_col(str), lit(pattern), limit) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
zhengruifeng commented on PR #46045: URL: https://github.com/apache/spark/pull/46045#issuecomment-2058120153 also cc @HyukjinKwon and @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]
chenhao-db commented on PR #45934: URL: https://github.com/apache/spark/pull/45934#issuecomment-2058116518 @cloud-fan could you help merge it? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47867][SQL] Support variant in JSON scan. [spark]
chenhao-db opened a new pull request, #46071: URL: https://github.com/apache/spark/pull/46071 ### What changes were proposed in this pull request? This PR adds support for the variant type in the JSON scan. As part of this PR we introduce one new JSON option: `spark.read.format("json").option("singleVariantColumn", "colName")`. Setting this option specifies that each JSON document should be ingested into a single variant column called `colName`. When this option is used, the user must not specify a schema, and the schema is inferred as `colName variant`. ### Example 1 (multiple variant fields) JSON files can be ingested into variant fields, e.g. ``` spark.read.format("json").schema("i int, var variant, arr ARRAY").load("a.json").show(false) ``` for a file with the following data: ``` {"i": 1, "var": {"d": "+94875-04-12", "string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}, "arr": [{"a": 1}, {"b": 2}, {"c": 3, "d": [1, 2, 3]}]} {"i": 2, "var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}} {} {"i": 3} ``` ### Example 2 (one variant field) Here's another example with a single variant field: ``` spark.read.format("json").schema("var variant").load("a.json").show(false) ``` for a file with the following data: ``` {"var": {"d": "+94875-04-12", "string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}} {"var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}} {} ``` ### Example 3 (singleVariantColumn option) Each JSON document can also be ingested into a single variant column, e.g. ``` spark.read.format("json").option("singleVariantColumn", "var").load("a.json").show(false) ``` for a file with the following data: ``` {"i": 1, "var": {"d": "+94875-04-12", "string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}, "arr": [{"a": 1}, {"b": 2}, {"c": 3, "d": [1, 2, 3]}]} {"i": 2, "var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}} {} {"i": 3} ``` ### Why are the changes needed? It allows Spark to ingest variant values directly from the JSON data source. Previously, the `parse_json` expression can only operate on a string column that is already in an existing table. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Unit tests that verify the result and error reporting in JSON scan. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]
HyukjinKwon commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-2058107235 > perfectly sync the data between two separately operating TheadLocal, CurrentOrigin and PySparkCurrentOrigin. Why is that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
ueshin commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1566619646 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: I guess we should clear this right after it's gotten? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]
HeartSaVioR closed pull request #45932: [SPARK-47673][SS] Implementing TTL for ListState URL: https://github.com/apache/spark/pull/45932 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47673][SS] Implementing TTL for ListState [spark]
HeartSaVioR commented on PR #45932: URL: https://github.com/apache/spark/pull/45932#issuecomment-2058102492 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
panbingkun commented on PR #46022: URL: https://github.com/apache/spark/pull/46022#issuecomment-2058093392 > @panbingkun Thanks for the works. LGTM except for some minor comments. Updated. Thank you for your review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47847][CORE] Deprecate spark.network.remoteReadNioBufferConversion [spark]
pan3793 commented on code in PR #46047: URL: https://github.com/apache/spark/pull/46047#discussion_r1566609420 ## core/src/main/scala/org/apache/spark/SparkConf.scala: ## @@ -640,7 +640,8 @@ private[spark] object SparkConf extends Logging { DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0", "Please use spark.excludeOnFailure.killExcludedExecutors"), DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", "3.1.0", -"Please use spark.yarn.executor.launch.excludeOnFailure.enabled") +"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"), + DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2", "") Review Comment: @dongjoon-hyun how should we fill the message in this case? BTW, according to the comment, I think this is an internal configuration, should we go through the deprecated procedure the same as user-facing configurations too? > // SPARK-24307 undocumented "escape-hatch" in case there are any issues in converting to > // ChunkedByteBuffer, to go back to old code-path. Can be removed post Spark 2.4 if > // new path is stable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]
gengliangwang closed pull request #45990: [SPARK-47804] Add Dataframe cache debug log URL: https://github.com/apache/spark/pull/45990 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]
gengliangwang commented on PR #45990: URL: https://github.com/apache/spark/pull/45990#issuecomment-2058080259 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]
gengliangwang commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1566605060 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -1609,6 +1609,19 @@ object SQLConf { .checkValues(StorageLevelMapper.values.map(_.name()).toSet) .createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name()) + val DATAFRAME_CACHE_LOG_LEVEL = buildConf("spark.sql.dataframeCache.logLevel") Review Comment: RDD is a Spark core concept. Anyway I respect your choice here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47866][SQL][TESTS] Deflaky `PythonForeachWriterSuite` [spark]
dongjoon-hyun opened a new pull request, #46070: URL: https://github.com/apache/spark/pull/46070 ### What changes were proposed in this pull request? This PR aims to reduce the flakiness of `PythonForeachWriterSuite` in CIs by invoking `System.gc` explicitly before each test. ### Why are the changes needed? Currently, the flakiness happens in Apple Silicon CI environment. - https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
CTCC1 commented on PR #46045: URL: https://github.com/apache/spark/pull/46045#issuecomment-2058059890 Actually my first PR here :) @zhengruifeng based on git blame you did something very similar before. Do you want to take a look? Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47862][PYTHON][CONNECT]Fix generation of proto files [spark]
HyukjinKwon commented on code in PR #46068: URL: https://github.com/apache/spark/pull/46068#discussion_r1566577876 ## dev/connect-gen-protos.sh: ## @@ -97,4 +100,4 @@ for f in `find gen/proto/python -name "*.py*"`; do done # Clean up everything. -rm -Rf gen +# rm -Rf gen Review Comment: should we remove this back? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43394][BUILD] Upgrade maven to 3.8.8 [spark]
dongjoon-hyun commented on PR #41073: URL: https://github.com/apache/spark/pull/41073#issuecomment-2058030002 During Apache Spark 3.4.2 RC1 release, I found that `build/mvn versions:set` could be flaky. Let me backport this to branch-3.4 because this is the last and stable bug fix version of Apache Maven 3.8.x line. - https://github.com/apache/spark/commit/025af02862cd045a112feb0da9c7519b846593f6 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-46335][BUILD][3.5] Upgrade Maven to 3.9.6 [spark]
dongjoon-hyun opened a new pull request, #46069: URL: https://github.com/apache/spark/pull/46069 ### What changes were proposed in this pull request? This PR aims to upgrade `Apache Maven` to 3.9.6. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47739][SQL] Register logical avro type [spark]
dongjoon-hyun commented on PR #45895: URL: https://github.com/apache/spark/pull/45895#issuecomment-2058009387 To @milastdbx , why did you request a review to me when you didn't address my comment? ![Screenshot 2024-04-15 at 17 03 41](https://github.com/apache/spark/assets/9700541/518299f7-c6b9-48d5-bed0-e627a4341bd0) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47828][CONNECT][PYTHON][3.5] DataFrameWriterV2.overwrite fails with invalid plan [spark]
zhengruifeng commented on PR #46050: URL: https://github.com/apache/spark/pull/46050#issuecomment-2058007822 thank you @dongjoon-hyun and @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list [spark]
zhengruifeng commented on PR #46056: URL: https://github.com/apache/spark/pull/46056#issuecomment-2058007355 thank you @HyukjinKwon and @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47852][PYTHON] Support `DataFrameQueryContext` for reverse operations [spark]
itholic commented on PR #46053: URL: https://github.com/apache/spark/pull/46053#issuecomment-2058007102 This can be covered by https://github.com/apache/spark/pull/46063 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47828][CONNECT][PYTHON][3.5] DataFrameWriterV2.overwrite fails with invalid plan [spark]
dongjoon-hyun closed pull request #46050: [SPARK-47828][CONNECT][PYTHON][3.5] DataFrameWriterV2.overwrite fails with invalid plan URL: https://github.com/apache/spark/pull/46050 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47852][PYTHON] Support `DataFrameQueryContext` for reverse operations [spark]
itholic closed pull request #46053: [SPARK-47852][PYTHON] Support `DataFrameQueryContext` for reverse operations URL: https://github.com/apache/spark/pull/46053 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list [spark]
dongjoon-hyun commented on PR #46056: URL: https://github.com/apache/spark/pull/46056#issuecomment-2057989862 Merged to master. Thank you, @zhengruifeng and @HyukjinKwon . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list [spark]
dongjoon-hyun closed pull request #46056: [SPARK-47855][CONNECT] Add `spark.sql.execution.arrow.pyspark.fallback.enabled` in the unsupported list URL: https://github.com/apache/spark/pull/46056 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]
sahnib commented on code in PR #46035: URL: https://github.com/apache/spark/pull/46035#discussion_r1566536030 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -978,7 +985,14 @@ object FoldablePropagation extends Rule[LogicalPlan] { // propagating the foldable expressions. // TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes // of outer join. - case j: Join => + // FoldablePropagation rule can produce incorrect optimized plan for streaming queries. + // This is because the optimizer can replace the grouping expressions, or join column + // with a literal value if the grouping key is constant for the micro-batch. However, + // as Streaming queries also read from the StateStore, this optimization also + // overwrites any keys read from State Store. We need to disable this optimization + // until we can make optimizer aware of Streaming state store. The State Store nodes + // are currently added in the Physical plan. + case j: Join if !j.left.isStreaming && !j.right.isStreaming => Review Comment: yeah, sorry about this. Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks [spark]
viirya commented on PR #45803: URL: https://github.com/apache/spark/pull/45803#issuecomment-2057938423 Thank you @hiboyang ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]
dongjoon-hyun closed pull request #46066: [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 URL: https://github.com/apache/spark/pull/46066 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]
dongjoon-hyun commented on PR #46066: URL: https://github.com/apache/spark/pull/46066#issuecomment-2057913181 Thank you, @huaxingao . Merged to master for Apache Spark 4.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks [spark]
hiboyang commented on code in PR #45803: URL: https://github.com/apache/spark/pull/45803#discussion_r1566498567 ## core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala: ## @@ -417,7 +418,7 @@ final class ShuffleBlockFetcherIterator( numBlocksToFetch += mergedBlockInfos.size localBlocks ++= mergedBlockInfos.map(info => (info.blockId, info.mapIndex)) localBlockBytes += mergedBlockInfos.map(_.size).sum - } else if (blockManager.hostLocalDirManager.isDefined && + } else if (doLocalDiskRead && blockManager.hostLocalDirManager.isDefined && Review Comment: Did a quick test by disabling `SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED`, looks good. Will close this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks [spark]
hiboyang closed pull request #45803: [SPARK-47678][CORE] Check `spark.shuffle.readHostLocalDisk` when reading shuffle blocks URL: https://github.com/apache/spark/pull/45803 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]
dongjoon-hyun commented on PR #46066: URL: https://github.com/apache/spark/pull/46066#issuecomment-2057907006 Could you review this K8s PR, @huaxingao ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47860][BUILD][K8S] Upgrade `kubernetes-client` to 6.12.0 [spark]
dongjoon-hyun commented on PR #46066: URL: https://github.com/apache/spark/pull/46066#issuecomment-2057906801 All K8s-related unit tests and integration tests passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]
gengliangwang commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1566492674 ## common/utils/src/main/scala/org/apache/spark/internal/Logging.scala: ## @@ -153,10 +153,42 @@ trait Logging { if (log.isDebugEnabled) log.debug(msg) } + protected def logDebug(entry: LogEntry): Unit = { +if (log.isDebugEnabled) { + withLogContext(entry.context) { +log.debug(entry.message) + } +} + } + + protected def logDebug(entry: LogEntry, throwable: Throwable): Unit = { Review Comment: Let's skip this one if it is no needed ## common/utils/src/main/scala/org/apache/spark/internal/Logging.scala: ## @@ -153,10 +153,42 @@ trait Logging { if (log.isDebugEnabled) log.debug(msg) } + protected def logDebug(entry: LogEntry): Unit = { +if (log.isDebugEnabled) { + withLogContext(entry.context) { +log.debug(entry.message) + } +} + } + + protected def logDebug(entry: LogEntry, throwable: Throwable): Unit = { +if (log.isDebugEnabled) { + withLogContext(entry.context) { +log.debug(entry.message, throwable) + } +} + } + protected def logTrace(msg: => String): Unit = { if (log.isTraceEnabled) log.trace(msg) } + protected def logTrace(entry: LogEntry): Unit = { +if (log.isTraceEnabled) { + withLogContext(entry.context) { +log.trace(entry.message) + } +} + } + + protected def logTrace(entry: LogEntry, throwable: Throwable): Unit = { Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]
gengliangwang commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1566492674 ## common/utils/src/main/scala/org/apache/spark/internal/Logging.scala: ## @@ -153,10 +153,42 @@ trait Logging { if (log.isDebugEnabled) log.debug(msg) } + protected def logDebug(entry: LogEntry): Unit = { +if (log.isDebugEnabled) { + withLogContext(entry.context) { +log.debug(entry.message) + } +} + } + + protected def logDebug(entry: LogEntry, throwable: Throwable): Unit = { Review Comment: Let's skip this one if it is no needed ## common/utils/src/main/scala/org/apache/spark/internal/Logging.scala: ## @@ -153,10 +153,42 @@ trait Logging { if (log.isDebugEnabled) log.debug(msg) } + protected def logDebug(entry: LogEntry): Unit = { +if (log.isDebugEnabled) { + withLogContext(entry.context) { +log.debug(entry.message) + } +} + } + + protected def logDebug(entry: LogEntry, throwable: Throwable): Unit = { +if (log.isDebugEnabled) { + withLogContext(entry.context) { +log.debug(entry.message, throwable) + } +} + } + protected def logTrace(msg: => String): Unit = { if (log.isTraceEnabled) log.trace(msg) } + protected def logTrace(entry: LogEntry): Unit = { +if (log.isTraceEnabled) { + withLogContext(entry.context) { +log.trace(entry.message) + } +} + } + + protected def logTrace(entry: LogEntry, throwable: Throwable): Unit = { Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47804] Add Dataframe cache debug log [spark]
anchovYu commented on code in PR #45990: URL: https://github.com/apache/spark/pull/45990#discussion_r1566490570 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -1609,6 +1609,19 @@ object SQLConf { .checkValues(StorageLevelMapper.values.map(_.name()).toSet) .createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name()) + val DATAFRAME_CACHE_LOG_LEVEL = buildConf("spark.sql.dataframeCache.logLevel") Review Comment: I kept the Dataframe cache naming to differentiate it from the RDD cache. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46022: URL: https://github.com/apache/spark/pull/46022#issuecomment-2057848786 @panbingkun Thanks for the works. LGTM except for some minor comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46022: URL: https://github.com/apache/spark/pull/46022#discussion_r1566462197 ## connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala: ## @@ -325,7 +327,8 @@ private[spark] class DirectKafkaInputDStream[K, V]( override def restore(): Unit = { batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) => - logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}") + logInfo(log"Restoring KafkaRDD for time ${MDC(TIME, t)} " + Review Comment: QQ: is the time here using ms? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46022: URL: https://github.com/apache/spark/pull/46022#discussion_r1566459384 ## connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala: ## @@ -391,10 +392,12 @@ private[kafka010] class KafkaDataConsumer( .getOrElse("") val walTime = System.nanoTime() - startTimestampNano -logInfo( - s"From Kafka $kafkaMeta read $totalRecordsRead records through $numPolls polls (polled " + - s" out $numRecordsPolled records), taking $totalTimeReadNanos nanos, during time span of " + - s"$walTime nanos." +logInfo(log"From Kafka ${MDC(CONSUMER, kafkaMeta)} read " + + log"${MDC(TOTAL_RECORDS_READ, totalRecordsRead)} records through " + + log"${MDC(COUNT_POLL, numPolls)} polls " + + log"(polled out ${MDC(COUNT_RECORDS_POLL, numRecordsPolled)} records), " + Review Comment: KAFKA_RECORDS_PULLED_COUNT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46022: URL: https://github.com/apache/spark/pull/46022#discussion_r1566459174 ## connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala: ## @@ -391,10 +392,12 @@ private[kafka010] class KafkaDataConsumer( .getOrElse("") val walTime = System.nanoTime() - startTimestampNano -logInfo( - s"From Kafka $kafkaMeta read $totalRecordsRead records through $numPolls polls (polled " + - s" out $numRecordsPolled records), taking $totalTimeReadNanos nanos, during time span of " + - s"$walTime nanos." +logInfo(log"From Kafka ${MDC(CONSUMER, kafkaMeta)} read " + + log"${MDC(TOTAL_RECORDS_READ, totalRecordsRead)} records through " + + log"${MDC(COUNT_POLL, numPolls)} polls " + Review Comment: KAFKA_PULLS_COUNT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org