[GitHub] [spark] gengliangwang commented on pull request #39100: [SPARK-41422][UI] Protobuf serializer for ExecutorStageSummaryWrapper
gengliangwang commented on PR #39100:
URL: https://github.com/apache/spark/pull/39100#issuecomment-1356109099

@techaddict Yes, it is for ExecutorStageSummaryWrapper. I just updated the title.
[GitHub] [spark] MaxGekk commented on a diff in pull request #38861: [SPARK-41294][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1203 / 1168
MaxGekk commented on code in PR #38861:
URL: https://github.com/apache/spark/pull/38861#discussion_r1051345163

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:

@@ -2018,15 +2007,29 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
         "tableColumns" -> expected.map(c => s"'${c.name}'").mkString(", "),
         "dataColumns" -> query.output.map(c => s"'${c.name}'").mkString(", ")))
   }
+
+  def numColumnsMismatchError(
+      operator: LogicalPlan,
+      firstNumColumns: Int,
+      invalidOrdinalNum: Int,
+      invalidNumColumns: Int): Throwable = {
+
+    def ordinalNumber(i: Int): String = i match {
+      case 0 => "first"
+      case 1 => "second"
+      case 2 => "third"
+      case i => s"${i + 1}th"
+    }

-  def cannotWriteNotEnoughColumnsToTableError(
-      tableName: String, expected: Seq[Attribute], query: LogicalPlan): Throwable = {
     new AnalysisException(
-      errorClass = "_LEGACY_ERROR_TEMP_1203",
+      errorClass = "NUM_COLUMNS_MISMATCH",
       messageParameters = Map(
-        "tableName" -> tableName,
-        "tableColumns" -> expected.map(c => s"'${c.name}'").mkString(", "),
-        "dataColumns" -> query.output.map(c => s"'${c.name}'").mkString(", ")))
+        "operator" -> toSQLStmt(operator.nodeName),

Review Comment:
> It's weird, maybe we should pin it to INSERT INTO?

So far, let's do that.
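[Editor's note] As an aside, a standalone copy of the `ordinalNumber` helper quoted above (extracted here only for illustration) shows its edge behavior: only the first three ordinals are special-cased, so larger indexes always get a bare "th" suffix.

```scala
// Standalone copy of the helper from the diff above, for illustration only.
def ordinalNumber(i: Int): String = i match {
  case 0 => "first"
  case 1 => "second"
  case 2 => "third"
  case i => s"${i + 1}th"
}

ordinalNumber(0)  // "first"
ordinalNumber(3)  // "4th"
ordinalNumber(20) // "21th" (not "21st": the helper does not handle -st/-nd/-rd suffixes)
```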
[GitHub] [spark] techaddict opened a new pull request, #39105: [SPARK-41426] Protobuf serializer for ResourceProfileWrapper
techaddict opened a new pull request, #39105:
URL: https://github.com/apache/spark/pull/39105

### What changes were proposed in this pull request?
Add Protobuf serializer for ResourceProfileWrapper

### Why are the changes needed?
Support fast and compact serialization/deserialization for ResourceProfileWrapper over RocksDB.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38922: [SPARK-41396][SQL][PROTOBUF] OneOf field support and recursion checks
SandishKumarHN commented on code in PR #38922:
URL: https://github.com/apache/spark/pull/38922#discussion_r1051328356

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:

@@ -38,6 +38,14 @@ private[sql] class ProtobufOptions(
   val parseMode: ParseMode =
     parameters.get("mode").map(ParseMode.fromString).getOrElse(FailFastMode)
+
+  // Setting the `recursive.fields.max.depth` to 0 allows the field to be recurse once,

Review Comment:
@rangadi Thank you for your suggestion. I have implemented it by adding a comment and a unit test to make the example more clear to users.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38922: [SPARK-41396][SQL][PROTOBUF] OneOf field support and recursion checks
SandishKumarHN commented on code in PR #38922:
URL: https://github.com/apache/spark/pull/38922#discussion_r1051328177

## core/src/main/resources/error/error-classes.json:

@@ -1016,7 +1016,7 @@
   },
   "RECURSIVE_PROTOBUF_SCHEMA" : {
     "message" : [
-      "Found recursive reference in Protobuf schema, which can not be processed by Spark: "
+      "Found recursive reference in Protobuf schema, which can not be processed by Spark by default: . try setting the option `recursive.fields.max.depth` as 0 or 1 or 2. Going beyond 3 levels of recursion is not allowed."

Review Comment:
@rangadi agree.
[GitHub] [spark] techaddict opened a new pull request, #39104: [SPARK-41425] Protobuf serializer for RDDStorageInfoWrapper
techaddict opened a new pull request, #39104:
URL: https://github.com/apache/spark/pull/39104

### What changes were proposed in this pull request?
Add Protobuf serializer for RDDStorageInfoWrapper

### Why are the changes needed?
Support fast and compact serialization/deserialization for RDDStorageInfoWrapper over RocksDB.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT
[GitHub] [spark] techaddict commented on pull request #39100: [SPARK-41422][UI] Protobuf serializer for ExecutorStageSummary
techaddict commented on PR #39100:
URL: https://github.com/apache/spark/pull/39100#issuecomment-1356054846

@gengliangwang is this not for https://issues.apache.org/jira/browse/SPARK-41427? This one looks like a PR for `ExecutorStageSummaryWrapper`, whereas issue [SPARK-41422] was for `ExecutorSummaryWrapper`.
[GitHub] [spark] techaddict commented on pull request #39093: [SPARK-41420][UI] Protobuf serializer for ApplicationInfoWrapper
techaddict commented on PR #39093:
URL: https://github.com/apache/spark/pull/39093#issuecomment-1356050497

@gengliangwang thanks for the review, updated the PR
[GitHub] [spark] shuyouZZ commented on pull request #38983: [SPARK-41447][CORE] Reduce the number of `doMergeApplicationListing` invocations
shuyouZZ commented on PR #38983:
URL: https://github.com/apache/spark/pull/38983#issuecomment-1356044762

> @shuyouZZ, I added you to the Apache Spark contributor group and assigned [SPARK-41447](https://issues.apache.org/jira/browse/SPARK-41447) to you. Welcome to the Apache Spark community!

Thanks for your review. @dongjoon-hyun @mridulm @thejdeep
[GitHub] [spark] shardulm94 commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
shardulm94 commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1356030772

I tried looking into this a bit.

> @EnricoMi @cloud-fan Could we fix the DeduplicateRelations? It did not generate different expression IDs for all conflicting attributes:

As @EnricoMi said, `DeduplicateRelations` only considers the output attrs of the left and right, which do not conflict here. Also, the `Project` case in `PushDownLeftSemiAntiJoin` calls [this method](https://github.com/apache/spark/blob/45d9daa2ecf6081ef1d031065a9c0e9a3a7f7a58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala#L119), which seems to check for the self-join case based on conflicting expression IDs. This makes me believe the duplicate expression IDs are expected here, and hence DeduplicateRelations may not be at fault.

Similar to the `Project` case, should we add a check like `canPushThroughCondition(Seq(agg.child), joinCond, rightOp)` to ensure that it is safe to push the join down an `Aggregate` node too? A sketch of that check follows below.
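[Editor's note] For reference, a standalone sketch (hypothetical function name, Spark-internal classes) of what such a guard would check. It mirrors the described behavior of `canPushThroughCondition` rather than quoting its actual body: the push-down is ambiguous if the join condition references an attribute that, by expression ID, is visible both under the aggregate's child and in the right operand.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical guard before pushing a left-semi/anti join below an Aggregate.
def safeToPushBelowAggregate(
    aggChild: LogicalPlan,
    joinCond: Option[Expression],
    rightOp: LogicalPlan): Boolean = {
  val attributes = AttributeSet(aggChild.output)
  joinCond.forall { cond =>
    // Empty intersection means no attribute referenced by the condition is
    // claimed (by exprId) on both sides, so the push-down stays unambiguous.
    cond.references.intersect(rightOp.outputSet).intersect(attributes).isEmpty
  }
}
```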
[GitHub] [spark] wankunde commented on pull request #38560: [SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
wankunde commented on PR #38560:
URL: https://github.com/apache/spark/pull/38560#issuecomment-1356030075

Hi @yabola, @mridulm, I will update SPARK-40480 this weekend.
[GitHub] [spark] wankunde commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics
wankunde commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1356027243

Hi @jackylee-ch, @melin, any update?
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #39082: [SPARK-41539][SQL] Remap stats and constraints against output in logical plan for LogicalRDD
HeartSaVioR commented on code in PR #39082:
URL: https://github.com/apache/spark/pull/39082#discussion_r1051306854

## sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala:

@@ -183,16 +172,80 @@ object LogicalRDD {
     }
   }

+    val logicalPlan = originDataset.logicalPlan
     val optimizedPlan = originDataset.queryExecution.optimizedPlan
     val executedPlan = originDataset.queryExecution.executedPlan

+    val (stats, constraints) = rewriteStatsAndConstraints(logicalPlan, optimizedPlan)
+
     LogicalRDD(
       originDataset.logicalPlan.output,
       rdd,
       firstLeafPartitioning(executedPlan.outputPartitioning),
       executedPlan.outputOrdering,
       isStreaming
-    )(originDataset.sparkSession, Some(optimizedPlan.stats), Some(optimizedPlan.constraints))
+    )(originDataset.sparkSession, stats, constraints)
   }
+
+  private[sql] def buildOutputAssocForRewrite(
+      source: Seq[Attribute],
+      destination: Seq[Attribute]): Option[Map[Attribute, Attribute]] = {
+    // We check the name and type, allowing nullability, exprId, metadata, qualifier be different
+    // E.g. This could happen during optimization phase.
+    val rewrite = source.zip(destination).flatMap { case (attr1, attr2) =>
+      if (attr1.name == attr2.name && attr1.dataType == attr2.dataType) {
+        Some(attr1 -> attr2)
+      } else {
+        None
+      }
+    }.toMap
+
+    if (rewrite.size == source.size) {
+      Some(rewrite)
+    } else {
+      None
+    }
+  }
+
+  private[sql] def rewriteStatsAndConstraints(
+      logicalPlan: LogicalPlan,
+      optimizedPlan: LogicalPlan): (Option[Statistics], Option[ExpressionSet]) = {
+    val rewrite = buildOutputAssocForRewrite(optimizedPlan.output, logicalPlan.output)
+
+    rewrite.map { rw =>
+      val rewrittenStatistics = rewriteStatistics(optimizedPlan.stats, rw)
+      val rewrittenConstraints = rewriteConstraints(optimizedPlan.constraints, rw)
+
+      (Some(rewrittenStatistics), Some(rewrittenConstraints))
+    }.getOrElse {
+      // can't rewrite stats and constraints, give up
+      logWarning("The output columns are expected to the same (for name and type) for output " +

Review Comment:
Yeah, my understanding is that the final output after optimization should be semantically the same. Looks like exprId can change, but I'm not sure what else we allow to be changed. I'm also just speculating.

My feeling is that changes to type or nullability could also break a DSv1 sink if the change is not done in a "compatible way". E.g., a table in the sink has a column's nullability set to "false", and the optimizer somehow (shouldn't happen, but) changes the nullability of a column from "false" to "true" (again, shouldn't happen). The same applies to data type as well. I feel column order also matters, but I'm not sure we can assume the consumer accesses columns by name.
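[Editor's note] To make the matching rule concrete, a small standalone sketch (hypothetical attribute values, using catalyst's `AttributeReference` directly) of the name-and-type check that `buildOutputAssocForRewrite` applies: exprId and nullability are allowed to differ.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{IntegerType, StringType}

// Output of the optimized plan: fresh exprIds, nullability tightened.
val source = Seq(
  AttributeReference("id", IntegerType, nullable = false)(),
  AttributeReference("name", StringType, nullable = true)())

// Output of the original logical plan.
val destination = Seq(
  AttributeReference("id", IntegerType, nullable = true)(),
  AttributeReference("name", StringType, nullable = true)())

// Same pairwise check as in the PR: name and dataType must line up.
val rewrite = source.zip(destination).collect {
  case (a1, a2) if a1.name == a2.name && a1.dataType == a2.dataType => a1 -> a2
}.toMap

assert(rewrite.size == source.size) // all columns matched, so stats can be remapped
```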
[GitHub] [spark] dengziming commented on a diff in pull request #38984: [SPARK-41349][CONNECT][PYTHON] Implement DataFrame.hint
dengziming commented on code in PR #38984:
URL: https://github.com/apache/spark/pull/38984#discussion_r1051305047

## python/pyspark/sql/connect/dataframe.py:

@@ -875,6 +875,30 @@ def to_jcols(
     melt = unpivot

+    def hint(self, name: str, *params: Any) -> "DataFrame":
+        """
+        Specifies some hint on the current DataFrame. As an example, the following code specifies
+        that one of the plan can be broadcasted: `df1.join(df2.hint("broadcast"))`
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        name: str
+            the name of the hint, for example, "broadcast", "SHUFFLE_MERGE" and "shuffle_hash".
+        params: tuple
+            the parameters of the hint

Review Comment:
Yeah, it's worth improving the docs of the parameters.

## python/pyspark/sql/connect/plan.py:

@@ -343,6 +343,51 @@ def _repr_html_(self) -> str:
         """

+class Hint(LogicalPlan):
+    """Logical plan object for a Hint operation."""
+
+    def __init__(self, child: Optional["LogicalPlan"], name: str, params: List[Any]) -> None:
+        super().__init__(child)
+        self.name = name
+        self.params = params
+
+    def _convert_value(self, v: Any) -> proto.Expression.Literal:
+        value = proto.Expression.Literal()
+        if v is None:
+            value.null = True
+        elif isinstance(v, int):
+            value.integer = v
+        else:
+            value.string = v
+        return value

Review Comment:
I improved the error handling logic here, and there are 4 occurrences of this similar logic; I'm planning to refactor it to reuse the existing code.

## python/pyspark/sql/tests/connect/test_connect_basic.py:

@@ -829,6 +829,13 @@ def test_with_columns(self):
             .toPandas(),
         )

+    def test_hint(self):
+        # SPARK-41349: Test hint

Review Comment:
Good catch, I added 4 more test cases for it.
[GitHub] [spark] dengziming commented on pull request #38984: [SPARK-41349][CONNECT][PYTHON] Implement DataFrame.hint
dengziming commented on PR #38984:
URL: https://github.com/apache/spark/pull/38984#issuecomment-1356014591

@grundprinzip Thank you for your reviews, I have resolved these comments.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39093: [SPARK-41420][UI] Protobuf serializer for ApplicationInfoWrapper
gengliangwang commented on code in PR #39093:
URL: https://github.com/apache/spark/pull/39093#discussion_r1051304099

## core/src/test/scala/org/apache/spark/status/protobuf/KVStoreProtobufSerializerSuite.scala:

@@ -176,4 +176,59 @@ class KVStoreProtobufSerializerSuite extends SparkFunSuite {
     assert(result.stageId == input.stageId)
     assert(result.stageAttemptId == input.stageAttemptId)
   }
+
+  test("Application Info") {
+    val attempts: Seq[ApplicationAttemptInfo] = Seq(
+      ApplicationAttemptInfo(
+        attemptId = Some("001"),
+        startTime = new Date(1),
+        endTime = new Date(10),
+        lastUpdated = new Date(10),

Review Comment:
Let's have different values in the test case to avoid mistakes.
[GitHub] [spark] beliefer commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
beliefer commented on PR #39091:
URL: https://github.com/apache/spark/pull/39091#issuecomment-1356003824

> @beliefer thanks for working on this. I have one question: how are we going to get the observed metrics to the client? This seems to be missing from the implementation. One of the approaches would be to send it in a similar way as the metrics in the result code path.

Good question. The result of datasets can be passed back by the gRPC server, but the ObservationListener runs on the server, so it seems we need another way to get the metrics.
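[Editor's note] For context, this is how the existing (non-Connect) Observation API surfaces metrics on the driver: a minimal runnable sketch with made-up data. The listener-backed `observation.get` call is the piece that has no obvious Connect-client transport yet.

```scala
import org.apache.spark.sql.{Observation, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 10L), (2, 20L)).toDF("id", "value")

// Observation registers an ObservationListener with the session's
// ExecutionListenerManager under the hood.
val observation = Observation("stats")
val observed = df.observe(observation, count(lit(1)).as("rows"), max($"value").as("max_value"))

observed.collect()            // running an action triggers the listener
val metrics = observation.get // Map("rows" -> 2, "max_value" -> 20)
```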
[GitHub] [spark] mridulm commented on a diff in pull request #39093: [SPARK-41420][UI] Protobuf serializer for ApplicationInfoWrapper
mridulm commented on code in PR #39093:
URL: https://github.com/apache/spark/pull/39093#discussion_r1051302962

## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto:

@@ -106,3 +106,28 @@ message TaskDataWrapper {
   int64 stage_id = 40;
   int32 stage_attempt_id = 41;
 }
+
+message ApplicationAttemptInfo {
+  optional string attempt_id = 1;
+  int64 start_time = 2;
+  int64 end_time = 3;
+  int64 last_updated = 4;
+  int64 duration = 5;
+  string spark_user = 6;
+  bool completed = 7;
+  string app_spark_version = 8;
+}
+
+message ApplicationInfo {
+  string id = 1;
+  string name = 2;
+  optional int32 cores_granted = 3;
+  optional int32 max_cores = 4;
+  optional int32 cores_per_executor = 5;
+  optional int32 memory_per_executor_mb = 6;
+  repeated ApplicationAttemptInfo attempts = 7;
+}
+
+message ApplicationInfoWrapper {
+  ApplicationInfo info = 1;

Review Comment:
My bad, ended up looking at the wrong file :) Thx for clarifying!
[GitHub] [spark] beliefer commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
beliefer commented on code in PR #39091:
URL: https://github.com/apache/spark/pull/39091#discussion_r1051302794

## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:

@@ -338,6 +340,22 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }

+  private def transformCollectMetrics(rel: proto.CollectMetrics): LogicalPlan = {
+    val metrics = rel.getMetricsList.asScala.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getIsObservation) {

Review Comment:
The Observation registers `ObservationListener` on `ExecutionListenerManager`.
[GitHub] [spark] beliefer commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
beliefer commented on code in PR #39091:
URL: https://github.com/apache/spark/pull/39091#discussion_r1051302541

## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:

@@ -126,6 +126,20 @@ package object dsl {
         Expression.UnresolvedFunction.newBuilder().setFunctionName("min").addArguments(e))
       .build()

+    def proto_max(e: Expression): Expression =

Review Comment:
I just followed the existing `proto_min`.
[GitHub] [spark] beliefer commented on a diff in pull request #39084: [SPARK-41464][CONNECT][PYTHON] Implement `DataFrame.to`
beliefer commented on code in PR #39084:
URL: https://github.com/apache/spark/pull/39084#discussion_r1051301969

## python/pyspark/sql/connect/dataframe.py:

@@ -1631,6 +1632,47 @@ def inputFiles(self) -> List[str]:
         query = self._plan.to_proto(self._session.client)
         return self._session.client._analyze(query).input_files

+    def to(self, schema: Union[DataType, str]) -> "DataFrame":

Review Comment:
I have created https://github.com/apache/spark/pull/39103 to let `pyspark_types_to_proto_types` support `StructType`.
[GitHub] [spark] beliefer opened a new pull request, #39103: [SPARK-41546][CONNECT][PYTHON] `pyspark_types_to_proto_types` should support StructType.
beliefer opened a new pull request, #39103:
URL: https://github.com/apache/spark/pull/39103

### What changes were proposed in this pull request?
Currently, `pyspark_types_to_proto_types` is used to transform PySpark data types into protobuf data types, but it only supports primitive data types, not struct types. Many Connect APIs need to transform a PySpark struct type into a protobuf struct type, for example `createDataFrame`, `DataFrame.to`, and so on.

### Why are the changes needed?
This PR lets `pyspark_types_to_proto_types` support `StructType`.

### Does this PR introduce _any_ user-facing change?
No. New feature.

### How was this patch tested?
New tests.
[GitHub] [spark] yabola commented on pull request #39087: [SPARK-41365][UI][3.3] Stages UI page fails to load for proxy in specific yarn environment
yabola commented on PR #39087:
URL: https://github.com/apache/spark/pull/39087#issuecomment-1355977687

@dongjoon-hyun Thank you very much!
[GitHub] [spark] rangadi commented on a diff in pull request #38922: [SPARK-41396][SQL][PROTOBUF] OneOf field support and recursion checks
rangadi commented on code in PR #38922:
URL: https://github.com/apache/spark/pull/38922#discussion_r1051293916

## connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:

@@ -693,4 +693,435 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
       errorClass = "CANNOT_CONSTRUCT_PROTOBUF_DESCRIPTOR",
       parameters = Map("descFilePath" -> testFileDescriptor))
   }
+
+  test("Verify OneOf field between from_protobuf -> to_protobuf and struct -> from_protobuf") {
+    val descriptor = ProtobufUtils.buildDescriptor(testFileDesc, "OneOfEvent")
+    val oneOfEvent = OneOfEvent.newBuilder()
+      .setKey("key")
+      .setCol1(123)
+      .setCol3(109202L)
+      .setCol2("col2value")
+      .addCol4("col4value").build()
+
+    val df = Seq(oneOfEvent.toByteArray).toDF("value")
+
+    checkWithFileAndClassName("OneOfEvent") {
+      case (name, descFilePathOpt) =>
+        val fromProtoDf = df.select(
+          from_protobuf_wrapper($"value", name, descFilePathOpt) as 'sample)
+        val toDf = fromProtoDf.select(
+          to_protobuf_wrapper($"sample", name, descFilePathOpt) as 'toProto)
+        val toFromDf = toDf.select(
+          from_protobuf_wrapper($"toProto", name, descFilePathOpt) as 'fromToProto)
+        checkAnswer(fromProtoDf, toFromDf)
+        val actualFieldNames = fromProtoDf.select("sample.*").schema.fields.toSeq.map(f => f.name)
+        descriptor.getFields.asScala.map(f => {
+          assert(actualFieldNames.contains(f.getName))
+        })
+
+        val eventFromSpark = OneOfEvent.parseFrom(
+          toDf.select("toProto").take(1).toSeq(0).getAs[Array[Byte]](0))
+        // OneOf field: the last set value (by order) will overwrite all previous ones.
+        assert(eventFromSpark.getCol2.equals("col2value"))
+        assert(eventFromSpark.getCol3 == 0)
+        val expectedFields = descriptor.getFields.asScala.map(f => f.getName)
+        eventFromSpark.getDescriptorForType.getFields.asScala.map(f => {
+          assert(expectedFields.contains(f.getName))
+        })
+
+        val jsonSchema =
+          s"""
+             |{
+             |  "type" : "struct",
+             |  "fields" : [ {
+             |    "name" : "sample",
+             |    "type" : {
+             |      "type" : "struct",
+             |      "fields" : [ {
+             |        "name" : "key",
+             |        "type" : "string",
+             |        "nullable" : true
+             |      }, {
+             |        "name" : "col_1",
+             |        "type" : "integer",
+             |        "nullable" : true
+             |      }, {
+             |        "name" : "col_2",
+             |        "type" : "string",
+             |        "nullable" : true
+             |      }, {
+             |        "name" : "col_3",
+             |        "type" : "long",
+             |        "nullable" : true
+             |      }, {
+             |        "name" : "col_4",
+             |        "type" : {
+             |          "type" : "array",
+             |          "elementType" : "string",
+             |          "containsNull" : false
+             |        },
+             |        "nullable" : false
+             |      } ]
+             |    },
+             |    "nullable" : true
+             |  } ]
+             |}
+             |""".stripMargin
+        val schema = DataType.fromJson(jsonSchema).asInstanceOf[StructType]
+        val data = Seq(Row(Row("key", 123, "col2value", 109202L, Seq("col4value"))))
+        val dataDf = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
+        val dataDfToProto = dataDf.select(
+          to_protobuf_wrapper
[GitHub] [spark] rangadi commented on a diff in pull request #38922: [SPARK-41396][SQL][PROTOBUF] OneOf field support and recursion checks
rangadi commented on code in PR #38922:
URL: https://github.com/apache/spark/pull/38922#discussion_r1051292604

## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:

@@ -38,6 +38,14 @@ private[sql] class ProtobufOptions(
   val parseMode: ParseMode =
     parameters.get("mode").map(ParseMode.fromString).getOrElse(FailFastMode)
+
+  // Setting the `recursive.fields.max.depth` to 0 allows the field to be recurse once,

Review Comment:
'0' disables recursion, right? Why once? This might be a difference in terminology. That's why giving a quick example is better. Could you add this example? Consider a simple recursive proto: `message Person { string name = 1; Person bff = 2 }`.

> What would be the spark schema when recursion is 0, 1, and 2?

I think:
- 0: struct<name: string>
- 1: struct<name: string, bff: struct<name: string>>
- 2: struct<name: string, bff: struct<name: string, bff: struct<name: string>>>
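[Editor's note] For readers following along, a usage sketch of the option under discussion. The descriptor path, input table, and message name are hypothetical, and it assumes the `from_protobuf` overload that accepts an options map; the exact signature may differ by Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical input: serialized Person messages in a binary `value` column,
// plus a compiled descriptor file for the recursive Person proto above.
val binaryDf = spark.read.parquet("/tmp/person_events").select(col("value"))

// Allow one level of recursion instead of failing with RECURSIVE_PROTOBUF_SCHEMA.
val options = java.util.Collections.singletonMap("recursive.fields.max.depth", "1")

val parsed = binaryDf.select(
  from_protobuf(col("value"), "Person", "/tmp/person.desc", options).as("person"))
// With depth 1, `person` would have a schema like
// struct<name: string, bff: struct<name: string>> per the comment above.
```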
[GitHub] [spark] gengliangwang closed pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
gengliangwang closed pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
URL: https://github.com/apache/spark/pull/39096
[GitHub] [spark] gengliangwang commented on pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
gengliangwang commented on PR #39096:
URL: https://github.com/apache/spark/pull/39096#issuecomment-1355950671

Thanks, merging to master
[GitHub] [spark] monkeyboy123 opened a new pull request, #39102: [SPARK-41555][SQL] Multi sparkSession should share single SQLAppStatusStore
monkeyboy123 opened a new pull request, #39102:
URL: https://github.com/apache/spark/pull/39102

### What changes were proposed in this pull request?
It is a bug fix.

### Why are the changes needed?
It is a waste of memory, which causes frequent GC.

### Does this PR introduce _any_ user-facing change?
Before this PR: it causes multiple SQL tabs in the UI; besides, multiple SQLAppStatusListeners exist in Spark, which wastes memory. See the code in JIRA [SPARK-41555](https://issues.apache.org/jira/browse/SPARK-41555).

### How was this patch tested?
Existing UT.
[GitHub] [spark] monkeyboy123 closed pull request #39101: [SPARK-41555][SQL] Multi sparkSession should share single SQLAppStatusStore
monkeyboy123 closed pull request #39101: [SPARK-41555][SQL] Multi sparkSession should share single SQLAppStatusStore
URL: https://github.com/apache/spark/pull/39101
[GitHub] [spark] zhenlineo commented on a diff in pull request #39078: [SPARK-41534][CONNECT][SQL] Setup initial client module for Spark Connect
zhenlineo commented on code in PR #39078: URL: https://github.com/apache/spark/pull/39078#discussion_r1051273721 ## project/SparkBuild.scala: ## @@ -1433,7 +1492,10 @@ object CopyDependencies { if (jar.getName.contains("spark-connect") && !SbtPomKeys.profiles.value.contains("noshade-connect")) { Files.copy(fid.toPath, destJar.toPath) - } else if (jar.getName.contains("spark-protobuf") && + } else if (jar.getName.contains("connect-client") && +!SbtPomKeys.profiles.value.contains("noshade-protobuf")) { Review Comment: Just double check if you really meant to use `spark-protobuf` or it is a copy-past typo? The values to these three jars are a bit magical. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] monkeyboy123 opened a new pull request, #39101: [SPARK-41555][SQL] Multi sparkSession should share single SQLAppStatusStore
monkeyboy123 opened a new pull request, #39101:
URL: https://github.com/apache/spark/pull/39101

### What changes were proposed in this pull request?
It is a bug fix.

### Why are the changes needed?
It is a waste of memory, which causes frequent GC.

### Does this PR introduce _any_ user-facing change?
Before this PR: it causes multiple SQL tabs in the UI; besides, multiple SQLAppStatusListeners exist in Spark, which wastes memory. See the code in JIRA [SPARK-41555](https://issues.apache.org/jira/browse/SPARK-41555).

### How was this patch tested?
Existing UT.
[GitHub] [spark] dongjoon-hyun commented on pull request #38357: [SPARK-40887][K8S] Allow Spark on K8s to integrate w/ Log Service
dongjoon-hyun commented on PR #38357:
URL: https://github.com/apache/spark/pull/38357#issuecomment-1355886073

Hi, @pan3793. I'm interested in this PR for Apache Spark 3.4.0. To do that, we need to split this PR in order to discuss and test more. I propose the following first. Please let me know if you can revise.
- Spin off the pod env variable contributions (`SPARK_DRIVER_POD_NAME`).
- Reduce the code change (e.g., avoid a whole new class like `KubernetesCoarseGrainedExecutorBackend`).
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38357: [SPARK-40887][K8S] Allow Spark on K8s to integrate w/ Log Service
dongjoon-hyun commented on code in PR #38357:
URL: https://github.com/apache/spark/pull/38357#discussion_r1051268736

## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala:

@@ -65,6 +65,7 @@ private[spark] object Constants {
   val ENV_JAVA_OPT_PREFIX = "SPARK_JAVA_OPT_"
   val ENV_CLASSPATH = "SPARK_CLASSPATH"
   val ENV_DRIVER_BIND_ADDRESS = "SPARK_DRIVER_BIND_ADDRESS"
+  val ENV_DRIVER_POD_NAME = "SPARK_DRIVER_POD_NAME"

Review Comment:
We had better put this line before `ENV_DRIVER_BIND_ADDRESS`. BTW, could you spin off the `KUBERNETES_POD_NAME` and `SPARK_DRIVER_POD_NAME` addition to a new PR, @pan3793?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38357: [SPARK-40887][K8S] Allow Spark on K8s to integrate w/ Log Service
dongjoon-hyun commented on code in PR #38357:
URL: https://github.com/apache/spark/pull/38357#discussion_r1051268075

## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala:

@@ -49,7 +49,7 @@ private[spark] object KubernetesExecutorBackend extends Logging {

   def main(args: Array[String]): Unit = {
     val createFn: (RpcEnv, Arguments, SparkEnv, ResourceProfile, String) =>
       CoarseGrainedExecutorBackend = { case (rpcEnv, arguments, env, resourceProfile, execId) =>
-      new CoarseGrainedExecutorBackend(rpcEnv, arguments.driverUrl, execId,
+      new KubernetesCoarseGrainedExecutorBackend(rpcEnv, arguments.driverUrl, execId,
         arguments.bindAddress, arguments.hostname, arguments.cores,
         env, arguments.resourcesFileOpt, resourceProfile)

Review Comment:
Instead of creating a new class `KubernetesCoarseGrainedExecutorBackend`, can we do the following simply?

```scala
      new CoarseGrainedExecutorBackend(rpcEnv, arguments.driverUrl, execId,
        arguments.bindAddress, arguments.hostname, arguments.cores,
-        env, arguments.resourcesFileOpt, resourceProfile)
+        env, arguments.resourcesFileOpt, resourceProfile) {
+        override def getDriverAttributes: Option[Map[String, String]] = Some(
+          super.getDriverAttributes.getOrElse(Map.empty) ++ Map(
+            "APP_ID" -> System.getenv(ENV_APPLICATION_ID),
+            "KUBERNETES_NAMESPACE" -> conf.get(KUBERNETES_NAMESPACE),
+            "KUBERNETES_POD_NAME" -> System.getenv(ENV_DRIVER_POD_NAME)))
+      }
```

Then, we can remove the following file from this PR.
- resource-managers/kubernetes/core/src/main/scala/org/apache/spark/executor/KubernetesCoarseGrainedExecutorBackend.scala
[GitHub] [spark] gengliangwang commented on a diff in pull request #39100: [SPARK-41422][UI] Protobuf serializer for ExecutorStageSummary
gengliangwang commented on code in PR #39100:
URL: https://github.com/apache/spark/pull/39100#discussion_r1051265349

## core/src/main/scala/org/apache/spark/status/protobuf/ExecutorMetricsSerializer.scala:

(Apache License 2.0 header omitted)

package org.apache.spark.status.protobuf

import org.apache.spark.executor.ExecutorMetrics
import org.apache.spark.metrics.ExecutorMetricType

object ExecutorMetricsSerializer {
  def serialize(e: ExecutorMetrics): StoreTypes.ExecutorMetrics = {
    val builder = StoreTypes.ExecutorMetrics.newBuilder()
    ExecutorMetricType.metricToOffset.foreach { case (metric, _) =>
      builder.putMetrics(metric, e.getMetricValue(metric))
      metric -> e.getMetricValue(metric)
    }
    builder.build()
  }

  def deserialize(binary: StoreTypes.ExecutorMetrics): ExecutorMetrics = {

Review Comment (on `serialize`):
This follows ExecutorMetricsJsonSerializer.

Review Comment (on `deserialize`):
This follows ExecutorMetricsJsonDeserializer.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38357: [SPARK-40887][K8S] Allow Spark on K8s to integrate w/ Log Service
dongjoon-hyun commented on code in PR #38357:
URL: https://github.com/apache/spark/pull/38357#discussion_r1051265233

## core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala:

@@ -74,14 +76,24 @@ private[spark] trait SchedulerBackend {
    * Executors tab for the driver.
    * @return Map containing the log names and their respective URLs
    */
-  def getDriverLogUrls: Option[Map[String, String]] = None
+  def getDriverLogUrls: Option[Map[String, String]] = {
+    val prefix = "SPARK_DRIVER_LOG_URL_"
+    val driverLogUrls = sys.env.filterKeys(_.startsWith(prefix))
+      .map(e => (e._1.substring(prefix.length).toLowerCase(Locale.ROOT), e._2)).toMap
+    if (driverLogUrls.nonEmpty) Some(driverLogUrls) else None
+  }

   /**
    * Get the attributes on driver. These attributes are used to replace log URLs when
    * custom log url pattern is specified.
    * @return Map containing attributes on driver.
    */
-  def getDriverAttributes: Option[Map[String, String]] = None
+  def getDriverAttributes: Option[Map[String, String]] = {
+    val prefix = "SPARK_DRIVER_ATTRIBUTE_"
+    val driverAttributes = sys.env.filterKeys(_.startsWith(prefix))
+      .map(e => (e._1.substring(prefix.length).toUpperCase(Locale.ROOT), e._2)).toMap
+    if (driverAttributes.nonEmpty) Some(driverAttributes) else None
+  }

Review Comment:
This looks logically duplicated in both `getDriverLogUrls` and `getDriverAttributes`, except for the variable names. Could you try to refactor more?
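[Editor's note] One possible shape for that refactor, as a sketch only, with a hypothetical helper name `envWithPrefix`; note the two methods normalize key case differently, so the normalization has to be a parameter.

```scala
import java.util.Locale

// Hypothetical shared helper; both methods delegate to it.
private def envWithPrefix(
    prefix: String,
    normalize: String => String): Option[Map[String, String]] = {
  val entries = sys.env.filterKeys(_.startsWith(prefix))
    .map { case (k, v) => normalize(k.substring(prefix.length)) -> v }.toMap
  if (entries.nonEmpty) Some(entries) else None
}

def getDriverLogUrls: Option[Map[String, String]] =
  envWithPrefix("SPARK_DRIVER_LOG_URL_", _.toLowerCase(Locale.ROOT))

def getDriverAttributes: Option[Map[String, String]] =
  envWithPrefix("SPARK_DRIVER_ATTRIBUTE_", _.toUpperCase(Locale.ROOT))
```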
[GitHub] [spark] gengliangwang commented on pull request #39100: [SPARK-41422][UI] Protobuf serializer for ExecutorStageSummary
gengliangwang commented on PR #39100: URL: https://github.com/apache/spark/pull/39100#issuecomment-1355874010 cc @techaddict
[GitHub] [spark] gengliangwang opened a new pull request, #39100: [SPARK-41422][UI] Protobuf serializer for ExecutorStageSummary
gengliangwang opened a new pull request, #39100: URL: https://github.com/apache/spark/pull/39100 ### What changes were proposed in this pull request? Add Protobuf serializer for ExecutorStageSummary ### Why are the changes needed? Support fast and compact serialization/deserialization for ExecutorStageSummary over RocksDB. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT
[GitHub] [spark] github-actions[bot] closed pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor
github-actions[bot] closed pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor URL: https://github.com/apache/spark/pull/37411
[GitHub] [spark] github-actions[bot] closed pull request #37790: [SPARK-40288][SQL] After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use
github-actions[bot] closed pull request #37790: [SPARK-40288][SQL] After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression URL: https://github.com/apache/spark/pull/37790
[GitHub] [spark] github-actions[bot] commented on pull request #37532: [SPARK-39989][SQL][FollowUp] Improve foldable expression stats estimate for string and binary
github-actions[bot] commented on PR #37532: URL: https://github.com/apache/spark/pull/37532#issuecomment-1355860756 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #37793: add sparksql wirte mysql support update ,the design from replace into…
github-actions[bot] closed pull request #37793: add sparksql wirte mysql support update ,the design from replace into… URL: https://github.com/apache/spark/pull/37793
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39078: [SPARK-41534][CONNECT][SQL] Setup initial client module for Spark Connect
dongjoon-hyun commented on code in PR #39078: URL: https://github.com/apache/spark/pull/39078#discussion_r1051253194 ## connector/connect/client/src/main/scala/org/apache/spark/sql/connect/client/SparkSession.scala: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connect.client + +import scala.language.existentials + +import io.grpc.{ManagedChannel, ManagedChannelBuilder} +import org.apache.arrow.memory.RootAllocator + +import org.apache.spark.SPARK_VERSION +import org.apache.spark.connect.proto + + +class SparkSession( +private val userContext: proto.UserContext, +private val channel: ManagedChannel) + extends AutoCloseable { + private[this] val stub = proto.SparkConnectServiceGrpc.newBlockingStub(channel) + + private[this] val allocator = new RootAllocator() + + /** + * The version of Spark on which this application is running. + */ + def version: String = SPARK_VERSION + + /** + * Returns a `DataFrame` with no rows or columns. + * + * @since 3.4.0 + */ + @transient + lazy val emptyDataFrame: Dataset = newDataset { builder => +builder.getLocalRelationBuilder + } + + /** + * Creates a [[Dataset]] with a single `LongType` column named `id`, containing elements + * in a range from `start` to `end` (exclusive) with a step value, with partition number + * specified. + * + * @since 2.0.0 + */ + def range(start: Long, end: Long, step: Long, numPartitions: Int): Dataset = { +range(start, end, step, Option(numPartitions)) + } + + private def range(start: Long, end: Long, step: Long, numPartitions: Option[Int]): Dataset = { +newDataset { builder => + val rangeBuilder = builder.getRangeBuilder +.setStart(start) +.setEnd(end) +.setStep(step) + numPartitions.foreach(rangeBuilder.setNumPartitions) +} + } + + /** + * Executes a SQL query using Spark, returning the result as a `DataFrame`. + * This API eagerly runs DDL/DML commands, but not for SELECT queries. + * + * @since 2.0.0 + */ + def sql(query: String): Dataset = newDataset { builder => +builder.setSql(proto.SQL.newBuilder().setQuery(query)) + } + + private[client] def newDataset(f: proto.Relation.Builder => Unit): Dataset = { +val builder = proto.Relation.newBuilder() +f(builder) +val plan = proto.Plan.newBuilder().setRoot(builder).build() +new Dataset(this, plan) + } + + private[client] def analyze(plan: proto.Plan): proto.AnalyzePlanResponse = { +val request = proto.AnalyzePlanRequest.newBuilder() + .setPlan(plan) + .setUserContext(userContext) + .build() +stub.analyzePlan(request) + } + + override def close(): Unit = { +channel.shutdownNow() +allocator.close() + } +} + +object SparkSession { + def builder(): Builder = new Builder() + + class Builder() { +private val userContextBuilder = proto.UserContext.newBuilder() +private var _host: String = "localhost" +private var _port: Int = 15002 Review Comment: Please avoid a magic number. If we cannot use the existing `CONNECT_GRPC_BINDING_PORT`, we may need to move the config definition from `server` to `common` first. https://github.com/apache/spark/blob/f3be74e3a421224bb65a01e658de96006de2a4ac/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala#L24-L28
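For reference, the linked server-side definition uses Spark's `ConfigBuilder`; a hedged sketch of what a shared definition plus its use in the client builder might look like (the key name and default mirror the linked lines, but treat the details as assumptions):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Defined once in a module visible to both server and client (e.g. connect common).
object Connect {
  val CONNECT_GRPC_BINDING_PORT = ConfigBuilder("spark.connect.grpc.binding.port")
    .version("3.4.0")
    .intConf
    .createWithDefault(15002)
}

// The client builder could then default to the shared constant instead of a literal:
//   private var _port: Int = Connect.CONNECT_GRPC_BINDING_PORT.defaultValue.get
```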
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39078: [SPARK-41534][CONNECT][SQL] Setup initial client module for Spark Connect
dongjoon-hyun commented on code in PR #39078: URL: https://github.com/apache/spark/pull/39078#discussion_r1051252600 ## project/SparkBuild.scala: ## @@ -1433,7 +1492,10 @@ object CopyDependencies { if (jar.getName.contains("spark-connect") && !SbtPomKeys.profiles.value.contains("noshade-connect")) { Files.copy(fid.toPath, destJar.toPath) - } else if (jar.getName.contains("spark-protobuf") && + } else if (jar.getName.contains("connect-client") && +!SbtPomKeys.profiles.value.contains("noshade-protobuf")) { +Files.copy(fidClient.toPath, destJar.toPath) + } else if (jar.getName.contains("spark-protobuf") && Review Comment: Please remove the extra space: `}  else` -> `} else`.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39078: [SPARK-41534][CONNECT][SQL] Setup initial client module for Spark Connect
dongjoon-hyun commented on code in PR #39078: URL: https://github.com/apache/spark/pull/39078#discussion_r1051251797 ## connector/connect/client/pom.xml: ## @@ -0,0 +1,112 @@ + + + +http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> +4.0.0 + +org.apache.spark +spark-parent_2.12 +3.4.0-SNAPSHOT +../../../pom.xml + + +spark-connect-client_2.12 +jar +Spark Project Connect Client +https://spark.apache.org/ + +connect-client +31.0.1-jre +1.0.1 +1.47.0 +6.0.53 + Review Comment: This seems to be copied from the `common` module. Could you try to avoid this kind of duplication? https://github.com/apache/spark/blob/f3be74e3a421224bb65a01e658de96006de2a4ac/connector/connect/common/pom.xml#L33-L39
[GitHub] [spark] anchovYu commented on a diff in pull request #39040: [SPARK-27561][SQL][FOLLOWUP] Support implicit lateral column alias resolution on Aggregate
anchovYu commented on code in PR #39040: URL: https://github.com/apache/spark/pull/39040#discussion_r1051251048 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -129,6 +163,45 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { child = Project(innerProjectList.toSeq, child) ) } + +case agg @ Aggregate(groupingExpressions, aggregateExpressions, _) if agg.resolved +&& aggregateExpressions.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) => + + val newAggExprs = collection.mutable.Set.empty[NamedExpression] + val expressionMap = collection.mutable.LinkedHashMap.empty[Expression, NamedExpression] + val projectExprs = aggregateExpressions.map { exp => +exp.transformDown { + case aggExpr: AggregateExpression => +// Doesn't support referencing a lateral alias in aggregate function +if (aggExpr.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) { + aggExpr.collectFirst { +case lcaRef: LateralColumnAliasReference => + throw QueryCompilationErrors.lateralColumnAliasInAggFuncUnsupportedError( +lcaRef.nameParts, aggExpr) + } +} +val ne = expressionMap.getOrElseUpdate(aggExpr.canonicalized, assignAlias(aggExpr)) +newAggExprs += ne +ne.toAttribute + case e if groupingExpressions.exists(_.semanticEquals(e)) => +// TODO one concern here, is condition here be able to match all grouping Review Comment: I was surprised to find that this existing query fails to analyze: ``` select 1 + dept + 10 from $testTable group by dept + 10 -- error: [MISSING_AGGREGATION] The non-aggregating expression "dept" is based on columns which are not participating in the GROUP BY clause ``` It seems that in our checkAnalysis we don't canonicalize the expressions before comparing them. The select expression is structured as (1 + dept) + 10, so it can't match the grouping expression (dept + 10).
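A minimal way to see the structural (rather than canonicalized) matching, assuming a table `t` with an integer column `dept`:

```scala
// Fails to analyze: the select expression is (1 + dept) + 10, which does not
// structurally match the grouping expression (dept + 10).
spark.sql("SELECT 1 + dept + 10 FROM t GROUP BY dept + 10")

// Analyzes fine: the GROUP BY repeats the identical expression tree.
spark.sql("SELECT 1 + dept + 10 FROM t GROUP BY 1 + dept + 10")
```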
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39078: [SPARK-41534][CONNECT][SQL] Setup initial client module for Spark Connect
dongjoon-hyun commented on code in PR #39078: URL: https://github.com/apache/spark/pull/39078#discussion_r1051250909 ## connector/connect/client/pom.xml: ## @@ -0,0 +1,112 @@ + + + +http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> +4.0.0 Review Comment: If you don't mind, please use 2-space indentation in `pom.xml` like the parent pom file. https://github.com/apache/spark/blob/f3be74e3a421224bb65a01e658de96006de2a4ac/pom.xml#L19-L27
[GitHub] [spark] techaddict commented on a diff in pull request #39093: [SPARK-41420][UI] Protobuf serializer for ApplicationInfoWrapper
techaddict commented on code in PR #39093: URL: https://github.com/apache/spark/pull/39093#discussion_r1051244979 ## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto: ## @@ -106,3 +106,28 @@ int64 stage_id = 40; int32 stage_attempt_id = 41; } + +message ApplicationAttemptInfo { + optional string attempt_id = 1; + int64 start_time = 2; + int64 end_time = 3; + int64 last_updated = 4; + int64 duration = 5; + string spark_user = 6; + bool completed = 7; + string app_spark_version = 8; +} + +message ApplicationInfo { + string id = 1; + string name = 2; + optional int32 cores_granted = 3; + optional int32 max_cores = 4; + optional int32 cores_per_executor = 5; + optional int32 memory_per_executor_mb = 6; + repeated ApplicationAttemptInfo attempts = 7; +} + +message ApplicationInfoWrapper { + ApplicationInfo info = 1; Review Comment: @mridulm there are two `ApplicationInfoWrapper` classes: one in `org.apache.spark.status` (this serializer is for that one) and another in `org.apache.spark.deploy.history`; the two have different signatures
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39090: [SPARK-41334][CONNECT][PYTHON] Move `SortOrder` proto from relations to expressions
dongjoon-hyun commented on code in PR #39090: URL: https://github.com/apache/spark/pull/39090#discussion_r1051243980 ## python/pyspark/sql/connect/column.py: ## @@ -437,35 +437,38 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression: class SortOrder(Expression): -def __init__(self, child: Expression, ascending: bool = True, nullsLast: bool = False) -> None: +def __init__(self, child: Expression, ascending: bool = True, nullsFirst: bool = True) -> None: super().__init__() self._child = child self._ascending = ascending -self._nullsLast = nullsLast +self._nullsFirst = nullsFirst def __repr__(self) -> str: return ( str(self._child) + (" ASC" if self._ascending else " DESC") -+ (" NULLS LAST" if self._nullsLast else " NULLS FIRST") ++ (" NULLS FIRST" if self._nullsFirst else " NULLS LAST") ) def to_plan(self, session: "SparkConnectClient") -> proto.Expression: -# TODO(SPARK-41334): move SortField from relations.proto to expressions.proto -sort = proto.Sort.SortField() -sort.expression.CopyFrom(self._child.to_plan(session)) +sort = proto.Expression() +sort.sort_order.child.CopyFrom(self._child.to_plan(session)) if self._ascending: -sort.direction = proto.Sort.SortDirection.SORT_DIRECTION_ASCENDING +sort.sort_order.direction = ( + proto.Expression.SortOrder.SortDirection.SORT_DIRECTION_ASCENDING +) else: -sort.direction = proto.Sort.SortDirection.SORT_DIRECTION_DESCENDING +sort.sort_order.direction = ( + proto.Expression.SortOrder.SortDirection.SORT_DIRECTION_DESCENDING +) -if self._nullsLast: -sort.nulls = proto.Sort.SortNulls.SORT_NULLS_LAST +if self._nullsFirst: +sort.sort_order.null_ordering = proto.Expression.SortOrder.NullOrdering.SORT_NULLS_FIRST else: -sort.nulls = proto.Sort.SortNulls.SORT_NULLS_FIRST +sort.sort_order.null_ordering = proto.Expression.SortOrder.NullOrdering.SORT_NULLS_LAST -return cast(proto.Expression, sort) Review Comment: This seems to cause `unused import` linter failure. Could you fix it? ``` ./python/pyspark/sql/connect/column.py:18:1: F401 'typing.cast' imported but unused from typing import ( F401 'typing.cast' imported but unused ```
[GitHub] [spark] dongjoon-hyun commented on pull request #39076: [SPARK-41530][CORE] Rename MedianHeap to PercentileMap and support percentile
dongjoon-hyun commented on PR #39076: URL: https://github.com/apache/spark/pull/39076#issuecomment-1355840094 Merged to master for Apache Spark 3.4. Thank you, @cloud-fan and @mridulm .
[GitHub] [spark] dongjoon-hyun closed pull request #39076: [SPARK-41530][CORE] Rename MedianHeap to PercentileMap and support percentile
dongjoon-hyun closed pull request #39076: [SPARK-41530][CORE] Rename MedianHeap to PercentileMap and support percentile URL: https://github.com/apache/spark/pull/39076
[GitHub] [spark] dongjoon-hyun commented on pull request #39054: [SPARK-27561][SQL][FOLLOWUP] Move the two rules for Later column alias into one file
dongjoon-hyun commented on PR #39054: URL: https://github.com/apache/spark/pull/39054#issuecomment-1355838611 Could you make a PR, @anchovYu ?
[GitHub] [spark] anchovYu commented on pull request #39054: [SPARK-27561][SQL][FOLLOWUP] Move the two rules for Later column alias into one file
anchovYu commented on PR #39054: URL: https://github.com/apache/spark/pull/39054#issuecomment-1355828789 This PR changed the code to use the resolver from the SimpleAnalyzer, which is always the case-sensitive one, instead of the one determined by the conf. This causes LCA to be unable to resolve aliases case-insensitively. I fixed this issue in the new PR https://github.com/apache/spark/pull/39040/commits/136a9308e623fee5b1303103e6397f96d8bb6788 by refactoring the code.
[GitHub] [spark] grundprinzip commented on a diff in pull request #39068: [SPARK-41434][CONNECT][PYTHON] Initial `LambdaFunction` implementation
grundprinzip commented on code in PR #39068: URL: https://github.com/apache/spark/pull/39068#discussion_r1051232035 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -565,6 +565,47 @@ class SparkConnectPlanner(session: SparkSession) { val children = fun.getArgumentsList.asScala.toSeq.map(transformExpression) Some(In(children.head, children.tail)) + case "___lambda_function___" => +// UnresolvedFunction[___lambda_function___, ["x, y -> x < y", "x", "y"]] Review Comment: This branch deserves its own function, please. ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -565,6 +565,47 @@ class SparkConnectPlanner(session: SparkSession) { val children = fun.getArgumentsList.asScala.toSeq.map(transformExpression) Some(In(children.head, children.tail)) + case "___lambda_function___" => +// UnresolvedFunction[___lambda_function___, ["x, y -> x < y", "x", "y"]] + +if (fun.getArgumentsCount < 2) { + throw InvalidPlanInput( +"LambdaFunction requires at least 2 child expressions: LamdaFunction, Arguments") +} + +val children = fun.getArgumentsList.asScala.toSeq.map(transformExpression) + +val function = children.head + +val variableNames = children.tail.map { + case variable: UnresolvedAttribute if variable.nameParts.length == 1 => +variable.nameParts.head + case other => +throw InvalidPlanInput( + "LambdaFunction requires all arguments to be UnresolvedAttribute with " + +s"single name part, but got $other") Review Comment: There are two interesting issues here: 1. When someone submits an expression to the API that does not transform into an `UnresolvedAttribute`, this would throw a weird error message about the name parts when actually the type does not match. 2. Why the restriction to single-part names? Is this a Spark limitation? ## python/pyspark/sql/connect/column.py: ## @@ -543,6 +543,39 @@ def __repr__(self) -> str: return f"({self._col} ({self._data_type}))" +class LambdaFunction(Expression): +def __init__( +self, +function: Expression, +arguments: Sequence[Expression], +) -> None: +super().__init__() + +assert isinstance(function, Expression) + +assert ( +isinstance(arguments, list) +and len(arguments) > 0 +and all(isinstance(arg, ColumnReference) for arg in arguments) +) Review Comment: Adding these assertions here is helpful in the Python client, but the server side does not do the same assertion. If we drop the assertion on `ColumnReference`, what would happen on the server? Is the analysis exception not better than the Python assertion? ## python/pyspark/sql/connect/functions.py: ## @@ -80,6 +84,80 @@ def _invoke_binary_math_function(name: str, col1: Any, col2: Any) -> Column: return _invoke_function(name, *_cols) +def _get_lambda_parameters(f: Callable) -> ValuesView[inspect.Parameter]: +signature = inspect.signature(f) +parameters = signature.parameters.values() + +# We should exclude functions that use +# variable args and keyword argnames +# as well as keyword only args +supported_parameter_types = { +inspect.Parameter.POSITIONAL_OR_KEYWORD, +inspect.Parameter.POSITIONAL_ONLY, +} + +# Validate that +# function arity is between 1 and 3 Review Comment: ```suggestion # Validate that the function arity is between 1 and 3. ``` ## python/pyspark/sql/connect/functions.py: ## @@ -80,6 +84,80 @@ def _invoke_binary_math_function(name: str, col1: Any, col2: Any) -> Column: return _invoke_function(name, *_cols) +def _get_lambda_parameters(f: Callable) -> ValuesView[inspect.Parameter]: +signature = inspect.signature(f) +parameters = signature.parameters.values() + +# We should exclude functions that use +# variable args and keyword argnames +# as well as keyword only args +supported_parameter_types = { +inspect.Parameter.POSITIONAL_OR_KEYWORD, +inspect.Parameter.POSITIONAL_ONLY, +} + +# Validate that +# function arity is between 1 and 3 Review Comment: one line? ## python/pyspark/sql/connect/functions.py: ## @@ -80,6 +84,80 @@ def _invoke_binary_math_function(name: str, col1: Any, col2: Any) -> Column: return _invoke_function(name, *_cols) +def _get_lambda_parameters(f: Callable) -> ValuesView[inspect.Parameter]: +signature = inspect.signature(f) +parameters = signature.parameters.values() + +# We should exclude functions that use +variable args and keyword
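On the single-name-part restriction raised above: one plausible reading is that a lambda variable must be a bare identifier, whereas a multi-part `UnresolvedAttribute` denotes a qualified column or nested field. An illustration of the distinction (behavior of `UnresolvedAttribute.apply` as understood here; treat it as an assumption):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

UnresolvedAttribute("x").nameParts    // Seq("x")      -> usable as a lambda variable
UnresolvedAttribute("s.f").nameParts  // Seq("s", "f") -> qualified name, rejected above
```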
[GitHub] [spark] fe2s opened a new pull request, #39099: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…
fe2s opened a new pull request, #39099: URL: https://github.com/apache/spark/pull/39099 …ore than 18 ### What changes were proposed in this pull request? Fix scaling of a `Decimal` that is stored internally as a compact long, when the scale is decreased by more than 18. For example, ``` Decimal(1, 38, 19).changePrecision(38, 0) ``` produces an exception ``` java.lang.ArrayIndexOutOfBoundsException: 19 at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377) at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328) ``` Another way to reproduce it is with a SQL query ``` sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as decimal(38,37)) as decimal(38,17)) from range(3)").show ``` The bug exists only for a Decimal that is stored using a compact long; it works fine for a Decimal that uses `scala.math.BigDecimal` internally. ### Why are the changes needed? The SQL query mentioned above cannot be executed. Please note, for my use case the SQL query is generated programmatically, so I cannot optimize it manually. ### Does this PR introduce _any_ user-facing change? Yes, it will allow scaling Decimals properly, which is not currently possible due to the exception. ### How was this patch tested? Tests were added. The fix affects the scale decrease only, but I decided to also include tests for the scale increase, as I didn't find any existing ones.
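A hedged sketch of the failure mode, not of the PR's fix: the compact-long path rescales through a powers-of-ten lookup table covering exponents 0 through 18, so decreasing the scale by 19 in one step indexes past the end of the table, matching the `ArrayIndexOutOfBoundsException: 19` shown above.

```scala
// Illustrative only; mirrors the shape of Decimal's internal table.
val POW_10: Array[Long] = Array.tabulate(19)(i => BigInt(10).pow(i).toLong)

val oldScale = 19
val newScale = 0
val diff = oldScale - newScale // 19
// POW_10(diff)                // ArrayIndexOutOfBoundsException: 19
```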
[GitHub] [spark] anchovYu commented on pull request #39040: [SPARK-27561][SQL][FOLLOWUP] Support implicit lateral column alias resolution on Aggregate
anchovYu commented on PR #39040: URL: https://github.com/apache/spark/pull/39040#issuecomment-1355792680 I will shortly address the remaining TODOs in this PR.
[GitHub] [spark] grundprinzip commented on a diff in pull request #38984: [SPARK-41349][CONNECT][PYTHON] Implement DataFrame.hint
grundprinzip commented on code in PR #38984: URL: https://github.com/apache/spark/pull/38984#discussion_r1051222085 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -829,6 +829,13 @@ def test_with_columns(self): .toPandas(), ) +def test_hint(self): +# SPARK-41349: Test hint Review Comment: Please add additional tests for: * unsupported param types * unsupported hint name * invalid combination of hint and param Please check if there is additional coverage needed. ## python/pyspark/sql/connect/plan.py: ## @@ -343,6 +343,51 @@ def _repr_html_(self) -> str: """ +class Hint(LogicalPlan): +"""Logical plan object for a Hint operation.""" + +def __init__(self, child: Optional["LogicalPlan"], name: str, params: List[Any]) -> None: +super().__init__(child) +self.name = name +self.params = params + +def _convert_value(self, v: Any) -> proto.Expression.Literal: +value = proto.Expression.Literal() +if v is None: +value.null = True +elif isinstance(v, int): +value.integer = v +else: +value.string = v +return value Review Comment: This code has weird error behavior if v is not `None`, `int`, or `str`. If, for example, I were to assign a float, I would receive an error message from protobuf that is not actionable for the user. I think it would be good to either reuse the existing Python-to-Literal conversion code that we have or throw an exception. ## python/pyspark/sql/connect/dataframe.py: ## @@ -875,6 +875,30 @@ def to_jcols( melt = unpivot +def hint(self, name: str, *params: Any) -> "DataFrame": +""" +Specifies some hint on the current DataFrame. As an example, the following code specifies +that one of the plan can be broadcasted: `df1.join(df2.hint("broadcast"))` + +.. versionadded:: 3.4.0 + +Parameters +-- +name: str +the name of the hint, for example, "broadcast", "SHUFFLE_MERGE" and "shuffle_hash". +params: tuple +the parameters of the hint Review Comment: I know that the documentation is most likely directly from PySpark, but I'm wondering if we can add more context around what types the params can have. If I read through the code, it can be `Any` here but later only `Optional[Union[str, int]]`?
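To make the fail-fast alternative concrete, here is a sketch of the same conversion written in Scala against the generated protobuf builder; the builder method names are inferred from the proto fields used above and are assumptions:

```scala
def convertHintParameter(v: Any): proto.Expression.Literal = {
  val builder = proto.Expression.Literal.newBuilder()
  v match {
    case null      => builder.setNull(true)
    case i: Int    => builder.setInteger(i)
    case s: String => builder.setString(s)
    case other =>
      // Fail with an actionable message instead of an opaque protobuf type error.
      throw new IllegalArgumentException(
        s"Unsupported hint parameter type: ${other.getClass.getName}")
  }
  builder.build()
}
```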
[GitHub] [spark] mridulm commented on a diff in pull request #39093: [SPARK-41420][UI] Protobuf serializer for ApplicationInfoWrapper
mridulm commented on code in PR #39093: URL: https://github.com/apache/spark/pull/39093#discussion_r1051020503 ## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto: ## @@ -106,3 +106,28 @@ int64 stage_id = 40; int32 stage_attempt_id = 41; } + +message ApplicationAttemptInfo { + optional string attempt_id = 1; + int64 start_time = 2; + int64 end_time = 3; + int64 last_updated = 4; + int64 duration = 5; + string spark_user = 6; + bool completed = 7; + string app_spark_version = 8; +} + +message ApplicationInfo { + string id = 1; + string name = 2; + optional int32 cores_granted = 3; + optional int32 max_cores = 4; + optional int32 cores_per_executor = 5; + optional int32 memory_per_executor_mb = 6; + repeated ApplicationAttemptInfo attempts = 7; +} + +message ApplicationInfoWrapper { + ApplicationInfo info = 1; Review Comment: Missing `AttemptInfoWrapper`?
[GitHub] [spark] anchovYu commented on a diff in pull request #39040: [SPARK-27561][SQL][FOLLOWUP] Support implicit lateral column alias resolution on Aggregate
anchovYu commented on code in PR #39040: URL: https://github.com/apache/spark/pull/39040#discussion_r1051207085 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -129,6 +163,45 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { child = Project(innerProjectList.toSeq, child) ) } + +case agg @ Aggregate(groupingExpressions, aggregateExpressions, _) if agg.resolved +&& aggregateExpressions.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) => + + val newAggExprs = collection.mutable.Set.empty[NamedExpression] + val expressionMap = collection.mutable.LinkedHashMap.empty[Expression, NamedExpression] + val projectExprs = aggregateExpressions.map { exp => +exp.transformDown { + case aggExpr: AggregateExpression => +// Doesn't support referencing a lateral alias in aggregate function +if (aggExpr.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) { + aggExpr.collectFirst { +case lcaRef: LateralColumnAliasReference => + throw QueryCompilationErrors.lateralColumnAliasInAggFuncUnsupportedError( +lcaRef.nameParts, aggExpr) + } +} +val ne = expressionMap.getOrElseUpdate(aggExpr.canonicalized, assignAlias(aggExpr)) +newAggExprs += ne Review Comment: Yes, we do. This is the general strategy of this design: extract **all** aggregate functions and grouping expressions. It keeps things simple and avoids many extra code-path considerations (e.g., marking whether an alias is used as a lateral alias by later expressions).
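A conceptual illustration of that strategy (simplified plan shapes, not the actual rule output):

```scala
// SELECT dept, avg(salary) AS a, a + 1 FROM t GROUP BY dept
//
// All aggregate functions and grouping expressions are kept in the Aggregate,
// and the lateral alias reference is evaluated in a Project on top:
//
// Project [dept, a, (a + 1)]
// +- Aggregate [dept], [dept, avg(salary) AS a]
```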
[GitHub] [spark] techaddict commented on pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
techaddict commented on PR #39096: URL: https://github.com/apache/spark/pull/39096#issuecomment-1355722437 @gengliangwang Thanks for the review, addressed the comments.
[GitHub] [spark] dongjoon-hyun commented on pull request #38983: [SPARK-41447][CORE] Reduce the number of `doMergeApplicationListing` invocations
dongjoon-hyun commented on PR #38983: URL: https://github.com/apache/spark/pull/38983#issuecomment-1355707241 @shuyouZZ . I added you to the Apache Spark contributor group and assigned SPARK-41447 to you. Welcome to the Apache Spark community!
[GitHub] [spark] dongjoon-hyun closed pull request #38983: [SPARK-41447][CORE] Reduce the number of `doMergeApplicationListing` invocations
dongjoon-hyun closed pull request #38983: [SPARK-41447][CORE] Reduce the number of `doMergeApplicationListing` invocations URL: https://github.com/apache/spark/pull/38983
[GitHub] [spark] gengliangwang commented on pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
gengliangwang commented on PR #39096: URL: https://github.com/apache/spark/pull/39096#issuecomment-1355701397 @techaddict LGTM except for two minor comments. Thanks for working on it.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
gengliangwang commented on code in PR #39096: URL: https://github.com/apache/spark/pull/39096#discussion_r1051192373 ## core/src/test/scala/org/apache/spark/status/protobuf/KVStoreProtobufSerializerSuite.scala: ## @@ -176,4 +177,83 @@ class KVStoreProtobufSerializerSuite extends SparkFunSuite { assert(result.stageId == input.stageId) assert(result.stageAttemptId == input.stageAttemptId) } + + test("Application Environment Info") { +val input = new ApplicationEnvironmentInfoWrapper( + new ApplicationEnvironmentInfo( +runtime = new RuntimeInfo( + javaVersion = "1.8", + javaHome = "/tmp/java", + scalaVersion = "2.13"), +sparkProperties = Seq(("1", "2")), +hadoopProperties = Seq(("1", "2")), Review Comment: I would suggest setting different values for these fields, in case there are mistakes in the fields of the serializer and deserializer.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
gengliangwang commented on code in PR #39096: URL: https://github.com/apache/spark/pull/39096#discussion_r1051188448 ## core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto: ## @@ -106,3 +106,45 @@ message TaskDataWrapper { int64 stage_id = 40; int32 stage_attempt_id = 41; } + +message ExecutorResourceRequest { + string resource_name = 1; + int64 amount = 2; + string discoveryScript = 3; + string vendor = 4; +} + +message TaskResourceRequest { + string resource_name = 1; + double amount = 2; +} + +message ResourceProfileInfo { + int32 id = 1; + map<string, ExecutorResourceRequest> executor_resources = 2; + map<string, TaskResourceRequest> task_resources = 3; +} + +message RuntimeInfo { + string java_version = 1; + string java_home = 2; + string scala_version = 3; +} + +message ApplicationEnvironmentInfo { + message PairSS { Review Comment: Shall we name it `PairStrings` and move it out of `ApplicationEnvironmentInfo`? In the future this might be reused.
[GitHub] [spark] tomvanbussel commented on pull request #38941: [SPARK-41498] Propagate metadata through Union
tomvanbussel commented on PR #38941: URL: https://github.com/apache/spark/pull/38941#issuecomment-1355603538 @cloud-fan Could you help us find someone to take a look at this?
[GitHub] [spark] dongjoon-hyun closed pull request #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1
dongjoon-hyun closed pull request #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1 URL: https://github.com/apache/spark/pull/39094
[GitHub] [spark] dongjoon-hyun commented on pull request #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1
dongjoon-hyun commented on PR #39094: URL: https://github.com/apache/spark/pull/39094#issuecomment-1355597529 Merged to master for Apache Spark 3.4.0.
[GitHub] [spark] bjornjorgensen opened a new pull request, #39098: [SPARK-41553][PS] Change `num_files` to `repartition`
bjornjorgensen opened a new pull request, #39098: URL: https://github.com/apache/spark/pull/39098 ### What changes were proposed in this pull request? Change `num_files` to `repartition`. ### Why are the changes needed? `num_files` has been deprecated and might be removed in a future version; the deprecation warning says "Use `DataFrame.spark.repartition` instead." ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA
[GitHub] [spark] NarekDW opened a new pull request, #39097: [SPARK-41049][SQL] make to_csv function deterministic
NarekDW opened a new pull request, #39097: URL: https://github.com/apache/spark/pull/39097 ### What changes were proposed in this pull request? This PR enhances the `StructsToCsv` class with a `doGenCode` function instead of extending it from the `CodegenFallback` trait, in order to make it deterministic. Example: ```scala import org.apache.spark.sql._ import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ val sparkSession = SparkSession.builder().getOrCreate() val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") val v1 = rand().*(lit(1)).cast(IntegerType) val v2 = to_csv(struct(v1.as("a"))) df.select(v1, v1, v2, v2).show() ``` Before these changes, the result was something like: ```scala +----+----+----+----+ | a| b| c| d| +----+----+----+----+ | 922| 922| 922|2028| |5571|5571|5571|1640| |5612|5612|5612|769 | |2068|2068|2068|8924| |5755|5755|5755|2731| +----+----+----+----+ ``` With the current changes, the result looks like: ```scala +----+----+----+----+ | a| b| c| d| +----+----+----+----+ | 922| 922| 922| 922| |5571|5571|5571|5571| |5612|5612|5612|5612| |2068|2068|2068|2068| |5755|5755|5755|5755| +----+----+----+----+ ``` ### Why are the changes needed? To make the to_csv function deterministic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? An additional test case was added to [CsvFunctionsSuite](https://github.com/NarekDW/spark/blob/13ab259815ad2b9351010df15ab23aacfa7ee14e/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala) ```scala test("SPARK-41049: make to_csv function deterministic") { ... ```
[GitHub] [spark] techaddict opened a new pull request, #39096: [SPARK-41421][UI] Protobuf serializer for ApplicationEnvironmentInfoWrapper
techaddict opened a new pull request, #39096: URL: https://github.com/apache/spark/pull/39096 ### What changes were proposed in this pull request? Add Protobuf serializer for ApplicationEnvironmentInfoWrapper ### Why are the changes needed? Support fast and compact serialization/deserialization for ApplicationEnvironmentInfoWrapper over RocksDB. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT
[GitHub] [spark] dongjoon-hyun commented on pull request #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1
dongjoon-hyun commented on PR #39094: URL: https://github.com/apache/spark/pull/39094#issuecomment-1355493567 Thank you, @viirya and @bjornjorgensen .
[GitHub] [spark] dongjoon-hyun commented on pull request #39087: [SPARK-41365][UI][3.3] Stages UI page fails to load for proxy in specific yarn environment
dongjoon-hyun commented on PR #39087: URL: https://github.com/apache/spark/pull/39087#issuecomment-1355492859 I added you to the Apache Spark contributor group and assigned SPARK-41365 to you. Welcome to the Apache Spark community, @yabola .
[GitHub] [spark] dongjoon-hyun closed pull request #39087: [SPARK-41365][UI][3.3] Stages UI page fails to load for proxy in specific yarn environment
dongjoon-hyun closed pull request #39087: [SPARK-41365][UI][3.3] Stages UI page fails to load for proxy in specific yarn environment URL: https://github.com/apache/spark/pull/39087
[GitHub] [spark] amaliujia commented on a diff in pull request #39078: [SPARK-41534][CONNECT][SQL] Setup initial client module for Spark Connect
amaliujia commented on code in PR #39078: URL: https://github.com/apache/spark/pull/39078#discussion_r1051078343 ## connector/connect/client/src/main/scala/org/apache/spark/sql/connect/client/SparkSession.scala: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connect.client + +import scala.language.existentials + +import io.grpc.{ManagedChannel, ManagedChannelBuilder} +import org.apache.arrow.memory.RootAllocator + +import org.apache.spark.SPARK_VERSION +import org.apache.spark.connect.proto + + +class SparkSession( +private val userContext: proto.UserContext, +private val channel: ManagedChannel) + extends AutoCloseable { + private[this] val stub = proto.SparkConnectServiceGrpc.newBlockingStub(channel) + + private[this] val allocator = new RootAllocator() + + /** + * The version of Spark on which this application is running. + */ + def version: String = SPARK_VERSION + + /** + * Returns a `DataFrame` with no rows or columns. + * + * @since 3.4.0 + */ + @transient + lazy val emptyDataFrame: Dataset = newDataset { builder => +builder.getLocalRelationBuilder + } + + /** + * Creates a [[Dataset]] with a single `LongType` column named `id`, containing elements + * in a range from `start` to `end` (exclusive) with a step value, with partition number + * specified. + * + * @since 2.0.0 Review Comment: The client starts from 3.4.0?
[GitHub] [spark] amaliujia commented on a diff in pull request #39084: [SPARK-41464][CONNECT][PYTHON] Implement `DataFrame.to`
amaliujia commented on code in PR #39084: URL: https://github.com/apache/spark/pull/39084#discussion_r1051076834 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -270,6 +271,24 @@ class SparkConnectPlanner(session: SparkSession) { .logicalPlan } + + private def transformToSchema(rel: proto.ToSchema): LogicalPlan = { +val schemaType = if (rel.hasDatatype) { + DataTypeProtoConverter.toCatalystType(rel.getDatatype) +} else { + parseDatatypeString(rel.getDatatypeStr) +} + +val schemaStruct = schemaType match { + case s: StructType => s + case d => StructType(Seq(StructField("value", d))) Review Comment: So it is not guaranteed that the client side always sends a StructType as the schema when it is not a string?
[GitHub] [spark] amaliujia commented on a diff in pull request #39084: [SPARK-41464][CONNECT][PYTHON] Implement `DataFrame.to`
amaliujia commented on code in PR #39084: URL: https://github.com/apache/spark/pull/39084#discussion_r1051075228 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -601,3 +602,19 @@ message Unpivot { // (Required) Name of the value column. string value_column_name = 5; } + +message ToSchema { + // (Required) The input relation. + Relation input = 1; + + // (Required) The user provided schema. + // + // The Server side will update the dataframe with this schema. + oneof schema { Review Comment: This requires the client side to implement more. For example, the client side would always convert a StructType to its string representation and then Spark would convert it back. This basically asks clients to understand Spark's string-based schema representation and implement it themselves, right?
[GitHub] [spark] amaliujia commented on a diff in pull request #39084: [SPARK-41464][CONNECT][PYTHON] Implement `DataFrame.to`
amaliujia commented on code in PR #39084: URL: https://github.com/apache/spark/pull/39084#discussion_r1051074565 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -601,3 +602,19 @@ message Unpivot { // (Required) Name of the value column. string value_column_name = 5; } + +message ToSchema { + // (Required) The input relation. + Relation input = 1; + + // (Required) The user provided schema. + // + // The Sever side will update the dataframe with this schema. + oneof schema { + +DataType datatype = 2; Review Comment: Maybe comment here to say this must be a StructType? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #39084: [SPARK-41464][CONNECT][PYTHON] Implement `DataFrame.to`
amaliujia commented on code in PR #39084: URL: https://github.com/apache/spark/pull/39084#discussion_r1051074168 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1631,6 +1632,47 @@ def inputFiles(self) -> List[str]: query = self._plan.to_proto(self._session.client) return self._session.client._analyze(query).input_files +def to(self, schema: Union[DataType, str]) -> "DataFrame": Review Comment: yes we should have StructType in `pyspark_types_to_proto_types`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bjornjorgensen commented on pull request #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1
bjornjorgensen commented on PR #39094: URL: https://github.com/apache/spark/pull/39094#issuecomment-1355444859 Yes, then I can close this one https://github.com/bjornjorgensen/spark/pull/94 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
amaliujia commented on code in PR #39091: URL: https://github.com/apache/spark/pull/39091#discussion_r1051070657 ## sql/core/src/main/scala/org/apache/spark/sql/Observation.scala: ## @@ -45,7 +45,7 @@ import org.apache.spark.sql.util.QueryExecutionListener * @param name name of the metric * @since 3.3.0 */ -class Observation(name: String) { +class Observation(val name: String) { Review Comment: ah without `val`, `name` is treated as a method. Nice catch on this. ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -126,6 +126,20 @@ package object dsl { Expression.UnresolvedFunction.newBuilder().setFunctionName("min").addArguments(e)) .build() +def proto_max(e: Expression): Expression = Review Comment: Same for below -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
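A small standalone Scala example of the distinction (not Spark code): without `val`, `name` is only a constructor parameter and cannot be read on instances; adding `val` turns it into a public member, which is what the Connect code path needs here.

```scala
class WithoutVal(name: String)       // `name` is visible only inside the class body
class WithVal(val name: String)      // `name` becomes a readable member

object ValDemo extends App {
  // new WithoutVal("metric").name   // does not compile: value name is not a member of WithoutVal
  println(new WithVal("metric").name)  // prints "metric"
}
```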
[GitHub] [spark] amaliujia commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
amaliujia commented on code in PR #39091: URL: https://github.com/apache/spark/pull/39091#discussion_r1051069651 ## python/pyspark/sql/connect/dataframe.py: ## @@ -891,6 +892,75 @@ def to_jcols( melt = unpivot +def observe( +self, +observation: Union["Observation", str], +*exprs: Column, +) -> "DataFrame": +"""Define (named) metrics to observe on the DataFrame. This method returns an 'observed' +DataFrame that returns the same result as the input, with the following guarantees: + +* It will compute the defined aggregates (metrics) on all the data that is flowing through +the Dataset at that point. + +* It will report the value of the defined aggregate columns as soon as we reach a completion +point. A completion point is either the end of a query (batch mode) or the end of a +streaming epoch. The value of the aggregates only reflects the data processed since +the previous completion point. + +The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or +more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that +contain references to the input Dataset's columns must always be wrapped in an aggregate +function. + +A user can observe these metrics by adding +Python's :class:`~pyspark.sql.streaming.StreamingQueryListener`, +Scala/Java's ``org.apache.spark.sql.streaming.StreamingQueryListener`` or Scala/Java's +``org.apache.spark.sql.util.QueryExecutionListener`` to the spark session. + +.. versionadded:: 3.3.0 Review Comment: 3.4.0? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
amaliujia commented on code in PR #39091: URL: https://github.com/apache/spark/pull/39091#discussion_r1051068680 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -126,6 +126,20 @@ package object dsl { Expression.UnresolvedFunction.newBuilder().setFunctionName("min").addArguments(e)) .build() +def proto_max(e: Expression): Expression = Review Comment: Is there a need to add the `proto_` prefix? Just call it `max`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1
dongjoon-hyun commented on PR #39094: URL: https://github.com/apache/spark/pull/39094#issuecomment-1355430439 cc @panbingkun , @Yikun , @viirya -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk opened a new pull request, #39095: [WIP][SQL] Add the error class `UNRESOLVED_ROUTINE`
MaxGekk opened a new pull request, #39095: URL: https://github.com/apache/spark/pull/39095 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #39065: [SPARK-41521][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.0
dongjoon-hyun commented on PR #39065: URL: https://github.com/apache/spark/pull/39065#issuecomment-1355403342 The upstream released a fixed version. - https://github.com/fabric8io/kubernetes-client/releases/tag/v6.3.1 I made a PR for that. - https://github.com/apache/spark/pull/39094 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #39094: [SPARK-41552][BUILD][K8S] Upgrade `kubernetes-client` to 6.3.1
dongjoon-hyun opened a new pull request, #39094: URL: https://github.com/apache/spark/pull/39094 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on a diff in pull request #38864: [SPARK-41271][SQL] Support parameterized SQL queries by `sql()`
srielau commented on code in PR #38864: URL: https://github.com/apache/spark/pull/38864#discussion_r1051043683 ## core/src/main/resources/error/error-classes.json: ## @@ -795,6 +795,11 @@ } } }, + "INVALID_SQL_ARG" : { +"message" : [ + "The argument of `sql()` is invalid. Consider to replace it by a SQL literal statement." Review Comment: That's an expression then. Statements are top level. (Like SELECT, UPDATE, CREATE, SET). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] juliuszsompolski commented on a diff in pull request #38941: [SPARK-41498] Propagate metadata through Union
juliuszsompolski commented on code in PR #38941: URL: https://github.com/apache/spark/pull/38941#discussion_r1051033080 ## sql/core/src/test/scala/org/apache/spark/sql/connector/MetadataColumnSuite.scala: ## @@ -232,4 +237,175 @@ class MetadataColumnSuite extends DatasourceV2SQLBase { ) } } + + test("SPARK-41498: Metadata column is propagated through union") { Review Comment: could we also add a positive example with more complicated columns? e.g. selecting _partition and _metadata; or projecting _metadata to only select file_name so that nested column pruning kicks in? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
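A hypothetical shape for the positive case being asked for, projecting only one nested field of the built-in file `_metadata` column so that nested column pruning applies after the union. This is illustrative only; the actual test added to the PR may instead use the suite's DSv2 tables and `_partition`.

```scala
test("SPARK-41498: _metadata.file_name is propagated through union") {
  withTempPath { dir =>
    spark.range(10).write.parquet(dir.getCanonicalPath)
    val df = spark.read.parquet(dir.getCanonicalPath)
    // Select only one nested field of _metadata so nested column pruning kicks in.
    val unioned = df.union(df).select("id", "_metadata.file_name")
    assert(unioned.columns.toSeq === Seq("id", "file_name"))
    assert(unioned.count() === 20)
  }
}
```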
[GitHub] [spark] anchovYu commented on pull request #39040: [SPARK-27561][SQL][FOLLOWUP] Support implicit lateral column alias resolution on Aggregate
anchovYu commented on PR #39040: URL: https://github.com/apache/spark/pull/39040#issuecomment-1355358289 @gengliangwang Sure, will do today. FYI, I was sick yesterday. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
hvanhovell commented on PR #39091: URL: https://github.com/apache/spark/pull/39091#issuecomment-1355312686 @beliefer thanks for working on this. I have one question: how are we going to get the observed metrics to the client? This seems to be missing from the implementation. One approach would be to send them in a similar way to the metrics in the result code path. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
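For reference, this is how the existing (non-Connect) Scala API surfaces the observed metrics to the caller today; whatever transport the Connect implementation chooses, the client would need an equivalent way to fill in `Observation.get` once the action finishes. A minimal local sketch:

```scala
import org.apache.spark.sql.{Observation, SparkSession}
import org.apache.spark.sql.functions._

object ObserveDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("observe-demo").getOrCreate()

  val observation = Observation("stats")
  val observed = spark.range(100)
    .observe(observation, count(lit(1)).as("rows"), max(col("id")).as("max_id"))

  observed.collect()         // metrics become available only after an action completes
  println(observation.get)   // e.g. Map(rows -> 100, max_id -> 99)

  spark.stop()
}
```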
[GitHub] [spark] hvanhovell commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement DataFrame.observe
hvanhovell commented on code in PR #39091: URL: https://github.com/apache/spark/pull/39091#discussion_r1051008772 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -338,6 +340,22 @@ class SparkConnectPlanner(session: SparkSession) { } } + private def transformCollectMetrics(rel: proto.CollectMetrics): LogicalPlan = { +val metrics = rel.getMetricsList.asScala.map { expr => + Column(transformExpression(expr)) +} + +if (rel.getIsObservation) { Review Comment: What is the difference between the code paths? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #39092: [SPARK-41548][CONNECT][TESTS] Disable ANSI mode in pyspark.sql.tests.connect.test_connect_functions
dongjoon-hyun closed pull request #39092: [SPARK-41548][CONNECT][TESTS] Disable ANSI mode in pyspark.sql.tests.connect.test_connect_functions URL: https://github.com/apache/spark/pull/39092 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #38861: [SPARK-41294][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1203 / 1168
MaxGekk commented on code in PR #38861: URL: https://github.com/apache/spark/pull/38861#discussion_r1050943156 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -2018,15 +2007,29 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase { "tableColumns" -> expected.map(c => s"'${c.name}'").mkString(", "), "dataColumns" -> query.output.map(c => s"'${c.name}'").mkString(", "))) } + + def numColumnsMismatchError( + operator: LogicalPlan, + firstNumColumns: Int, + invalidOrdinalNum: Int, + invalidNumColumns: Int): Throwable = { + +def ordinalNumber(i: Int): String = i match { + case 0 => "first" + case 1 => "second" + case 2 => "third" + case i => s"${i + 1}th" +} - def cannotWriteNotEnoughColumnsToTableError( - tableName: String, expected: Seq[Attribute], query: LogicalPlan): Throwable = { new AnalysisException( - errorClass = "_LEGACY_ERROR_TEMP_1203", + errorClass = "NUM_COLUMNS_MISMATCH", messageParameters = Map( -"tableName" -> tableName, -"tableColumns" -> expected.map(c => s"'${c.name}'").mkString(", "), -"dataColumns" -> query.output.map(c => s"'${c.name}'").mkString(", "))) +"operator" -> toSQLStmt(operator.nodeName), Review Comment: Maybe, we should introduce `prettyName()` similar to `prettyName` in `Expression`. @cloud-fan @HyukjinKwon WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
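For context, a hedged sketch of the kind of query the renamed error class covers; the operator name rendered into the message is exactly what is being debated above, so the wording in the comments is only indicative, not the final message text.

```scala
import org.apache.spark.sql.SparkSession

object NumColumnsMismatchDemo extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  // A UNION whose branches have different column counts should raise NUM_COLUMNS_MISMATCH.
  try {
    spark.sql("SELECT 1 AS a UNION ALL SELECT 1 AS a, 2 AS b").collect()
  } catch {
    case e: org.apache.spark.sql.AnalysisException => println(e.getMessage)
  }
  spark.stop()
}
```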
[GitHub] [spark] techaddict opened a new pull request, #39093: [SPARK-41420][UI] Protobuf serializer for ApplicationInfoWrapper
techaddict opened a new pull request, #39093: URL: https://github.com/apache/spark/pull/39093 **What changes were proposed in this pull request?** Add Protobuf serializer for ApplicationInfoWrapper **Why are the changes needed?** Support fast and compact serialization/deserialization for ApplicationInfoWrapper over RocksDB. **Does this PR introduce any user-facing change?** No **How was this patch tested?** New UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] roczei commented on pull request #38828: [SPARK-35084][CORE] Spark 3: supporting --packages in k8s cluster mode
roczei commented on PR #38828: URL: https://github.com/apache/spark/pull/38828#issuecomment-1354933514 @holdenk, @HyukjinKwon, @dongjoon-hyun Could you please take a look when you have some time? This fixes a k8s --packages issue that has been present since Spark 3.0.0, and it would be nice to solve it. Here you can see the old branch-3.0, where the conditional code of the "if" / "else if" / "else" is equivalent to the latest master version: https://github.com/apache/spark/blob/branch-3.0/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L316-L328 vs. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L317-L330 These conditions were restructured a bit by @ocworld. I have already added my test results above. The fix works. This K8S PR is a follow-up to #32397, which was closed by github-actions because it hadn't been updated in a while and had no unit test. The requested unit test has now been added, so we just need someone from the Spark committer team to review it again. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org