[GitHub] [spark] williamhyun opened a new pull request, #38724: [SPARK-41202][BUILD] Update ORC to 1.7.7
williamhyun opened a new pull request, #38724: URL: https://github.com/apache/spark/pull/38724 ### What changes were proposed in this pull request? This PR aims to update ORC to 1.7.7. ### Why are the changes needed? This will bring the latest bug fixes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version.
AmplabJenkins commented on PR #38693: URL: https://github.com/apache/spark/pull/38693#issuecomment-1320825239 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
grundprinzip commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1027046810 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -271,8 +273,12 @@ class SparkConnectPlanner(session: SparkSession) { } private def transformLocalRelation(rel: proto.LocalRelation): LogicalPlan = { -val attributes = rel.getAttributesList.asScala.map(transformAttribute(_)).toSeq -new org.apache.spark.sql.catalyst.plans.logical.LocalRelation(attributes) +val (rows, structType) = ArrowConverters.fromBatchWithSchemaIterator( Review Comment: This looks very good. ## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala: ## @@ -354,4 +365,16 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest { transform(proto.Relation.newBuilder.setSetOp(intersect).build())) assert(e2.getMessage.contains("Intersect does not support union_by_name")) } + + test("transform LocalRelation") { +val inputRows = (0 until 10).map(InternalRow(_)) Review Comment: The test here is kind of bare bones. Before we fully approve the PR we need to extend the test coverage a bit. ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala: ## @@ -76,21 +72,26 @@ private[sql] object ArrowConverters extends Logging { schema: StructType, maxRecordsPerBatch: Long, timeZoneId: String, - context: TaskContext) extends Iterator[Array[Byte]] { + context: TaskContext) + extends Iterator[Array[Byte]] { Review Comment: why these changes? ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -271,9 +271,7 @@ message Deduplicate { // A relation that does not need to be qualified by name. message LocalRelation { - // (Optional) A list qualified attributes. - repeated Expression.QualifiedAttribute attributes = 1; - // TODO: support local data. + bytes data = 1; Review Comment: Please add a comment mentioning that the data is stored as arrow IPC message streams and that since the ipc streams contain the schema we don't need to qualify it. ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala: ## @@ -37,10 +34,9 @@ import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow} import org.apache.spark.sql.catalyst.plans.logical.LocalRelation import org.apache.spark.sql.types._ import org.apache.spark.sql.util.ArrowUtils -import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector} +import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch} Review Comment: afaik this change should break scala Style as CB is before CV -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078
MaxGekk closed pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078 URL: https://github.com/apache/spark/pull/38696 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078
MaxGekk commented on PR #38696: URL: https://github.com/apache/spark/pull/38696#issuecomment-1320812183 +1, LGTM. Merging to master. Thank you, @panbingkun and @srielau for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
amaliujia commented on PR #38723: URL: https://github.com/apache/spark/pull/38723#issuecomment-1320811915 @zhengruifeng @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
amaliujia opened a new pull request, #38723: URL: https://github.com/apache/spark/pull/38723 ### What changes were proposed in this pull request? Implement `DataFrame.SelectExpr` in Python client. `SelectExpr` also has a good amount of usage. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078
AmplabJenkins commented on PR #38696: URL: https://github.com/apache/spark/pull/38696#issuecomment-1320808013 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WangGuangxin opened a new pull request, #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY
WangGuangxin opened a new pull request, #38722: URL: https://github.com/apache/spark/pull/38722 ### What changes were proposed in this pull request? In BytesToBytesMap, the longArray size can be up to `MAX_CAPACITY` instead `MAX_CAPACITY/2` since `MAX_CAPACITY` already take `two array entries per key` into account. ### Why are the changes needed? To handle larger dataset in BytestoBytesMap ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existings UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate
wangyum commented on PR #38682: URL: https://github.com/apache/spark/pull/38682#issuecomment-1320804263 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun opened a new pull request, #38721: [WIP][SPARK-41172][SQL] Migrate the ambiguous ref error to an error class
panbingkun opened a new pull request, #38721: URL: https://github.com/apache/spark/pull/38721 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on pull request #38699: [SPARK-41188][CORE][ML] Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
WeichenXu123 commented on PR #38699: URL: https://github.com/apache/spark/pull/38699#issuecomment-1320790862 > If we are setting it in `SparkContext`, do we want to get rid of this from other places like `PythonRunner.compute` ? I think we can remove code in PythonRunner.compute -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya closed pull request #38716: [SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch
viirya closed pull request #38716: [SPARK-X][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch URL: https://github.com/apache/spark/pull/38716 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1320737666 Merge to master, @HyukjinKwon @harupy Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun closed pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun closed pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest URL: https://github.com/apache/spark/pull/38698 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on PR #38682: URL: https://github.com/apache/spark/pull/38682#issuecomment-1320735591 @wankunde Please fix the PR title and description. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1027001153 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala: ## @@ -117,4 +117,29 @@ object ExprUtils extends QueryErrorsBase { TypeCheckSuccess } } + + /** + * Combine a number of boolean expressions into a balanced expression tree. These expressions are + * either combined by a logical [[And]] or a logical [[Or]]. + * + * A balanced binary tree is created because regular left recursive trees cause considerable + * performance degradations and can cause stack overflows. + */ + def reduceToExpressionTree( + expressions: Seq[Expression], + expressionCombiner: (Expression, Expression) => Expression): Expression = { +assert(expressions.size > 0) Review Comment: Remove this `assert`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used
viirya commented on code in PR #38719: URL: https://github.com/apache/spark/pull/38719#discussion_r1026999005 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -345,7 +345,14 @@ trait ProgressReporter extends Logging { val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves() if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) { val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap { - case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep -> source } + case (_, ep: MicroBatchScanExec) => +// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder instance for DSv2 +// streaming source, hence we cannot lookup the actual source from the map. +// The physical node for DSv2 streaming source contains the information of the source +// by itself, so leverage it. +Some(ep -> ep.stream) Review Comment: Oh, got it. I saw that there is a `onlyDataSourceV2Sources` and a dedicated block for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest
HyukjinKwon commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1320711448 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python
HyukjinKwon closed pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python URL: https://github.com/apache/spark/pull/38718 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python
HyukjinKwon commented on PR #38718: URL: https://github.com/apache/spark/pull/38718#issuecomment-1320707595 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used
HeartSaVioR commented on code in PR #38719: URL: https://github.com/apache/spark/pull/38719#discussion_r1026994040 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -345,7 +345,14 @@ trait ProgressReporter extends Logging { val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves() if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) { val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap { - case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep -> source } + case (_, ep: MicroBatchScanExec) => +// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder instance for DSv2 +// streaming source, hence we cannot lookup the actual source from the map. +// The physical node for DSv2 streaming source contains the information of the source +// by itself, so leverage it. +Some(ep -> ep.stream) Review Comment: So all the problems from streaming query metrics are due to DSv1 which we do not have a dedicated logical node / physical node for source. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used
HeartSaVioR commented on code in PR #38719: URL: https://github.com/apache/spark/pull/38719#discussion_r1026993747 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -345,7 +345,14 @@ trait ProgressReporter extends Logging { val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves() if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) { val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap { - case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep -> source } + case (_, ep: MicroBatchScanExec) => +// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder instance for DSv2 +// streaming source, hence we cannot lookup the actual source from the map. +// The physical node for DSv2 streaming source contains the information of the source +// by itself, so leverage it. +Some(ep -> ep.stream) Review Comment: We have two different paths - if there are only DSv2 streaming sources, we don't even try to match the logical plan and physical plan. We just collect the metrics out from physical plan, which is always accurate. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L296-L316 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell opened a new pull request, #38720: [SPARK-41165][SPARK-41184][CONNECT] Fix arrow collect (again) and reenable tests.
hvanhovell opened a new pull request, #38720: URL: https://github.com/apache/spark/pull/38720 ### What changes were proposed in this pull request? The arrow collect code path for connect contains a bug where it would always fall back to JSON. This was caused by the assumption that `NonFatal(e)` does not match nulls, it unfortunately does. This has been fixed by doing explicit null checks and by reordering the checks in `SparkConnectStreamHandler.processAsArrowBatches`. ### Why are the changes needed? The previous code had a bug and would always fallback to JSON. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I added a new test, and I re-enabled the python test disabled in SPARK-41184. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used
viirya commented on code in PR #38719: URL: https://github.com/apache/spark/pull/38719#discussion_r1026993007 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -345,7 +345,14 @@ trait ProgressReporter extends Logging { val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves() if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) { val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap { - case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep -> source } + case (_, ep: MicroBatchScanExec) => +// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder instance for DSv2 +// streaming source, hence we cannot lookup the actual source from the map. +// The physical node for DSv2 streaming source contains the information of the source +// by itself, so leverage it. +Some(ep -> ep.stream) Review Comment: Why it is only for DS v1 mixing with DS v2? Seems DS v2 stream source always cannot be matched by `logicalPlanLeafToSource` because it is `OffsetHolder`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #36695: [SPARK-38474][CORE] Use error class in org.apache.spark.security
github-actions[bot] closed pull request #36695: [SPARK-38474][CORE] Use error class in org.apache.spark.security URL: https://github.com/apache/spark/pull/36695 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36767: [SPARK-39363][K8S] Deprecate k8s memory overhead and make it optional
github-actions[bot] commented on PR #36767: URL: https://github.com/apache/spark/pull/36767#issuecomment-1320692121 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files
github-actions[bot] closed pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files URL: https://github.com/apache/spark/pull/37359 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
github-actions[bot] closed pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join URL: https://github.com/apache/spark/pull/37129 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #37460: [WIP][SPARK-40031][SQL] Remove unnecessary TryEval in TryCast
github-actions[bot] commented on PR #37460: URL: https://github.com/apache/spark/pull/37460#issuecomment-1320692056 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026987556 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala: ## @@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase { TypeCheckSuccess } } + + /** + * Combine a number of boolean expressions into a balanced expression tree. These expressions are + * either combined by a logical [[And]] or a logical [[Or]]. + * + * A balanced binary tree is created because regular left recursive trees cause considerable + * performance degradations and can cause stack overflows. + */ + def reduceToExpressionTree( + patterns: Seq[Expression], Review Comment: `patterns` -> `expressions`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #38716: [SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch
viirya commented on PR #38716: URL: https://github.com/apache/spark/pull/38716#issuecomment-1320636937 retest this please -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liuzqt commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollec
liuzqt commented on code in PR #38704: URL: https://github.com/apache/spark/pull/38704#discussion_r1026966457 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest with SharedSparkSession { override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g") - test("collect data with single partition larger than 2GB bytes array limit") { + // SPARK-41193: Ignore this suite because it cannot run successfully with Spark + // default Java Options, if user need do local test, please make the following changes: + // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in `sql/core/pom.xml` to `-Xmx10g` + // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` to `-Xmx10g` + ignore("collect data with single partition larger than 2GB bytes array limit") { Review Comment: Yes @LuciferYang is right, need to change `-Xmx4g` to `-Xmx10g` to make it work (it works for both shared local session and local cluster, but without the change neither work). Thanks for the fix! Previously I only tested this using IDE and I guess it increased the mem under the hood..Sorry for the inconvenience. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38702: [SPARK-41187][Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
AmplabJenkins commented on PR #38702: URL: https://github.com/apache/spark/pull/38702#issuecomment-1320614168 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist
AmplabJenkins commented on PR #38703: URL: https://github.com/apache/spark/pull/38703#issuecomment-1320614138 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tedyu commented on pull request #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release
tedyu commented on PR #38715: URL: https://github.com/apache/spark/pull/38715#issuecomment-1320568475 @HeartSaVioR Can you take a look ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used
HeartSaVioR commented on PR #38719: URL: https://github.com/apache/spark/pull/38719#issuecomment-1320531883 cc. @zsxwing @viirya @xuanyuanking Please take a look. Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR commented on PR #38717: URL: https://github.com/apache/spark/pull/38717#issuecomment-1320531472 cc. @zsxwing @cloud-fan @viirya Please take a look. Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on pull request #35969: [SPARK-38651][SQL] Add configuration to support writing out empty schemas in supported filebased datasources
xkrogen commented on PR #35969: URL: https://github.com/apache/spark/pull/35969#issuecomment-1320527156 @cloud-fan , any more concerns on this approach based on what @thejdeep shared? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR opened a new pull request, #38719: [SPARK-41999][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used
HeartSaVioR opened a new pull request, #38719: URL: https://github.com/apache/spark/pull/38719 ### What changes were proposed in this pull request? This PR proposes to fix the metrics issue for streaming query when DSv1 streaming source and DSv2 streaming source are co-used. If the streaming query has both DSv1 streaming source and DSv2 streaming source, only DSv1 streaming source produced correct metrics. There is a bug in ProgressReporter that it tries to match logical node for DSv2 streaming source with OffsetHolder, which will be never matched. Given that physical node for DSv2 streaming source contains both source information and metrics, we can simply deduce all the necessary information from the physical node rather than trying to find the source from association map. ### Why are the changes needed? The logic of collecting metrics does not collect metrics for DSv2 streaming sources properly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python
amaliujia commented on PR #38718: URL: https://github.com/apache/spark/pull/38718#issuecomment-1320510474 @zhengruifeng @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python
amaliujia opened a new pull request, #38718: URL: https://github.com/apache/spark/pull/38718 ### What changes were proposed in this pull request? Fix out of sync generated files for Python. This happens on a rare case for protobuf version change. There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However the PR was based on old code before https://github.com/apache/spark/pull/38638 so the protobuf generates based on stale code which leads to stale generated files. ### Why are the changes needed? Fix out of sync generated files for Python. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078
MaxGekk commented on PR #38696: URL: https://github.com/apache/spark/pull/38696#issuecomment-1320493195 cc @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR opened a new pull request, #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR opened a new pull request, #38717: URL: https://github.com/apache/spark/pull/38717 ### What changes were proposed in this pull request? This PR proposes to fix the broken metrics when the streaming query has CTE, via applying InlineCTE manually against analyzed plan when collecting metrics. Suppose a streaming query contains below part as batch side which is joined with streaming source: ``` with batch_tbl as ( SELECT col1, col2 FROM parquet_tbl ) SELECT col1 AS key, col2 as value_batch FROM batch_tbl ``` Currently, Spark adds WithCTE node with CTERelationDef and CTERelationRef when there is a usage of CTE. Below is an analyzed plan: ``` WriteToMicroBatchDataSource MemorySink, 2cbb2afa-6513-4a23-b4c2-37910fc9cdf9, Append, 0 +- Project [key#15, value_stream#16, value_batch#9L] +- Join Inner, (cast(key#15 as bigint) = key#8L) :- SubqueryAlias spark_catalog.default.parquet_streaming_tbl : +- Project [key#55 AS key#15, value_stream#56 AS value_stream#16] : +- Relation spark_catalog.default.parquet_streaming_tbl[key#55,value_stream#56] parquet +- WithCTE :- CTERelationDef 0, false : +- SubqueryAlias batch_tbl : +- Project [col1#10L, col2#11L] :+- SubqueryAlias spark_catalog.default.parquet_tbl : +- Relation spark_catalog.default.parquet_tbl[col1#10L,col2#11L] parquet +- Project [col1#10L AS key#8L, col2#11L AS value_batch#9L] +- SubqueryAlias batch_tbl +- CTERelationRef 0, true, [col1#10L, col2#11L] ``` Here, there are 3 leaf nodes in the plan, but the actual sources in the leaf nodes are 2. During the optimization, inlining CTE happens and there are 2 leaf nodes. Below is the optimized plan: ``` WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@622c7c7f] +- Project [key#55, value_stream#56, value_batch#9L] +- Join Inner, (cast(key#55 as bigint) = key#8L) :- Filter isnotnull(key#55) : +- Relation spark_catalog.default.parquet_streaming_tbl[key#55,value_stream#56] parquet +- Project [col1#10L AS key#8L, col2#11L AS value_batch#9L] +- Filter isnotnull(col1#10L) +- Relation spark_catalog.default.parquet_tbl[col1#10L,col2#11L] parquet ``` Hence executed plan will also have 2 leaf nodes, which does not match with the number of leaf nodes in analyzed plan, and ProgressReporter will give up collecting metrics. Applying InlineCTE against analyzed plan during collecting metrics would resolve this. For example, below is the logical plan which applies InlineCTE against above analyzed plan. ``` WriteToMicroBatchDataSource MemorySink, 2cbb2afa-6513-4a23-b4c2-37910fc9cdf9, Append, 0 +- Project [key#15, value_stream#16, value_batch#9L] +- Join Inner, (cast(key#15 as bigint) = key#8L) :- SubqueryAlias spark_catalog.default.parquet_streaming_tbl : +- Project [key#55 AS key#15, value_stream#56 AS value_stream#16] : +- Relation spark_catalog.default.parquet_streaming_tbl[key#55,value_stream#56] parquet +- Project [col1#10L AS key#8L, col2#11L AS value_batch#9L] +- SubqueryAlias batch_tbl +- SubqueryAlias batch_tbl +- Project [col1#10L, col2#11L] +- SubqueryAlias spark_catalog.default.parquet_tbl +- Relation spark_catalog.default.parquet_tbl[col1#10L,col2#11L] parquet ``` Note that this is only required for the case where there is at least one of DSv1 streaming source in the streaming query. If streaming query only contains DSv2 data sources as streaming sources, ProgressReporter can just pick up dedicated physical node(s) from executed plan. ### Why are the changes needed? The metrics in streaming query are broken if the query contains CTE. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya opened a new pull request, #38716: [SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch
viirya opened a new pull request, #38716: URL: https://github.com/apache/spark/pull/38716 ### What changes were proposed in this pull request? This patch changes `currentBatchId` when `MicroBatchExecution` tries to resume from late batch from offset log. Previously it takes `latestBatchId` from offset log. This patch changes it to `latestCommittedBatchId`. ### Why are the changes needed? We have customer streaming job which is unable to restart from failed status while it failed to commit delta files. For example, if previous run failed to commit 14.delta, when the job restarted, it tried to read 14.delta. Because 14.delta doesn't exist (not committed), the job cannot be restarted and resume from late batch. When `MicroBatchExecution` populates start offsets, it reads late batch id (`latestBatchId) from offset log and committed batch id (`latestCommittedBatchId`) from commit log. Currently if `latestCommittedBatchId` == `latestBatchId` - 1, it means that we resume from late batch. But it uses `latestBatchId` as `currentBatchId` to run batch. Obviously, `latestBatchId` is 14 for above example, and `latestCommittedBatchId` is 13. Because `IncrementalExecution` uses `currentBatchId` to load checkpointed states, it tries to load version 14 of delta files. But version 14 is not committed in late run. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] geofflangenderfer commented on pull request #4093: [SPARK-5307] SerializationDebugger to help debug NotSerializableException
geofflangenderfer commented on PR #4093: URL: https://github.com/apache/spark/pull/4093#issuecomment-1320457789 could someone give a simple example of how to read the graph? I'm not sure where to start -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching
ryan-johnson-databricks commented on code in PR #38692: URL: https://github.com/apache/spark/pull/38692#discussion_r1026804602 ## sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala: ## @@ -192,6 +192,23 @@ class SparkSessionExtensionSuite extends SparkFunSuite with SQLHelper { testInjectColumnar(false) } + test("inject plan normalization rules") { +val extensions = create { extensions => + extensions.injectPlanNormalizationRules { session => +org.apache.spark.sql.catalyst.optimizer.PushDownPredicates + } +} +withSession(extensions) { session => + import session.implicits._ + val df = Seq((1, "a"), (2, "b")).toDF("i", "s") + df.select("i").filter($"i" > 1).cache() + assert(df.filter($"i" > 1).select("i").queryExecution.executedPlan.find { +case _: org.apache.spark.sql.execution.columnar.InMemoryTableScanExec => true +case _ => false Review Comment: Should we add a negative test that verifies this? Might be overkill... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
MaxGekk closed pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions URL: https://github.com/apache/spark/pull/38705 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
MaxGekk commented on PR #38705: URL: https://github.com/apache/spark/pull/38705#issuecomment-1320429350 +1, LGTM. Merging to master. Thank you, @LuciferYang. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tedyu opened a new pull request, #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release
tedyu opened a new pull request, #38715: URL: https://github.com/apache/spark/pull/38715 ### What changes were proposed in this pull request? This PR upgrades Kafka to 3.3.0 release. ### Why are the changes needed? Kafka 3.3.0 release has new features along with bug fixes: https://www.confluent.io/blog/apache-kafka-3-3-0-new-features-and-updates/ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ahshahid opened a new pull request, #38714: [WIP][SPARK-41141]. avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
ahshahid opened a new pull request, #38714: URL: https://github.com/apache/spark/pull/38714 ### What changes were proposed in this pull request? This is a PR for improvement When a subquery references the outer query's aggregate functions, in some cases, it ends up introducing extra aggregate functions which are not needed. Though they would get eventually eliminated in the optimizer, but atleast in analyzer phase would add an extra project node etc. The change is in the code of identification of OuterReference in subquery.scala. Currently whenever an aggregate expression is found, it is assumed to be the Outer Reference. With this change, the code checks whether the parent Expression can also be potentially part of the OuterReference too. So if we consider a query select cos (sum (a) ) , b from t1 having exists select 1 from t2 where x = cos ( sum(a) ) the OuterReference detected would be cos ( sum(a) ) instead of just sum(a). As a result, no extra aggregate would be added. ### Why are the changes needed? To avoid adding unnecessary aggregate in outer query thereby reducing the number of expressions to analyze, clone and also avoid adding an extra project node. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the precheckin tests and added new tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version.
amaliujia commented on PR #38693: URL: https://github.com/apache/spark/pull/38693#issuecomment-1320392539 LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version.
hvanhovell closed pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version. URL: https://github.com/apache/spark/pull/38693 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
otterc commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1320389162 Looks good to me -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042
AmplabJenkins commented on PR #38707: URL: https://github.com/apache/spark/pull/38707#issuecomment-1320346254 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on a diff in pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
EnricoMi commented on code in PR #38676: URL: https://github.com/apache/spark/pull/38676#discussion_r1026695873 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1938,7 +1940,10 @@ case class LateralJoin( joinType: JoinType, condition: Option[Expression]) extends UnaryNode { - require(Seq(Inner, LeftOuter, Cross).contains(joinType), + require(Seq(Inner, LeftOuter, Cross).contains(joinType match { Review Comment: just needed by `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/DeduplicateRelationsSuite.scala`: val originalQuery = left.lateralJoin(right, UsingJoin(Inner, Seq("a"))) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
EnricoMi commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1320300231 Problem is that `DeduplicateRelations` is only considering duplicates between left `output` and right `output`, and not duplicates between left `references` and right `output`. I have sketched a fix for `Join` and `LateralJoin`, including a proper test. This could potentially be done for all operators specifically handled in `DeduplicateRelations.apply()`, i.e. `AsOfJoin`, `Intersect`, `Except`, `Union` and `MergeIntoTable`. Deduplicating attributes that are already referenced will break the plan as those references break. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultColle
mridulm commented on code in PR #38704: URL: https://github.com/apache/spark/pull/38704#discussion_r1026679895 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest with SharedSparkSession { override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g") - test("collect data with single partition larger than 2GB bytes array limit") { + // SPARK-41193: Ignore this suite because it cannot run successfully with Spark + // default Java Options, if user need do local test, please make the following changes: + // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in `sql/core/pom.xml` to `-Xmx10g` + // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` to `-Xmx10g` + ignore("collect data with single partition larger than 2GB bytes array limit") { Review Comment: @liuzqt, I know this was iterated on multiple times to get it to work - instead of the shared local spark session, did it work locally when using a local spark cluster instead ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38699: [SPARK-41188][CORE][ML] Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
mridulm commented on PR #38699: URL: https://github.com/apache/spark/pull/38699#issuecomment-1320294062 If we are setting it in `SparkContext`, do we want to get rid of this from other places like `PythonRunner.compute` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on a diff in pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children
srielau commented on code in PR #38713: URL: https://github.com/apache/spark/pull/38713#discussion_r1026672022 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -697,12 +697,12 @@ setQuantifier ; relation -: LATERAL? relationPrimary joinRelation* +: LATERAL? relationPrimary pivotClause? unpivotClause? joinRelation* ; joinRelation -: (joinType) JOIN LATERAL? right=relationPrimary joinCriteria? -| NATURAL joinType JOIN LATERAL? right=relationPrimary +: (joinType) JOIN LATERAL? right=relationPrimary joinCriteria? pivotClause? unpivotClause? +| NATURAL joinType JOIN LATERAL? right=relationPrimary pivotClause? unpivotClause? Review Comment: Have you tried simply: ```suggestion : (joinType) JOIN LATERAL? right=relationPrimary joinCriteria? pivotClause? unpivotClause? | NATURAL joinType JOIN LATERAL? right=relationPrimary | pivotClause | unpivotClause ``` Also removing the relation entries above? ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -697,12 +697,12 @@ setQuantifier ; relation -: LATERAL? relationPrimary joinRelation* +: LATERAL? relationPrimary pivotClause? unpivotClause? joinRelation* Review Comment: This doesn't look right. As you say PIVOT and UNPIVOT are like JOINs, so why are do you have this strange ordered optional clauses instead of merging them in as if they were another type of join. I can't put my finger on exactly how it breaks without running a bunch of tests. For example I should be able to chain a string UNPIVOTs and PIVOTs in any order without the need for a JOIN (or a braces) in between. I don't see how that works here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] antonipp commented on a diff in pull request #38376: [SPARK-40817] [Kubernetes] Do not discard remote user-specified files when launching Spark jobs on Kubernetes
antonipp commented on code in PR #38376: URL: https://github.com/apache/spark/pull/38376#discussion_r1026638180 ## core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala: ## @@ -1609,6 +1609,16 @@ class TestFileSystem extends org.apache.hadoop.fs.LocalFileSystem { } override def open(path: Path): FSDataInputStream = super.open(local(path)) + + // No-op methods + + override def copyFromLocalFile( Review Comment: Well, it took me a a bit of time to get to it but I finally have a working local setup and I wrote an integration test for this PR. I added it in https://github.com/apache/spark/pull/38376/commits/cbda5403b00f4e66ad4651ceda7786de1f5e1e1a It worked for me locally with: ```bash mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \ -Pkubernetes -Pkubernetes-integration-tests -Phadoop-3 \ -Dspark.kubernetes.test.sparkTgz=/path/to/the/tgz/built/from/this/branch \ -Dspark.kubernetes.test.deployMode=minikube \ -Dspark.kubernetes.test.imageRepo=docker.io/kubespark \ -Dspark.kubernetes.test.namespace=spark \ -Dspark.kubernetes.test.serviceAccountName=spark ``` @dongjoon-hyun @holdenk, would appreciate your reviews as well 🙏 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092
AmplabJenkins commented on PR #38710: URL: https://github.com/apache/spark/pull/38710#issuecomment-1320213406 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
AmplabJenkins commented on PR #38711: URL: https://github.com/apache/spark/pull/38711#issuecomment-1320213339 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching
cloud-fan commented on code in PR #38692: URL: https://github.com/apache/spark/pull/38692#discussion_r1026569958 ## sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala: ## @@ -217,6 +218,22 @@ class SparkSessionExtensions { checkRuleBuilders += builder } + private[this] val planNormalizationRules = mutable.Buffer.empty[RuleBuilder] + + def buildPlanNormalizationRules(session: SparkSession): Seq[Rule[LogicalPlan]] = { +planNormalizationRules.map(_.apply(session)).toSeq Review Comment: I'm following the existing code style in this file. I assume the reason is people who are not familiar with Scala may be confused when reading the code `.map(_(session))` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching
ryan-johnson-databricks commented on code in PR #38692: URL: https://github.com/apache/spark/pull/38692#discussion_r1026540525 ## sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala: ## @@ -217,6 +218,22 @@ class SparkSessionExtensions { checkRuleBuilders += builder } + private[this] val planNormalizationRules = mutable.Buffer.empty[RuleBuilder] + + def buildPlanNormalizationRules(session: SparkSession): Seq[Rule[LogicalPlan]] = { +planNormalizationRules.map(_.apply(session)).toSeq Review Comment: nit: isn't `.apply(...)` redundant with just `(...)` ? ## sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala: ## @@ -105,12 +105,30 @@ class QueryExecution( case other => other } + lazy val normalized: LogicalPlan = { +val normalizationRules = sparkSession.sessionState.planNormalizationRules +if (normalizationRules.isEmpty) { + commandExecuted +} else { + val planChangeLogger = new PlanChangeLogger[LogicalPlan]() + val normalized = normalizationRules.foldLeft(commandExecuted)((p, rule) => { Review Comment: nit: `(... => { ... })` is equivalent to just `{ ... => ... }` in this context -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children
cloud-fan commented on code in PR #38713: URL: https://github.com/apache/spark/pull/38713#discussion_r1026557894 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/UnpivotParserSuite.scala: ## @@ -192,4 +193,131 @@ class UnpivotParserSuite extends AnalysisTest { ) } + test("unpivot - with joins") { Review Comment: I didn't add tests for pivot because: 1. there is no pivot parser suite 2. pivot/unpivot syntax is exactly the same regarding joins, no need to test both -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children
cloud-fan commented on PR #38713: URL: https://github.com/apache/spark/pull/38713#issuecomment-1320161201 cc @viirya @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan opened a new pull request, #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children
cloud-fan opened a new pull request, #38713: URL: https://github.com/apache/spark/pull/38713 ### What changes were proposed in this pull request? Today, our SQL parser only supports PIVOT/UNPIVOT at the end of the FROM clause. This is quite limited and it's better to allow PIVOT/UNPIVOT in the join children as well. As a reference, snowflake supports it: https://docs.snowflake.com/en/sql-reference/constructs/from.html This PR supports the following SQL syntaxes: ``` FROM t1 PIVOT/UNPIVOT ... JOIN t2 // pivot/unpivot the left table FROM t1 JOIN t2 PIVOT/UNPIVOT ... // pivot/unpivot the join result. This is the same before this PR FROM t1 JOIN (t2 PIVOT/UNPIVOT ...) // pivot/unpivot the right table ``` ### Why are the changes needed? make PIVOT/UNPIVOT syntax more flexible ### Does this PR introduce _any_ user-facing change? Yes, new SQL syntax without any breaking change ### How was this patch tested? new parser tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026534660 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/LikeAnyBenchmark.scala: ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.Random + +import org.apache.spark.benchmark.Benchmark +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.internal.SQLConf + +/** + * Benchmark to measure like any expressions performance. + * + * To run this benchmark: + * {{{ + * 1. without sbt: bin/spark-submit --class + * --jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/LikeAnyBenchmark-results.txt". + * }}} + */ +object LikeAnyBenchmark extends SqlBasedBenchmark { Review Comment: Please remove this benchmark test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026531799 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala: ## @@ -207,11 +207,17 @@ class LikeSimplificationSuite extends PlanTest { val optimized = Optimize.execute(originalQuery.analyze) val correctAnswer = testRelation - .where((StartsWith($"a", "abc") || EndsWith($"a", "xyz")) || -(Length($"a") >= 6 && (StartsWith($"a", "abc") && EndsWith($"a", "def" || -Contains($"a", "mn")) || ($"a" === "")) || ($"a" === "abc")) || -($"a" likeAny("abc\\%", "abc\\%def", "%mn\\%"))) - .analyze + .where( +( + ( Review Comment: Please fix the format. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38075: [WIP][SPARK-40633][BUILD] Upgrade janino to 3.1.8
LuciferYang commented on code in PR #38075: URL: https://github.com/apache/spark/pull/38075#discussion_r1026500589 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -1310,7 +1310,7 @@ case class CatalystToExternalMap private( val tupleClass = classOf[(_, _)].getName val appendToBuilder = s""" - $tupleClass $tupleLoopValue; + $tupleClass $tupleLoopValue = null; Review Comment: The root cause of the change has not been investigated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38075: [WIP][SPARK-40633][BUILD] Upgrade janino to 3.1.8
LuciferYang commented on code in PR #38075: URL: https://github.com/apache/spark/pull/38075#discussion_r1026497690 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -1310,7 +1310,7 @@ case class CatalystToExternalMap private( val tupleClass = classOf[(_, _)].getName val appendToBuilder = s""" - $tupleClass $tupleLoopValue; + $tupleClass $tupleLoopValue = null; Review Comment: init to null for fix ``` /* 063 */ scala.Tuple2 tupleLoopValue_0; /* 064 */ /* 065 */ if (false) { /* 066 */ tupleLoopValue_0 = new scala.Tuple2(value_CatalystToExternalMap_key_lambda_variable_1, null); /* 067 */ } else { /* 068 */ tupleLoopValue_0 = new scala.Tuple2(value_CatalystToExternalMap_key_lambda_variable_1, value_CatalystToExternalMap_value_lambda_variable_2); /* 069 */ } /* 070 */ /* 071 */ builderValue_0.$plus$eq(tupleLoopValue_0); /* 072 */ ``` ``` Caused by: org.codehaus.commons.compiler.InternalCompilerException: File 'generated.java', Line 71, Column 16: Compiling "builderValue_0.$plus$eq(tupleLoopValue_0)" at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5819) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:4053) at org.codehaus.janino.UnitCompiler.access$6100(UnitCompiler.java:236) at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:4028) at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:4003) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5470) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:4003) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2487) ... 155 more Caused by: org.codehaus.commons.compiler.InternalCompilerException: File 'generated.java', Line 71, Column 25: Compiling "tupleLoopValue_0" at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5819) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5323) at org.codehaus.janino.UnitCompiler.access$9300(UnitCompiler.java:236) at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4698) at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4674) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5470) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4674) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5817) ... 162 more Caused by: org.codehaus.commons.compiler.InternalCompilerException: Invalid local variable index 10 at org.codehaus.janino.UnitCompiler.getLocalVariableTypeInfo(UnitCompiler.java:13498) at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:12500) at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:12475) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4745) at org.codehaus.janino.UnitCompiler.access$8200(UnitCompiler.java:236) at org.codehaus.janino.UnitCompiler$16$1.visitLocalVariableAccess(UnitCompiler.java:4684) at org.codehaus.janino.UnitCompiler$16$1.visitLocalVariableAccess(UnitCompiler.java:4678) at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:4661) at org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4678) at org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4674) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4528) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4674) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4741) at org.codehaus.janino.UnitCompiler.access$7700(UnitCompiler.java:236) at org.codehaus.janino.UnitCompiler$16$1.visitAmbiguousName(UnitCompiler.java:4679) at org.codehaus.janino.UnitCompiler$16$1.visitAmbiguousName(UnitCompiler.java:4678) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4603) at org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4678) at org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4674) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4528) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4674) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5817) ... 169 more ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
LuciferYang commented on code in PR #38705: URL: https://github.com/apache/spark/pull/38705#discussion_r1026475352 ## sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out: ## @@ -14,7 +29,7 @@ select format_string() struct<> -- !query output org.apache.spark.sql.AnalysisException -requirement failed: format_string() should take at least 1 argument; line 1 pos 7 +0; line 1 pos 7 Review Comment: fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
LuciferYang commented on code in PR #38705: URL: https://github.com/apache/spark/pull/38705#discussion_r1026474420 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala: ## @@ -1662,8 +1675,7 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression = Litera // scalastyle:on line.size.limit case class FormatString(children: Expression*) extends Expression with ImplicitCastInputTypes { - require(children.nonEmpty, s"$prettyName() should take at least 1 argument") - if (!SQLConf.get.getConf(SQLConf.ALLOW_ZERO_INDEX_IN_FORMAT_STRING)) { + if (children.nonEmpty && !SQLConf.get.getConf(SQLConf.ALLOW_ZERO_INDEX_IN_FORMAT_STRING)) { Review Comment: My mistake, need to add nonEmpty condition -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026470585 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala: ## @@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase { TypeCheckSuccess } } + + /** + * Combine a number of boolean expressions into a balanced expression tree. These expressions are + * either combined by a logical [[And]] or a logical [[Or]]. + * + * A balanced binary tree is created because regular left recursive trees cause considerable + * performance degradations and can cause stack overflows. + */ + def reduceToExpressionTree( + patterns: Seq[Expression], + expressionCombiner: (Expression, Expression) => Expression): Expression = { +assert(patterns.size > 0) +var res = patterns +while (res.size > 1) { + res = res.sliding(2, 2).toSeq +.map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0)) +} +res.head Review Comment: It seems that the original implementation is better than this one? ```scala def reduceToExpressionTree1( patterns: Seq[Expression], expressionCombiner: (Expression, Expression) => Expression): Expression = { var res = patterns while (res.size > 1) { res = res.sliding(2, 2).toSeq .map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0)) } res.head } def reduceToExpressionTree2( expressions: Seq[Expression], expressionCombiner: (Expression, Expression) => Expression): Expression = { def reduceToExpressionTree(low: Int, high: Int): Expression = high - low match { case 0 => expressions(low) case 1 => expressionCombiner(expressions(low), expressions(high)) case x => val mid = low + x / 2 expressionCombiner( reduceToExpressionTree(low, mid), reduceToExpressionTree(mid + 1, high)) } reduceToExpressionTree(0, expressions.size - 1) } Seq(1, 10, 100).foreach { N => val expressions = Range(1, N).map(Literal(_)) val benchmark = new Benchmark(s"Benchmark reduceToExpressionTree with $N elements", N, minNumIters = 10) benchmark.addCase("reduceToExpressionTree1") { _ => reduceToExpressionTree1(expressions, Or.apply) } benchmark.addCase("reduceToExpressionTree2") { _ => reduceToExpressionTree2(expressions, Or.apply) } benchmark.run() } ``` ``` OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz Benchmark reduceToExpressionTree with 100 elements: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- reduceToExpressionTree1 823 1012 212 1.2 822.5 1.0X reduceToExpressionTree2 150 183 25 6.7 149.7 5.5X ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026470585 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala: ## @@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase { TypeCheckSuccess } } + + /** + * Combine a number of boolean expressions into a balanced expression tree. These expressions are + * either combined by a logical [[And]] or a logical [[Or]]. + * + * A balanced binary tree is created because regular left recursive trees cause considerable + * performance degradations and can cause stack overflows. + */ + def reduceToExpressionTree( + patterns: Seq[Expression], + expressionCombiner: (Expression, Expression) => Expression): Expression = { +assert(patterns.size > 0) +var res = patterns +while (res.size > 1) { + res = res.sliding(2, 2).toSeq +.map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0)) +} +res.head Review Comment: seems that the original implementation is better than this one? ```scala def reduceToExpressionTree1( patterns: Seq[Expression], expressionCombiner: (Expression, Expression) => Expression): Expression = { var res = patterns while (res.size > 1) { res = res.sliding(2, 2).toSeq .map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) else tup(0)) } res.head } def reduceToExpressionTree2( expressions: Seq[Expression], expressionCombiner: (Expression, Expression) => Expression): Expression = { def reduceToExpressionTree(low: Int, high: Int): Expression = high - low match { case 0 => expressions(low) case 1 => expressionCombiner(expressions(low), expressions(high)) case x => val mid = low + x / 2 expressionCombiner( reduceToExpressionTree(low, mid), reduceToExpressionTree(mid + 1, high)) } reduceToExpressionTree(0, expressions.size - 1) } Seq(1, 10, 100).foreach { N => val expressions = Range(1, N).map(Literal(_)) val benchmark = new Benchmark(s"Benchmark reduceToExpressionTree with $N elements", N, minNumIters = 10) benchmark.addCase("reduceToExpressionTree1") { _ => reduceToExpressionTree1(expressions, Or.apply) } benchmark.addCase("reduceToExpressionTree2") { _ => reduceToExpressionTree2(expressions, Or.apply) } benchmark.run() } ``` ``` OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz Benchmark reduceToExpressionTree with 100 elements: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- reduceToExpressionTree1 823 1012 212 1.2 822.5 1.0X reduceToExpressionTree2 150 183 25 6.7 149.7 5.5X ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on PR #38711: URL: https://github.com/apache/spark/pull/38711#issuecomment-1320030900 cc @mridulm @Ngone51 FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching
cloud-fan commented on code in PR #38692: URL: https://github.com/apache/spark/pull/38692#discussion_r1026457875 ## sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala: ## @@ -192,6 +192,23 @@ class SparkSessionExtensionSuite extends SparkFunSuite with SQLHelper { testInjectColumnar(false) } + test("inject plan normalization rules") { +val extensions = create { extensions => + extensions.injectPlanNormalizationRules { session => +org.apache.spark.sql.catalyst.optimizer.PushDownPredicates + } +} +withSession(extensions) { session => + import session.implicits._ + val df = Seq((1, "a"), (2, "b")).toDF("i", "s") + df.select("i").filter($"i" > 1).cache() + assert(df.filter($"i" > 1).select("i").queryExecution.executedPlan.find { +case _: org.apache.spark.sql.execution.columnar.InMemoryTableScanExec => true +case _ => false Review Comment: yup -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #38497: [SPARK-40999] Hint propagation to subqueries
cloud-fan closed pull request #38497: [SPARK-40999] Hint propagation to subqueries URL: https://github.com/apache/spark/pull/38497 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38497: [SPARK-40999] Hint propagation to subqueries
cloud-fan commented on PR #38497: URL: https://github.com/apache/spark/pull/38497#issuecomment-1320017014 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
cloud-fan commented on PR #38687: URL: https://github.com/apache/spark/pull/38687#issuecomment-1320010391 there is another cache in `SessionCatalog.tableRelationCache`, shall we update it as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38706: [TEST ONLY] Come back to collect.foreach(send)
HyukjinKwon commented on code in PR #38706: URL: https://github.com/apache/spark/pull/38706#discussion_r1026369143 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -57,13 +55,7 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[ExecutePlanResp // Extract the plan from the request and convert it to a logical plan val planner = new SparkConnectPlanner(session) val dataframe = Dataset.ofRows(session, planner.transformRelation(request.getPlan.getRoot)) -try { - processAsArrowBatches(request.getClientId, dataframe) -} catch { - case e: Exception => -logWarning(e.getMessage) -processAsJsonBatches(request.getClientId, dataframe) -} +processAsArrowBatches(request.getClientId, dataframe) } def processAsJsonBatches(clientId: String, dataframe: DataFrame): Unit = { Review Comment: let's also remove this and protobuf definition -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1026348675 ## sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala: ## @@ -21,24 +21,22 @@ import java.nio.charset.StandardCharsets import java.sql.{Date, Timestamp} import java.text.SimpleDateFormat import java.util.Locale - Review Comment: Let's keep these newlines. I think Scala linter would complain about this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1
HyukjinKwon closed pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1 URL: https://github.com/apache/spark/pull/38675 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1
HyukjinKwon commented on PR #38675: URL: https://github.com/apache/spark/pull/38675#issuecomment-1319891935 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026330362 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -743,6 +743,21 @@ object LikeSimplification extends Rule[LogicalPlan] { } } + def combinePatterns( Review Comment: Remove it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
MaxGekk commented on code in PR #38705: URL: https://github.com/apache/spark/pull/38705#discussion_r1026301316 ## sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out: ## @@ -14,7 +29,7 @@ select format_string() struct<> -- !query output org.apache.spark.sql.AnalysisException -requirement failed: format_string() should take at least 1 argument; line 1 pos 7 +0; line 1 pos 7 Review Comment: hmm, why is that? We should see similar error as for `select concat_ws()`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
EnricoMi commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1319839395 > Could we fix the `DeduplicateRelations`? Interesting, that sounds like a better solution. I'll look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #38688: [SPARK-41166][SQL][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites
MaxGekk closed pull request #38688: [SPARK-41166][SQL][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites URL: https://github.com/apache/spark/pull/38688 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38688: [SPARK-41166][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites
MaxGekk commented on PR #38688: URL: https://github.com/apache/spark/pull/38688#issuecomment-1319815373 +1, LGTM. Merging to master. Thank you, @panbingkun. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance
wangyum commented on code in PR #38682: URL: https://github.com/apache/spark/pull/38682#discussion_r1026264942 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -756,16 +771,16 @@ object LikeSimplification extends Rule[LogicalPlan] { } else { multi match { case l: LikeAll => - val and = replacements.reduceLeft(And) + val and = combinePatterns(replacements, And.apply).get Review Comment: Could you reuse these code to create a balanced tree: https://github.com/apache/spark/blob/5c43da6858721664318c3cdbcb051231b0e98175/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L1657-L1669 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #38664: [SPARK-41147][SQL] Assign a name to the legacy error class `_LEGACY_ERROR_TEMP_1042`
MaxGekk commented on code in PR #38664: URL: https://github.com/apache/spark/pull/38664#discussion_r1026264084 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -146,7 +147,10 @@ object FunctionRegistryBase { .filter(_.getParameterTypes.forall(_ == classOf[Expression])) .map(_.getParameterCount).distinct.sorted throw QueryCompilationErrors.invalidFunctionArgumentNumberError( -validParametersCount, name, params.length) +expressions.map(toPrettySQL(_)).mkString(","), Review Comment: Could you pass `Seq[Expression]` into `invalidFunctionArgumentNumberError()`, and form a string inside of the method as we do in other places like: ```scala def moreThanOneGeneratorError(generators: Seq[Expression], clause: String): Throwable = { new AnalysisException( errorClass = "UNSUPPORTED_GENERATOR.MULTI_GENERATOR", messageParameters = Map( "clause" -> clause, "num" -> generators.size.toString, "generators" -> generators.map(toSQLExpr).mkString(", "))) } ``` For instance, if we change the implementation of `toSQLExpr()`, your exceptions won't pick up new impl. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #38650: [SPARK-41135][SQL] Rename `UNSUPPORTED_EMPTY_LOCATION` to `INVALID_EMPTY_LOCATION`
MaxGekk commented on code in PR #38650: URL: https://github.com/apache/spark/pull/38650#discussion_r1026258331 ## core/src/main/resources/error/error-classes.json: ## @@ -656,6 +656,11 @@ ], "sqlState" : "42000" }, + "INVALID_EMPTY_LOCATION" : { +"message" : [ + "The location name cannot be empty string or null, but `` was given." Review Comment: Could you add a test when the location is NULL. I just worry that parameters substitution might not handle this. Please, check this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
MaxGekk closed pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE` URL: https://github.com/apache/spark/pull/38644 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
MaxGekk commented on PR #38644: URL: https://github.com/apache/spark/pull/38644#issuecomment-1319797393 +1, LGTM. Merging to master. Thank you, @itholic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
MaxGekk commented on code in PR #38644: URL: https://github.com/apache/spark/pull/38644#discussion_r1026251775 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala: ## @@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with QueryErrorsBase { Decimal("12345678901234567890123456789012345678")) checkExceptionInExpression[ArithmeticException]( cast("123456789012345678901234567890123456789", DecimalType(38, 0)), - "Out of decimal type range") + "NUMERIC_OUT_OF_SUPPORTED_RANGE") Review Comment: > I think we might need to similar test utils for expression test such as checkErrorInExpression something like that. Good point. Could you open an JIRA to not forget about this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #38665: [SPARK-41156][SQL] Remove the class `TypeCheckFailure`
MaxGekk closed pull request #38665: [SPARK-41156][SQL] Remove the class `TypeCheckFailure` URL: https://github.com/apache/spark/pull/38665 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38665: [SPARK-41156][SQL] Remove the class `TypeCheckFailure`
MaxGekk commented on PR #38665: URL: https://github.com/apache/spark/pull/38665#issuecomment-1319784931 > There are still some uses in spark-rapids. I haven't found other uses in other famous repositories ok. Let's leave `TypeCheckFailure` as is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk opened a new pull request, #38712: [WIP][SQL] Parameterized SQL queries
MaxGekk opened a new pull request, #38712: URL: https://github.com/apache/spark/pull/38712 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? By running new tests: ``` $ build/sbt "test:testOnly *PlanParserSuite" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] toujours33 opened a new pull request, #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 opened a new pull request, #38711: URL: https://github.com/apache/spark/pull/38711 ### What changes were proposed in this pull request? ExecutorAllocationManager only record count for speculative task, `stageAttemptToNumSpeculativeTasks` increment when speculative task submit, and only decrement when speculative task end. If task finished before speculative task start, the speculative task will never be scheduled, which will cause leak of `stageAttemptToNumSpeculativeTasks` and mislead the calculation of target executors. This PR fixes the issue by add task index in `SparkListenerSpeculativeTaskSubmitted` event, and record speculative task with task index, when task finished, the speculative task will also decrement. ### Why are the changes needed? To fix idle executors caused by pending speculative task from task that has been finished ### Does this PR introduce _any_ user-facing change? DeveloperApi `SparkListenerSpeculativeTaskSubmitted` add taskIndex with default value -1 ### How was this patch tested? Add a comprehensive unit test. Pass the GA -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org