date:20221118

[GitHub] [spark] williamhyun opened a new pull request, #38724: [SPARK-41202][BUILD] Update ORC to 1.7.7

2022-11-18 Thread GitBox



williamhyun opened a new pull request, #38724:
URL: https://github.com/apache/spark/pull/38724

   ### What changes were proposed in this pull request?
   This PR aims to update ORC to 1.7.7.
   
   ### Why are the changes needed?
   This will bring the latest bug fixes. 
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass the CIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version.

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38693:
URL: https://github.com/apache/spark/pull/38693#issuecomment-1320825239

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] grundprinzip commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-18 Thread GitBox



grundprinzip commented on code in PR #38659:
URL: https://github.com/apache/spark/pull/38659#discussion_r1027046810


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -271,8 +273,12 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformLocalRelation(rel: proto.LocalRelation): LogicalPlan = {
-val attributes = 
rel.getAttributesList.asScala.map(transformAttribute(_)).toSeq
-new org.apache.spark.sql.catalyst.plans.logical.LocalRelation(attributes)
+val (rows, structType) = ArrowConverters.fromBatchWithSchemaIterator(

Review Comment:
   This looks very good. 



##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala:
##
@@ -354,4 +365,16 @@ class SparkConnectPlannerSuite extends SparkFunSuite with 
SparkConnectPlanTest {
   transform(proto.Relation.newBuilder.setSetOp(intersect).build()))
 assert(e2.getMessage.contains("Intersect does not support union_by_name"))
   }
+
+  test("transform LocalRelation") {
+val inputRows = (0 until 10).map(InternalRow(_))

Review Comment:
   The test here is kind of bare bones. Before we fully approve the PR we need 
to extend the test coverage a bit. 



##
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala:
##
@@ -76,21 +72,26 @@ private[sql] object ArrowConverters extends Logging {
   schema: StructType,
   maxRecordsPerBatch: Long,
   timeZoneId: String,
-  context: TaskContext) extends Iterator[Array[Byte]] {
+  context: TaskContext)
+  extends Iterator[Array[Byte]] {

Review Comment:
   why these changes?



##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -271,9 +271,7 @@ message Deduplicate {
 
 // A relation that does not need to be qualified by name.
 message LocalRelation {
-  // (Optional) A list qualified attributes.
-  repeated Expression.QualifiedAttribute attributes = 1;
-  // TODO: support local data.
+  bytes data = 1;

Review Comment:
   Please add a comment mentioning that the data is stored as arrow IPC message 
streams and that since the ipc streams contain the schema we don't need to 
qualify it. 



##
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala:
##
@@ -37,10 +34,9 @@ import 
org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.util.ArrowUtils
-import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, 
ColumnVector}
+import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, 
ColumnarBatch}

Review Comment:
   afaik this change should break scala Style as CB is before CV



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk closed pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078

2022-11-18 Thread GitBox



MaxGekk closed pull request #38696: [SPARK-41175][SQL] Assign a name to the 
error class _LEGACY_ERROR_TEMP_1078
URL: https://github.com/apache/spark/pull/38696


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078

2022-11-18 Thread GitBox



MaxGekk commented on PR #38696:
URL: https://github.com/apache/spark/pull/38696#issuecomment-1320812183

   +1, LGTM. Merging to master.
   Thank you, @panbingkun and @srielau for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] amaliujia commented on pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-18 Thread GitBox



amaliujia commented on PR #38723:
URL: https://github.com/apache/spark/pull/38723#issuecomment-1320811915

   @zhengruifeng @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] amaliujia opened a new pull request, #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-18 Thread GitBox



amaliujia opened a new pull request, #38723:
URL: https://github.com/apache/spark/pull/38723

   
   
   ### What changes were proposed in this pull request?
   
   Implement `DataFrame.SelectExpr` in Python client. `SelectExpr` also has a 
good amount of usage.
   
   ### Why are the changes needed?
   
   API coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38696:
URL: https://github.com/apache/spark/pull/38696#issuecomment-1320808013

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] WangGuangxin opened a new pull request, #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

2022-11-18 Thread GitBox



WangGuangxin opened a new pull request, #38722:
URL: https://github.com/apache/spark/pull/38722

   ### What changes were proposed in this pull request?
   In BytesToBytesMap, the longArray size can be up to `MAX_CAPACITY` instead 
`MAX_CAPACITY/2` since `MAX_CAPACITY` already take `two array entries per key` 
into account.
   
   ### Why are the changes needed?
   To handle larger dataset in BytestoBytesMap
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existings UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Improve multi like performance by creating a balanced expression tree predicate

2022-11-18 Thread GitBox



wangyum commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1320804263

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] panbingkun opened a new pull request, #38721: [WIP][SPARK-41172][SQL] Migrate the ambiguous ref error to an error class

2022-11-18 Thread GitBox



panbingkun opened a new pull request, #38721:
URL: https://github.com/apache/spark/pull/38721

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] WeichenXu123 commented on pull request #38699: [SPARK-41188][CORE][ML] Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes

2022-11-18 Thread GitBox



WeichenXu123 commented on PR #38699:
URL: https://github.com/apache/spark/pull/38699#issuecomment-1320790862

   > If we are setting it in `SparkContext`, do we want to get rid of this from 
other places like `PythonRunner.compute` ?
   
   I think we can remove code in PythonRunner.compute


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya closed pull request #38716: [SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch

2022-11-18 Thread GitBox



viirya closed pull request #38716: [SPARK-X][SS] Use latestCommittedBatchId 
as currentBatchId when resuming late batch
URL: https://github.com/apache/spark/pull/38716


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] Yikun commented on pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest

2022-11-18 Thread GitBox



Yikun commented on PR #38698:
URL: https://github.com/apache/spark/pull/38698#issuecomment-1320737666

   Merge to master, @HyukjinKwon @harupy Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] Yikun closed pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest

2022-11-18 Thread GitBox



Yikun closed pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra 
and replace `list_run_infos` with `search_runs` in mlflow doctest
URL: https://github.com/apache/spark/pull/38698


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on PR #38682:
URL: https://github.com/apache/spark/pull/38682#issuecomment-1320735591

   @wankunde Please fix the PR title and description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1027001153


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -117,4 +117,29 @@ object ExprUtils extends QueryErrorsBase {
   TypeCheckSuccess
 }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. 
These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees 
cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+   expressions: Seq[Expression],
+   expressionCombiner: (Expression, Expression) => Expression): Expression 
= {
+assert(expressions.size > 0)

Review Comment:
   Remove this `assert`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used

2022-11-18 Thread GitBox



viirya commented on code in PR #38719:
URL: https://github.com/apache/spark/pull/38719#discussion_r1026999005


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -345,7 +345,14 @@ trait ProgressReporter extends Logging {
   val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves()
   if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) {
 val execLeafToSource = 
allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap {
-  case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep 
-> source }
+  case (_, ep: MicroBatchScanExec) =>
+// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder 
instance for DSv2
+// streaming source, hence we cannot lookup the actual source from 
the map.
+// The physical node for DSv2 streaming source contains the 
information of the source
+// by itself, so leverage it.
+Some(ep -> ep.stream)

Review Comment:
   Oh, got it. I saw that there is a `onlyDataSourceV2Sources` and a dedicated 
block for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #38698: [SPARK-41186][INFRA][PS][TESTS] Upgrade infra and replace `list_run_infos` with `search_runs` in mlflow doctest

2022-11-18 Thread GitBox



HyukjinKwon commented on PR #38698:
URL: https://github.com/apache/spark/pull/38698#issuecomment-1320711448

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python

2022-11-18 Thread GitBox



HyukjinKwon closed pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix 
out of sync generated files for Python
URL: https://github.com/apache/spark/pull/38718


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python

2022-11-18 Thread GitBox



HyukjinKwon commented on PR #38718:
URL: https://github.com/apache/spark/pull/38718#issuecomment-1320707595

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used

2022-11-18 Thread GitBox



HeartSaVioR commented on code in PR #38719:
URL: https://github.com/apache/spark/pull/38719#discussion_r1026994040


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -345,7 +345,14 @@ trait ProgressReporter extends Logging {
   val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves()
   if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) {
 val execLeafToSource = 
allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap {
-  case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep 
-> source }
+  case (_, ep: MicroBatchScanExec) =>
+// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder 
instance for DSv2
+// streaming source, hence we cannot lookup the actual source from 
the map.
+// The physical node for DSv2 streaming source contains the 
information of the source
+// by itself, so leverage it.
+Some(ep -> ep.stream)

Review Comment:
   So all the problems from streaming query metrics are due to DSv1 which we do 
not have a dedicated logical node / physical node for source.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used

2022-11-18 Thread GitBox



HeartSaVioR commented on code in PR #38719:
URL: https://github.com/apache/spark/pull/38719#discussion_r1026993747


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -345,7 +345,14 @@ trait ProgressReporter extends Logging {
   val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves()
   if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) {
 val execLeafToSource = 
allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap {
-  case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep 
-> source }
+  case (_, ep: MicroBatchScanExec) =>
+// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder 
instance for DSv2
+// streaming source, hence we cannot lookup the actual source from 
the map.
+// The physical node for DSv2 streaming source contains the 
information of the source
+// by itself, so leverage it.
+Some(ep -> ep.stream)

Review Comment:
   We have two different paths - if there are only DSv2 streaming sources, we 
don't even try to match the logical plan and physical plan. We just collect the 
metrics out from physical plan, which is always accurate.
   
   
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L296-L316



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] hvanhovell opened a new pull request, #38720: [SPARK-41165][SPARK-41184][CONNECT] Fix arrow collect (again) and reenable tests.

2022-11-18 Thread GitBox



hvanhovell opened a new pull request, #38720:
URL: https://github.com/apache/spark/pull/38720

   ### What changes were proposed in this pull request?
   The arrow collect code path for connect contains a bug where it would always 
fall back to JSON. This was caused by the assumption that `NonFatal(e)` does 
not match nulls, it unfortunately does. This has been fixed by doing explicit 
null checks and by reordering the checks in 
`SparkConnectStreamHandler.processAsArrowBatches`.
   
   ### Why are the changes needed?
   The previous code had a bug and would always fallback to JSON.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   I added a new test, and I re-enabled the python test disabled in SPARK-41184.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used

2022-11-18 Thread GitBox



viirya commented on code in PR #38719:
URL: https://github.com/apache/spark/pull/38719#discussion_r1026993007


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -345,7 +345,14 @@ trait ProgressReporter extends Logging {
   val allExecPlanLeaves = lastExecution.executedPlan.collectLeaves()
   if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) {
 val execLeafToSource = 
allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap {
-  case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep 
-> source }
+  case (_, ep: MicroBatchScanExec) =>
+// SPARK-41199: `logicalPlanLeafToSource` contains OffsetHolder 
instance for DSv2
+// streaming source, hence we cannot lookup the actual source from 
the map.
+// The physical node for DSv2 streaming source contains the 
information of the source
+// by itself, so leverage it.
+Some(ep -> ep.stream)

Review Comment:
   Why it is only for DS v1 mixing with DS v2? Seems DS v2 stream source always 
cannot be matched by `logicalPlanLeafToSource` because it is `OffsetHolder`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #36695: [SPARK-38474][CORE] Use error class in org.apache.spark.security

2022-11-18 Thread GitBox



github-actions[bot] closed pull request #36695: [SPARK-38474][CORE] Use error 
class in org.apache.spark.security
URL: https://github.com/apache/spark/pull/36695


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #36767: [SPARK-39363][K8S] Deprecate k8s memory overhead and make it optional

2022-11-18 Thread GitBox



github-actions[bot] commented on PR #36767:
URL: https://github.com/apache/spark/pull/36767#issuecomment-1320692121

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files

2022-11-18 Thread GitBox



github-actions[bot] closed pull request #37359: [SPARK-25342][CORE][SQL]Support 
rolling back a result stage and rerunning all result tasks when writing files
URL: https://github.com/apache/spark/pull/37359


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-11-18 Thread GitBox



github-actions[bot] closed pull request #37129: [SPARK-39710][SQL] Support push 
local topK through outer join
URL: https://github.com/apache/spark/pull/37129


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #37460: [WIP][SPARK-40031][SQL] Remove unnecessary TryEval in TryCast

2022-11-18 Thread GitBox



github-actions[bot] commented on PR #37460:
URL: https://github.com/apache/spark/pull/37460#issuecomment-1320692056

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026987556


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase {
   TypeCheckSuccess
 }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. 
These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees 
cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+   patterns: Seq[Expression],

Review Comment:
   `patterns` -> `expressions`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on pull request #38716: [SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch

2022-11-18 Thread GitBox



viirya commented on PR #38716:
URL: https://github.com/apache/spark/pull/38716#issuecomment-1320636937

   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liuzqt commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollec

2022-11-18 Thread GitBox



liuzqt commented on code in PR #38704:
URL: https://github.com/apache/spark/pull/38704#discussion_r1026966457


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest
   with SharedSparkSession {
 
   override protected def sparkConf: SparkConf = 
super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
-  test("collect data with single partition larger than 2GB bytes array limit") 
{
+  // SPARK-41193: Ignore this suite because it cannot run successfully with 
Spark
+  // default Java Options, if user need do local test, please make the 
following changes:
+  // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in 
`sql/core/pom.xml` to `-Xmx10g`
+  // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` 
to `-Xmx10g`
+  ignore("collect data with single partition larger than 2GB bytes array 
limit") {

Review Comment:
   Yes @LuciferYang  is right, need to change `-Xmx4g` to `-Xmx10g` to make it 
work (it works for both shared local session and local cluster, but without the 
change neither work).  
   
   Thanks for the fix!  Previously I only tested this using IDE and I guess it 
increased the mem under the hood..Sorry for the inconvenience.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38702: [SPARK-41187][Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38702:
URL: https://github.com/apache/spark/pull/38702#issuecomment-1320614168

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38703:
URL: https://github.com/apache/spark/pull/38703#issuecomment-1320614138

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] tedyu commented on pull request #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release

2022-11-18 Thread GitBox



tedyu commented on PR #38715:
URL: https://github.com/apache/spark/pull/38715#issuecomment-1320568475

   @HeartSaVioR 
   Can you take a look ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #38719: [SPARK-41199][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used

2022-11-18 Thread GitBox



HeartSaVioR commented on PR #38719:
URL: https://github.com/apache/spark/pull/38719#issuecomment-1320531883

   cc. @zsxwing @viirya @xuanyuanking Please take a look. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-18 Thread GitBox



HeartSaVioR commented on PR #38717:
URL: https://github.com/apache/spark/pull/38717#issuecomment-1320531472

   cc. @zsxwing @cloud-fan @viirya Please take a look. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] xkrogen commented on pull request #35969: [SPARK-38651][SQL] Add configuration to support writing out empty schemas in supported filebased datasources

2022-11-18 Thread GitBox



xkrogen commented on PR #35969:
URL: https://github.com/apache/spark/pull/35969#issuecomment-1320527156

   @cloud-fan , any more concerns on this approach based on what @thejdeep 
shared?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR opened a new pull request, #38719: [SPARK-41999][SS] Fix metrics issue when DSv1 streaming source and DSv2 streaming source are co-used

2022-11-18 Thread GitBox



HeartSaVioR opened a new pull request, #38719:
URL: https://github.com/apache/spark/pull/38719

   ### What changes were proposed in this pull request?
   
   This PR proposes to fix the metrics issue for streaming query when DSv1 
streaming source and DSv2 streaming source are co-used. If the streaming query 
has both DSv1 streaming source and DSv2 streaming source, only DSv1 streaming 
source produced correct metrics.
   
   There is a bug in ProgressReporter that it tries to match logical node for 
DSv2 streaming source with OffsetHolder, which will be never matched. Given 
that physical node for DSv2 streaming source contains both source information 
and metrics, we can simply deduce all the necessary information from the 
physical node rather than trying to find the source from association map.
   
   ### Why are the changes needed?
   
   The logic of collecting metrics does not collect metrics for DSv2 streaming 
sources properly.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New test case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] amaliujia commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python

2022-11-18 Thread GitBox



amaliujia commented on PR #38718:
URL: https://github.com/apache/spark/pull/38718#issuecomment-1320510474

   @zhengruifeng @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] amaliujia opened a new pull request, #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python

2022-11-18 Thread GitBox



amaliujia opened a new pull request, #38718:
URL: https://github.com/apache/spark/pull/38718

   
   
   ### What changes were proposed in this pull request?
   
   Fix out of sync generated files for Python.
   
   This happens on a rare case for protobuf version change. There were 
something not generated before but with the protobuf version downgraded that 
was generated (and this is why there was no merge conflict). However the PR was 
based on old code before https://github.com/apache/spark/pull/38638 so the 
protobuf generates based on stale code which leads to stale generated files. 
   
   
   
   ### Why are the changes needed?
   
   Fix out of sync generated files for Python.
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078

2022-11-18 Thread GitBox



MaxGekk commented on PR #38696:
URL: https://github.com/apache/spark/pull/38696#issuecomment-1320493195

   cc @srielau


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR opened a new pull request, #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-18 Thread GitBox



HeartSaVioR opened a new pull request, #38717:
URL: https://github.com/apache/spark/pull/38717

   ### What changes were proposed in this pull request?
   
   This PR proposes to fix the broken metrics when the streaming query has CTE, 
via applying InlineCTE manually against analyzed plan when collecting metrics.
   
   Suppose a streaming query contains below part as batch side which is joined 
with streaming source:
   
   ```
   with batch_tbl as (
 SELECT col1, col2 FROM parquet_tbl
   )
   
   SELECT col1 AS key, col2 as value_batch FROM batch_tbl
   ```
   
   Currently, Spark adds WithCTE node with CTERelationDef and CTERelationRef 
when there is a usage of CTE. Below is an analyzed plan:
   
   ```
   WriteToMicroBatchDataSource MemorySink, 
2cbb2afa-6513-4a23-b4c2-37910fc9cdf9, Append, 0
   +- Project [key#15, value_stream#16, value_batch#9L]
  +- Join Inner, (cast(key#15 as bigint) = key#8L)
 :- SubqueryAlias spark_catalog.default.parquet_streaming_tbl
 :  +- Project [key#55 AS key#15, value_stream#56 AS value_stream#16]
 : +- Relation 
spark_catalog.default.parquet_streaming_tbl[key#55,value_stream#56] parquet
 +- WithCTE
:- CTERelationDef 0, false
:  +- SubqueryAlias batch_tbl
: +- Project [col1#10L, col2#11L]
:+- SubqueryAlias spark_catalog.default.parquet_tbl
:   +- Relation 
spark_catalog.default.parquet_tbl[col1#10L,col2#11L] parquet
+- Project [col1#10L AS key#8L, col2#11L AS value_batch#9L]
   +- SubqueryAlias batch_tbl
  +- CTERelationRef 0, true, [col1#10L, col2#11L]
   ```
   
   Here, there are 3 leaf nodes in the plan, but the actual sources in the leaf 
nodes are 2. During the optimization, inlining CTE happens and there are 2 leaf 
nodes. Below is the optimized plan:
   
   ```
   WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: 
org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@622c7c7f]
   +- Project [key#55, value_stream#56, value_batch#9L]
  +- Join Inner, (cast(key#55 as bigint) = key#8L)
 :- Filter isnotnull(key#55)
 :  +- Relation 
spark_catalog.default.parquet_streaming_tbl[key#55,value_stream#56] parquet
 +- Project [col1#10L AS key#8L, col2#11L AS value_batch#9L]
+- Filter isnotnull(col1#10L)
   +- Relation spark_catalog.default.parquet_tbl[col1#10L,col2#11L] 
parquet
   ```
   
   Hence executed plan will also have 2 leaf nodes, which does not match with 
the number of leaf nodes in analyzed plan, and ProgressReporter will give up 
collecting metrics.
   
   Applying InlineCTE against analyzed plan during collecting metrics would 
resolve this. For example, below is the logical plan which applies InlineCTE 
against above analyzed plan.
   
   ```
   WriteToMicroBatchDataSource MemorySink, 
2cbb2afa-6513-4a23-b4c2-37910fc9cdf9, Append, 0
   +- Project [key#15, value_stream#16, value_batch#9L]
  +- Join Inner, (cast(key#15 as bigint) = key#8L)
 :- SubqueryAlias spark_catalog.default.parquet_streaming_tbl
 :  +- Project [key#55 AS key#15, value_stream#56 AS value_stream#16]
 : +- Relation 
spark_catalog.default.parquet_streaming_tbl[key#55,value_stream#56] parquet
 +- Project [col1#10L AS key#8L, col2#11L AS value_batch#9L]
+- SubqueryAlias batch_tbl
   +- SubqueryAlias batch_tbl
  +- Project [col1#10L, col2#11L]
 +- SubqueryAlias spark_catalog.default.parquet_tbl
+- Relation 
spark_catalog.default.parquet_tbl[col1#10L,col2#11L] parquet
   ```
   
   Note that this is only required for the case where there is at least one of 
DSv1 streaming source in the streaming query. If streaming query only contains 
DSv2 data sources as streaming sources, ProgressReporter can just pick up 
dedicated physical node(s) from executed plan.
   
   ### Why are the changes needed?
   
   The metrics in streaming query are broken if the query contains CTE. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New test case.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya opened a new pull request, #38716: [SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch

2022-11-18 Thread GitBox



viirya opened a new pull request, #38716:
URL: https://github.com/apache/spark/pull/38716

   
   
   ### What changes were proposed in this pull request?
   
   
   This patch changes `currentBatchId` when `MicroBatchExecution` tries to 
resume from late batch from offset log. Previously it takes `latestBatchId` 
from offset log. This patch changes it to `latestCommittedBatchId`.
   
   ### Why are the changes needed?
   
   
   We have customer streaming job which is unable to restart from failed status 
while it failed to commit delta files. For example, if previous run failed to 
commit 14.delta, when the job restarted, it tried to read 14.delta. Because 
14.delta doesn't exist (not committed), the job cannot be restarted and resume 
from late batch.
   
   When `MicroBatchExecution` populates start offsets, it reads late batch id 
(`latestBatchId) from offset log and committed batch id 
(`latestCommittedBatchId`) from commit log. Currently if 
`latestCommittedBatchId` == `latestBatchId` - 1, it means that we resume from 
late batch. But it uses `latestBatchId` as `currentBatchId` to run batch. 
Obviously, `latestBatchId` is 14 for above example, and 
`latestCommittedBatchId` is 13. 
   
   Because `IncrementalExecution` uses `currentBatchId` to load checkpointed 
states, it tries to load version 14 of delta files. But version 14 is not 
committed in late run.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   ### How was this patch tested?
   
   
   Existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] geofflangenderfer commented on pull request #4093: [SPARK-5307] SerializationDebugger to help debug NotSerializableException

2022-11-18 Thread GitBox



geofflangenderfer commented on PR #4093:
URL: https://github.com/apache/spark/pull/4093#issuecomment-1320457789

   could someone give a simple example of how to read the graph? I'm not sure 
where to start


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching

2022-11-18 Thread GitBox



ryan-johnson-databricks commented on code in PR #38692:
URL: https://github.com/apache/spark/pull/38692#discussion_r1026804602


##
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala:
##
@@ -192,6 +192,23 @@ class SparkSessionExtensionSuite extends SparkFunSuite 
with SQLHelper {
 testInjectColumnar(false)
   }
 
+  test("inject plan normalization rules") {
+val extensions = create { extensions =>
+  extensions.injectPlanNormalizationRules { session =>
+org.apache.spark.sql.catalyst.optimizer.PushDownPredicates
+  }
+}
+withSession(extensions) { session =>
+  import session.implicits._
+  val df = Seq((1, "a"), (2, "b")).toDF("i", "s")
+  df.select("i").filter($"i" > 1).cache()
+  assert(df.filter($"i" > 1).select("i").queryExecution.executedPlan.find {
+case _: org.apache.spark.sql.execution.columnar.InMemoryTableScanExec 
=> true
+case _ => false

Review Comment:
   Should we add a negative test that verifies this? Might be overkill...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk closed pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions

2022-11-18 Thread GitBox



MaxGekk closed pull request #38705: [SPARK-41173][SQL] Move `require()` out 
from the constructors of string expressions
URL: https://github.com/apache/spark/pull/38705


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions

2022-11-18 Thread GitBox



MaxGekk commented on PR #38705:
URL: https://github.com/apache/spark/pull/38705#issuecomment-1320429350

   +1, LGTM. Merging to master.
   Thank you, @LuciferYang.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] tedyu opened a new pull request, #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release

2022-11-18 Thread GitBox



tedyu opened a new pull request, #38715:
URL: https://github.com/apache/spark/pull/38715

   ### What changes were proposed in this pull request?
   This PR upgrades Kafka to 3.3.0 release.
   
   ### Why are the changes needed?
   Kafka 3.3.0 release has new features along with bug fixes: 
https://www.confluent.io/blog/apache-kafka-3-3-0-new-features-and-updates/
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing test suite


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] ahshahid opened a new pull request, #38714: [WIP][SPARK-41141]. avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it

2022-11-18 Thread GitBox



ahshahid opened a new pull request, #38714:
URL: https://github.com/apache/spark/pull/38714

   ### What changes were proposed in this pull request?
   This is a PR for improvement
   When a subquery references the outer query's aggregate functions,  in some 
cases, it ends up introducing extra aggregate functions which are not needed. 
Though they would get eventually eliminated in the optimizer, but atleast in 
analyzer phase would add an extra project node etc.
   The change is in the code of identification of OuterReference in 
subquery.scala.
   Currently whenever an aggregate expression is found, it is assumed to be the 
Outer Reference.
   With this change,  the code checks whether the parent Expression can also be 
potentially part of the OuterReference too.
   So if we consider a query
   select cos (sum (a) ) , b from t1 having exists select 1 from t2 where x = 
cos ( sum(a) ) 
   
   the OuterReference detected would be cos ( sum(a) ) instead of just sum(a).
   As a result, no extra aggregate would be added.
   
   
   ### Why are the changes needed?
   To avoid adding unnecessary aggregate in outer query thereby reducing the 
number of expressions to analyze, clone and also avoid adding an extra project 
node.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Ran the precheckin tests and added new tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] amaliujia commented on pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version.

2022-11-18 Thread GitBox



amaliujia commented on PR #38693:
URL: https://github.com/apache/spark/pull/38693#issuecomment-1320392539

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] hvanhovell closed pull request #38693: [SPARK-41196] [CONNECT] Homogenize the protobuf version across the Spark connect server to use the same major version.

2022-11-18 Thread GitBox



hvanhovell closed pull request #38693: [SPARK-41196] [CONNECT] Homogenize the 
protobuf version across the Spark connect server to use the same major version.
URL: https://github.com/apache/spark/pull/38693


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] otterc commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-18 Thread GitBox



otterc commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1320389162

   Looks good to me


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38707:
URL: https://github.com/apache/spark/pull/38707#issuecomment-1320346254

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] EnricoMi commented on a diff in pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

2022-11-18 Thread GitBox



EnricoMi commented on code in PR #38676:
URL: https://github.com/apache/spark/pull/38676#discussion_r1026695873


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -1938,7 +1940,10 @@ case class LateralJoin(
 joinType: JoinType,
 condition: Option[Expression]) extends UnaryNode {
 
-  require(Seq(Inner, LeftOuter, Cross).contains(joinType),
+  require(Seq(Inner, LeftOuter, Cross).contains(joinType match {

Review Comment:
   just needed by 
`sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/DeduplicateRelationsSuite.scala`:
   
   val originalQuery = left.lateralJoin(right, UsingJoin(Inner, Seq("a")))



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

2022-11-18 Thread GitBox



EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1320300231

   Problem is that `DeduplicateRelations` is only considering duplicates 
between left `output` and right `output`, and not duplicates between left 
`references` and right `output`. I have sketched a fix for `Join` and 
`LateralJoin`, including a proper test.
   
   This could potentially be done for all operators specifically handled in 
`DeduplicateRelations.apply()`, i.e. `AsOfJoin`, `Intersect`, `Except`, `Union` 
and `MergeIntoTable`.
   
   Deduplicating attributes that are already referenced will break the plan as 
those references break.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] mridulm commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultColle

2022-11-18 Thread GitBox



mridulm commented on code in PR #38704:
URL: https://github.com/apache/spark/pull/38704#discussion_r1026679895


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest
   with SharedSparkSession {
 
   override protected def sparkConf: SparkConf = 
super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
-  test("collect data with single partition larger than 2GB bytes array limit") 
{
+  // SPARK-41193: Ignore this suite because it cannot run successfully with 
Spark
+  // default Java Options, if user need do local test, please make the 
following changes:
+  // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in 
`sql/core/pom.xml` to `-Xmx10g`
+  // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` 
to `-Xmx10g`
+  ignore("collect data with single partition larger than 2GB bytes array 
limit") {

Review Comment:
   @liuzqt, I know this was iterated on multiple times to get it to work - 
instead of the shared local spark session, did it work locally when using a 
local spark cluster instead ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] mridulm commented on pull request #38699: [SPARK-41188][CORE][ML] Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes

2022-11-18 Thread GitBox



mridulm commented on PR #38699:
URL: https://github.com/apache/spark/pull/38699#issuecomment-1320294062

   If we are setting it in `SparkContext`, do we want to get rid of this from 
other places like `PythonRunner.compute` ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] srielau commented on a diff in pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children

2022-11-18 Thread GitBox



srielau commented on code in PR #38713:
URL: https://github.com/apache/spark/pull/38713#discussion_r1026672022


##
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##
@@ -697,12 +697,12 @@ setQuantifier
 ;
 
 relation
-: LATERAL? relationPrimary joinRelation*
+: LATERAL? relationPrimary pivotClause? unpivotClause? joinRelation*
 ;
 
 joinRelation
-: (joinType) JOIN LATERAL? right=relationPrimary joinCriteria?
-| NATURAL joinType JOIN LATERAL? right=relationPrimary
+: (joinType) JOIN LATERAL? right=relationPrimary joinCriteria? 
pivotClause? unpivotClause?
+| NATURAL joinType JOIN LATERAL? right=relationPrimary pivotClause? 
unpivotClause?

Review Comment:
   Have you tried simply:
   ```suggestion
   : (joinType) JOIN LATERAL? right=relationPrimary joinCriteria? 
pivotClause? unpivotClause?
   | NATURAL joinType JOIN LATERAL? right=relationPrimary
   | pivotClause
   | unpivotClause
   ```
   
   Also removing the relation entries above?



##
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##
@@ -697,12 +697,12 @@ setQuantifier
 ;
 
 relation
-: LATERAL? relationPrimary joinRelation*
+: LATERAL? relationPrimary pivotClause? unpivotClause? joinRelation*

Review Comment:
   This doesn't look right.
   As you say PIVOT and UNPIVOT are like JOINs, so why are do you have this 
strange ordered optional clauses instead of merging them in as if they were 
another type  of join. 
   
   I can't put my finger on exactly how it breaks without running a bunch of 
tests.
   For example I should be able to chain a string UNPIVOTs and PIVOTs in any 
order without the need for a JOIN (or a braces) in between. I don't see how 
that works here.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] antonipp commented on a diff in pull request #38376: [SPARK-40817] [Kubernetes] Do not discard remote user-specified files when launching Spark jobs on Kubernetes

2022-11-18 Thread GitBox



antonipp commented on code in PR #38376:
URL: https://github.com/apache/spark/pull/38376#discussion_r1026638180


##
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##
@@ -1609,6 +1609,16 @@ class TestFileSystem extends 
org.apache.hadoop.fs.LocalFileSystem {
   }
 
   override def open(path: Path): FSDataInputStream = super.open(local(path))
+
+  // No-op methods
+
+  override def copyFromLocalFile(

Review Comment:
   Well, it took me a a bit of time to get to it but I finally have a working 
local setup and I wrote an integration test for this PR. I added it in 
https://github.com/apache/spark/pull/38376/commits/cbda5403b00f4e66ad4651ceda7786de1f5e1e1a
   
   It worked for me locally with:
   ```bash
   mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
   -Pkubernetes -Pkubernetes-integration-tests 
-Phadoop-3 \
   
-Dspark.kubernetes.test.sparkTgz=/path/to/the/tgz/built/from/this/branch \
   -Dspark.kubernetes.test.deployMode=minikube \
   
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark \
   -Dspark.kubernetes.test.namespace=spark \
   -Dspark.kubernetes.test.serviceAccountName=spark
   ```
   
   @dongjoon-hyun @holdenk, would appreciate your reviews as well 🙏 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38710:
URL: https://github.com/apache/spark/pull/38710#issuecomment-1320213406

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-18 Thread GitBox



AmplabJenkins commented on PR #38711:
URL: https://github.com/apache/spark/pull/38711#issuecomment-1320213339

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching

2022-11-18 Thread GitBox



cloud-fan commented on code in PR #38692:
URL: https://github.com/apache/spark/pull/38692#discussion_r1026569958


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala:
##
@@ -217,6 +218,22 @@ class SparkSessionExtensions {
 checkRuleBuilders += builder
   }
 
+  private[this] val planNormalizationRules = mutable.Buffer.empty[RuleBuilder]
+
+  def buildPlanNormalizationRules(session: SparkSession): 
Seq[Rule[LogicalPlan]] = {
+planNormalizationRules.map(_.apply(session)).toSeq

Review Comment:
   I'm following the existing code style in this file. I assume the reason is 
people who are not familiar with Scala may be confused when reading the code 
`.map(_(session))`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching

2022-11-18 Thread GitBox



ryan-johnson-databricks commented on code in PR #38692:
URL: https://github.com/apache/spark/pull/38692#discussion_r1026540525


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala:
##
@@ -217,6 +218,22 @@ class SparkSessionExtensions {
 checkRuleBuilders += builder
   }
 
+  private[this] val planNormalizationRules = mutable.Buffer.empty[RuleBuilder]
+
+  def buildPlanNormalizationRules(session: SparkSession): 
Seq[Rule[LogicalPlan]] = {
+planNormalizationRules.map(_.apply(session)).toSeq

Review Comment:
   nit: isn't `.apply(...)` redundant with just `(...)` ?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala:
##
@@ -105,12 +105,30 @@ class QueryExecution(
 case other => other
   }
 
+  lazy val normalized: LogicalPlan = {
+val normalizationRules = sparkSession.sessionState.planNormalizationRules
+if (normalizationRules.isEmpty) {
+  commandExecuted
+} else {
+  val planChangeLogger = new PlanChangeLogger[LogicalPlan]()
+  val normalized = normalizationRules.foldLeft(commandExecuted)((p, rule) 
=> {

Review Comment:
   nit: `(... => { ... })` is equivalent to just `{ ... => ... }` in this 
context



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children

2022-11-18 Thread GitBox



cloud-fan commented on code in PR #38713:
URL: https://github.com/apache/spark/pull/38713#discussion_r1026557894


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/UnpivotParserSuite.scala:
##
@@ -192,4 +193,131 @@ class UnpivotParserSuite extends AnalysisTest {
 )
   }
 
+  test("unpivot - with joins") {

Review Comment:
   I didn't add tests for pivot because:
   1. there is no pivot parser suite
   2. pivot/unpivot syntax is exactly the same regarding joins, no need to test 
both



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children

2022-11-18 Thread GitBox



cloud-fan commented on PR #38713:
URL: https://github.com/apache/spark/pull/38713#issuecomment-1320161201

   cc @viirya @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan opened a new pull request, #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children

2022-11-18 Thread GitBox



cloud-fan opened a new pull request, #38713:
URL: https://github.com/apache/spark/pull/38713

   
   
   ### What changes were proposed in this pull request?
   
   Today, our SQL parser only supports PIVOT/UNPIVOT at the end of the FROM 
clause. This is quite limited and it's better to allow PIVOT/UNPIVOT in the 
join children as well. As a reference, snowflake supports it: 
https://docs.snowflake.com/en/sql-reference/constructs/from.html
   
   This PR supports the following SQL syntaxes:
   ```
   FROM t1 PIVOT/UNPIVOT ... JOIN t2  // pivot/unpivot the left table
   FROM t1 JOIN t2 PIVOT/UNPIVOT ...  // pivot/unpivot the join result. This is 
the same before this PR
   FROM t1 JOIN (t2 PIVOT/UNPIVOT ...)  // pivot/unpivot the right table
   ```
   
   
   ### Why are the changes needed?
   
   make PIVOT/UNPIVOT syntax more flexible
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, new SQL syntax without any breaking change
   
   ### How was this patch tested?
   
   new parser tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026534660


##
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/LikeAnyBenchmark.scala:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.Random
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Benchmark to measure like any expressions performance.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class 
+ * --jars , 
+ *   2. build/sbt "sql/Test/runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt 
"sql/Test/runMain "
+ *  Results will be written to "benchmarks/LikeAnyBenchmark-results.txt".
+ * }}}
+ */
+object LikeAnyBenchmark extends SqlBasedBenchmark {

Review Comment:
   Please remove this benchmark test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026531799


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LikeSimplificationSuite.scala:
##
@@ -207,11 +207,17 @@ class LikeSimplificationSuite extends PlanTest {
 
 val optimized = Optimize.execute(originalQuery.analyze)
 val correctAnswer = testRelation
-  .where((StartsWith($"a", "abc") || EndsWith($"a", "xyz")) ||
-(Length($"a") >= 6 && (StartsWith($"a", "abc") && EndsWith($"a", 
"def" ||
-Contains($"a", "mn")) || ($"a" === "")) || ($"a" === "abc")) ||
-($"a" likeAny("abc\\%", "abc\\%def", "%mn\\%")))
-  .analyze
+  .where(
+(
+  (

Review Comment:
   Please fix the format.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #38075: [WIP][SPARK-40633][BUILD] Upgrade janino to 3.1.8

2022-11-18 Thread GitBox



LuciferYang commented on code in PR #38075:
URL: https://github.com/apache/spark/pull/38075#discussion_r1026500589


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -1310,7 +1310,7 @@ case class CatalystToExternalMap private(
 
 val tupleClass = classOf[(_, _)].getName
 val appendToBuilder = s"""
-  $tupleClass $tupleLoopValue;
+  $tupleClass $tupleLoopValue = null;

Review Comment:
   The root cause of the change has not been investigated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #38075: [WIP][SPARK-40633][BUILD] Upgrade janino to 3.1.8

2022-11-18 Thread GitBox



LuciferYang commented on code in PR #38075:
URL: https://github.com/apache/spark/pull/38075#discussion_r1026497690


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -1310,7 +1310,7 @@ case class CatalystToExternalMap private(
 
 val tupleClass = classOf[(_, _)].getName
 val appendToBuilder = s"""
-  $tupleClass $tupleLoopValue;
+  $tupleClass $tupleLoopValue = null;

Review Comment:
   init to null for fix
   
   ```
   /* 063 */ scala.Tuple2 tupleLoopValue_0;
   /* 064 */
   /* 065 */ if (false) {
   /* 066 */   tupleLoopValue_0 = new 
scala.Tuple2(value_CatalystToExternalMap_key_lambda_variable_1, null);
   /* 067 */ } else {
   /* 068 */   tupleLoopValue_0 = new 
scala.Tuple2(value_CatalystToExternalMap_key_lambda_variable_1, 
value_CatalystToExternalMap_value_lambda_variable_2);
   /* 069 */ }
   /* 070 */
   /* 071 */ builderValue_0.$plus$eq(tupleLoopValue_0);
   /* 072 */
   ```
   
   ```
   Caused by: org.codehaus.commons.compiler.InternalCompilerException: File 
'generated.java', Line 71, Column 16: Compiling 
"builderValue_0.$plus$eq(tupleLoopValue_0)"
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5819)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:4053)
at org.codehaus.janino.UnitCompiler.access$6100(UnitCompiler.java:236)
at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:4028)
at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:4003)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5470)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:4003)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2487)
... 155 more
   Caused by: org.codehaus.commons.compiler.InternalCompilerException: File 
'generated.java', Line 71, Column 25: Compiling "tupleLoopValue_0"
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5819)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5323)
at org.codehaus.janino.UnitCompiler.access$9300(UnitCompiler.java:236)
at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4698)
at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4674)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5470)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4674)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5817)
... 162 more
   Caused by: org.codehaus.commons.compiler.InternalCompilerException: Invalid 
local variable index 10
at 
org.codehaus.janino.UnitCompiler.getLocalVariableTypeInfo(UnitCompiler.java:13498)
at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:12500)
at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:12475)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4745)
at org.codehaus.janino.UnitCompiler.access$8200(UnitCompiler.java:236)
at 
org.codehaus.janino.UnitCompiler$16$1.visitLocalVariableAccess(UnitCompiler.java:4684)
at 
org.codehaus.janino.UnitCompiler$16$1.visitLocalVariableAccess(UnitCompiler.java:4678)
at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:4661)
at 
org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4678)
at 
org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4674)
at org.codehaus.janino.Java$Lvalue.accept(Java.java:4528)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4674)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4741)
at org.codehaus.janino.UnitCompiler.access$7700(UnitCompiler.java:236)
at 
org.codehaus.janino.UnitCompiler$16$1.visitAmbiguousName(UnitCompiler.java:4679)
at 
org.codehaus.janino.UnitCompiler$16$1.visitAmbiguousName(UnitCompiler.java:4678)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4603)
at 
org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4678)
at 
org.codehaus.janino.UnitCompiler$16.visitLvalue(UnitCompiler.java:4674)
at org.codehaus.janino.Java$Lvalue.accept(Java.java:4528)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4674)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5817)
... 169 more
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions

2022-11-18 Thread GitBox



LuciferYang commented on code in PR #38705:
URL: https://github.com/apache/spark/pull/38705#discussion_r1026475352


##
sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out:
##
@@ -14,7 +29,7 @@ select format_string()
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-requirement failed: format_string() should take at least 1 argument; line 1 
pos 7
+0; line 1 pos 7

Review Comment:
   fixed 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions

2022-11-18 Thread GitBox



LuciferYang commented on code in PR #38705:
URL: https://github.com/apache/spark/pull/38705#discussion_r1026474420


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -1662,8 +1675,7 @@ case class StringRPad(str: Expression, len: Expression, 
pad: Expression = Litera
 // scalastyle:on line.size.limit
 case class FormatString(children: Expression*) extends Expression with 
ImplicitCastInputTypes {
 
-  require(children.nonEmpty, s"$prettyName() should take at least 1 argument")
-  if (!SQLConf.get.getConf(SQLConf.ALLOW_ZERO_INDEX_IN_FORMAT_STRING)) {
+  if (children.nonEmpty && 
!SQLConf.get.getConf(SQLConf.ALLOW_ZERO_INDEX_IN_FORMAT_STRING)) {

Review Comment:
   My mistake, need to add nonEmpty condition
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026470585


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase {
   TypeCheckSuccess
 }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. 
These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees 
cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+   patterns: Seq[Expression],
+   expressionCombiner: (Expression, Expression) => Expression): Expression 
= {
+assert(patterns.size > 0)
+var res = patterns
+while (res.size > 1) {
+  res = res.sliding(2, 2).toSeq
+.map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) 
else tup(0))
+}
+res.head

Review Comment:
   It seems that the original implementation is better than this one?
   ```scala
 def reduceToExpressionTree1(
 patterns: Seq[Expression],
 expressionCombiner: (Expression, Expression) => Expression): 
Expression = {
   var res = patterns
   while (res.size > 1) {
 res = res.sliding(2, 2).toSeq
   .map(tup => if (tup.size == 2) expressionCombiner(tup.head, 
tup.last) else tup(0))
   }
   res.head
 }
   
 def reduceToExpressionTree2(
 expressions: Seq[Expression],
 expressionCombiner: (Expression, Expression) => Expression): 
Expression = {
   def reduceToExpressionTree(low: Int, high: Int): Expression = high - low 
match {
 case 0 =>
   expressions(low)
 case 1 =>
   expressionCombiner(expressions(low), expressions(high))
 case x =>
   val mid = low + x / 2
   expressionCombiner(
 reduceToExpressionTree(low, mid),
 reduceToExpressionTree(mid + 1, high))
   }
   reduceToExpressionTree(0, expressions.size - 1)
 }
   
   Seq(1, 10, 100).foreach { N =>
 val expressions = Range(1, N).map(Literal(_))
   
 val benchmark = new Benchmark(s"Benchmark reduceToExpressionTree with 
$N elements", N, minNumIters = 10)
 benchmark.addCase("reduceToExpressionTree1") { _ =>
   reduceToExpressionTree1(expressions, Or.apply)
 }
   
 benchmark.addCase("reduceToExpressionTree2") { _ =>
   reduceToExpressionTree2(expressions, Or.apply)
 }
   
 benchmark.run()
   }
   ```
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   Benchmark reduceToExpressionTree with 100 elements:  Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
   
--
   reduceToExpressionTree1   823
   1012 212  1.2 822.5   1.0X
   reduceToExpressionTree2   150
183  25  6.7 149.7   5.5X
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026470585


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -117,4 +117,23 @@ object ExprUtils extends QueryErrorsBase {
   TypeCheckSuccess
 }
   }
+
+  /**
+   * Combine a number of boolean expressions into a balanced expression tree. 
These expressions are
+   * either combined by a logical [[And]] or a logical [[Or]].
+   *
+   * A balanced binary tree is created because regular left recursive trees 
cause considerable
+   * performance degradations and can cause stack overflows.
+   */
+  def reduceToExpressionTree(
+   patterns: Seq[Expression],
+   expressionCombiner: (Expression, Expression) => Expression): Expression 
= {
+assert(patterns.size > 0)
+var res = patterns
+while (res.size > 1) {
+  res = res.sliding(2, 2).toSeq
+.map(tup => if (tup.size == 2) expressionCombiner(tup.head, tup.last) 
else tup(0))
+}
+res.head

Review Comment:
   seems that the original implementation is better than this one?
   ```scala
 def reduceToExpressionTree1(
 patterns: Seq[Expression],
 expressionCombiner: (Expression, Expression) => Expression): 
Expression = {
   var res = patterns
   while (res.size > 1) {
 res = res.sliding(2, 2).toSeq
   .map(tup => if (tup.size == 2) expressionCombiner(tup.head, 
tup.last) else tup(0))
   }
   res.head
 }
   
 def reduceToExpressionTree2(
 expressions: Seq[Expression],
 expressionCombiner: (Expression, Expression) => Expression): 
Expression = {
   def reduceToExpressionTree(low: Int, high: Int): Expression = high - low 
match {
 case 0 =>
   expressions(low)
 case 1 =>
   expressionCombiner(expressions(low), expressions(high))
 case x =>
   val mid = low + x / 2
   expressionCombiner(
 reduceToExpressionTree(low, mid),
 reduceToExpressionTree(mid + 1, high))
   }
   reduceToExpressionTree(0, expressions.size - 1)
 }
   
   Seq(1, 10, 100).foreach { N =>
 val expressions = Range(1, N).map(Literal(_))
   
 val benchmark = new Benchmark(s"Benchmark reduceToExpressionTree with 
$N elements", N, minNumIters = 10)
 benchmark.addCase("reduceToExpressionTree1") { _ =>
   reduceToExpressionTree1(expressions, Or.apply)
 }
   
 benchmark.addCase("reduceToExpressionTree2") { _ =>
   reduceToExpressionTree2(expressions, Or.apply)
 }
   
 benchmark.run()
   }
   ```
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 10.16
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   Benchmark reduceToExpressionTree with 100 elements:  Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
   
--
   reduceToExpressionTree1   823
   1012 212  1.2 822.5   1.0X
   reduceToExpressionTree2   150
183  25  6.7 149.7   5.5X
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-18 Thread GitBox



LuciferYang commented on PR #38711:
URL: https://github.com/apache/spark/pull/38711#issuecomment-1320030900

   cc @mridulm @Ngone51 FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching

2022-11-18 Thread GitBox



cloud-fan commented on code in PR #38692:
URL: https://github.com/apache/spark/pull/38692#discussion_r1026457875


##
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala:
##
@@ -192,6 +192,23 @@ class SparkSessionExtensionSuite extends SparkFunSuite 
with SQLHelper {
 testInjectColumnar(false)
   }
 
+  test("inject plan normalization rules") {
+val extensions = create { extensions =>
+  extensions.injectPlanNormalizationRules { session =>
+org.apache.spark.sql.catalyst.optimizer.PushDownPredicates
+  }
+}
+withSession(extensions) { session =>
+  import session.implicits._
+  val df = Seq((1, "a"), (2, "b")).toDF("i", "s")
+  df.select("i").filter($"i" > 1).cache()
+  assert(df.filter($"i" > 1).select("i").queryExecution.executedPlan.find {
+case _: org.apache.spark.sql.execution.columnar.InMemoryTableScanExec 
=> true
+case _ => false

Review Comment:
   yup



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #38497: [SPARK-40999] Hint propagation to subqueries

2022-11-18 Thread GitBox



cloud-fan closed pull request #38497: [SPARK-40999] Hint propagation to 
subqueries
URL: https://github.com/apache/spark/pull/38497


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #38497: [SPARK-40999] Hint propagation to subqueries

2022-11-18 Thread GitBox



cloud-fan commented on PR #38497:
URL: https://github.com/apache/spark/pull/38497#issuecomment-1320017014

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec

2022-11-18 Thread GitBox



cloud-fan commented on PR #38687:
URL: https://github.com/apache/spark/pull/38687#issuecomment-1320010391

   there is another cache in `SessionCatalog.tableRelationCache`, shall we 
update it as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38706: [TEST ONLY] Come back to collect.foreach(send)

2022-11-18 Thread GitBox



HyukjinKwon commented on code in PR #38706:
URL: https://github.com/apache/spark/pull/38706#discussion_r1026369143


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala:
##
@@ -57,13 +55,7 @@ class SparkConnectStreamHandler(responseObserver: 
StreamObserver[ExecutePlanResp
 // Extract the plan from the request and convert it to a logical plan
 val planner = new SparkConnectPlanner(session)
 val dataframe = Dataset.ofRows(session, 
planner.transformRelation(request.getPlan.getRoot))
-try {
-  processAsArrowBatches(request.getClientId, dataframe)
-} catch {
-  case e: Exception =>
-logWarning(e.getMessage)
-processAsJsonBatches(request.getClientId, dataframe)
-}
+processAsArrowBatches(request.getClientId, dataframe)
   }
 
   def processAsJsonBatches(clientId: String, dataframe: DataFrame): Unit = {

Review Comment:
   let's also remove this and protobuf definition



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-18 Thread GitBox



HyukjinKwon commented on code in PR #38659:
URL: https://github.com/apache/spark/pull/38659#discussion_r1026348675


##
sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala:
##
@@ -21,24 +21,22 @@ import java.nio.charset.StandardCharsets
 import java.sql.{Date, Timestamp}
 import java.text.SimpleDateFormat
 import java.util.Locale
-

Review Comment:
   Let's keep these newlines. I think Scala linter would complain about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1

2022-11-18 Thread GitBox



HyukjinKwon closed pull request #38675: [SPARK-41161][BUILD] Upgrade 
scala-parser-combinators to 2.1.1
URL: https://github.com/apache/spark/pull/38675


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1

2022-11-18 Thread GitBox



HyukjinKwon commented on PR #38675:
URL: https://github.com/apache/spark/pull/38675#issuecomment-1319891935

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026330362


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -743,6 +743,21 @@ object LikeSimplification extends Rule[LogicalPlan] {
 }
   }
 
+  def combinePatterns(

Review Comment:
   Remove it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions

2022-11-18 Thread GitBox



MaxGekk commented on code in PR #38705:
URL: https://github.com/apache/spark/pull/38705#discussion_r1026301316


##
sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out:
##
@@ -14,7 +29,7 @@ select format_string()
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-requirement failed: format_string() should take at least 1 argument; line 1 
pos 7
+0; line 1 pos 7

Review Comment:
   hmm, why is that? We should see similar error as for `select concat_ws()`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

2022-11-18 Thread GitBox



EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1319839395

   > Could we fix the `DeduplicateRelations`?
   
   Interesting, that sounds like a better solution. I'll look into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk closed pull request #38688: [SPARK-41166][SQL][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites

2022-11-18 Thread GitBox



MaxGekk closed pull request #38688: [SPARK-41166][SQL][TESTS] Check 
errorSubClass of DataTypeMismatch in *ExpressionSuites
URL: https://github.com/apache/spark/pull/38688


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38688: [SPARK-41166][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites

2022-11-18 Thread GitBox



MaxGekk commented on PR #38688:
URL: https://github.com/apache/spark/pull/38688#issuecomment-1319815373

   +1, LGTM. Merging to master.
   Thank you, @panbingkun.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a diff in pull request #38682: [SPARK-41167][SQL] Optimize LikeSimplification rule to improve multi like performance

2022-11-18 Thread GitBox



wangyum commented on code in PR #38682:
URL: https://github.com/apache/spark/pull/38682#discussion_r1026264942


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -756,16 +771,16 @@ object LikeSimplification extends Rule[LogicalPlan] {
 } else {
   multi match {
 case l: LikeAll =>
-  val and = replacements.reduceLeft(And)
+  val and = combinePatterns(replacements, And.apply).get

Review Comment:
   Could you reuse these code to create a balanced tree:
   
https://github.com/apache/spark/blob/5c43da6858721664318c3cdbcb051231b0e98175/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L1657-L1669



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38664: [SPARK-41147][SQL] Assign a name to the legacy error class `_LEGACY_ERROR_TEMP_1042`

2022-11-18 Thread GitBox



MaxGekk commented on code in PR #38664:
URL: https://github.com/apache/spark/pull/38664#discussion_r1026264084


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -146,7 +147,10 @@ object FunctionRegistryBase {
 .filter(_.getParameterTypes.forall(_ == classOf[Expression]))
 .map(_.getParameterCount).distinct.sorted
   throw QueryCompilationErrors.invalidFunctionArgumentNumberError(
-validParametersCount, name, params.length)
+expressions.map(toPrettySQL(_)).mkString(","),

Review Comment:
   Could you pass `Seq[Expression]` into 
`invalidFunctionArgumentNumberError()`, and form a string inside of the method 
as we do in other places like:
   ```scala
 def moreThanOneGeneratorError(generators: Seq[Expression], clause: 
String): Throwable = {
   new AnalysisException(
 errorClass = "UNSUPPORTED_GENERATOR.MULTI_GENERATOR",
 messageParameters = Map(
   "clause" -> clause,
   "num" -> generators.size.toString,
   "generators" -> generators.map(toSQLExpr).mkString(", ")))
 }
   ```
   
   For instance, if we change the implementation of `toSQLExpr()`, your 
exceptions won't pick up new impl.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38650: [SPARK-41135][SQL] Rename `UNSUPPORTED_EMPTY_LOCATION` to `INVALID_EMPTY_LOCATION`

2022-11-18 Thread GitBox



MaxGekk commented on code in PR #38650:
URL: https://github.com/apache/spark/pull/38650#discussion_r1026258331


##
core/src/main/resources/error/error-classes.json:
##
@@ -656,6 +656,11 @@
 ],
 "sqlState" : "42000"
   },
+  "INVALID_EMPTY_LOCATION" : {
+"message" : [
+  "The location name cannot be empty string or null, but `` was 
given."

Review Comment:
   Could you add a test when the location is NULL. I just worry that parameters 
substitution might not handle this. Please, check this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk closed pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`

2022-11-18 Thread GitBox



MaxGekk closed pull request #38644: [SPARK-41130][SQL] Rename 
`OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
URL: https://github.com/apache/spark/pull/38644


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`

2022-11-18 Thread GitBox



MaxGekk commented on PR #38644:
URL: https://github.com/apache/spark/pull/38644#issuecomment-1319797393

   +1, LGTM. Merging to master.
   Thank you, @itholic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`

2022-11-18 Thread GitBox



MaxGekk commented on code in PR #38644:
URL: https://github.com/apache/spark/pull/38644#discussion_r1026251775


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala:
##
@@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with 
QueryErrorsBase {
   Decimal("12345678901234567890123456789012345678"))
 checkExceptionInExpression[ArithmeticException](
   cast("123456789012345678901234567890123456789", DecimalType(38, 0)),
-  "Out of decimal type range")
+  "NUMERIC_OUT_OF_SUPPORTED_RANGE")

Review Comment:
   > I think we might need to similar test utils for expression test such as 
checkErrorInExpression something like that.
   
   Good point. Could you open an JIRA to not forget about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk closed pull request #38665: [SPARK-41156][SQL] Remove the class `TypeCheckFailure`

2022-11-18 Thread GitBox



MaxGekk closed pull request #38665: [SPARK-41156][SQL] Remove the class 
`TypeCheckFailure`
URL: https://github.com/apache/spark/pull/38665


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38665: [SPARK-41156][SQL] Remove the class `TypeCheckFailure`

2022-11-18 Thread GitBox



MaxGekk commented on PR #38665:
URL: https://github.com/apache/spark/pull/38665#issuecomment-1319784931

   > There are still some uses in spark-rapids. I haven't found other uses in 
other famous repositories
   
   ok. Let's leave `TypeCheckFailure` as is. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk opened a new pull request, #38712: [WIP][SQL] Parameterized SQL queries

2022-11-18 Thread GitBox



MaxGekk opened a new pull request, #38712:
URL: https://github.com/apache/spark/pull/38712

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   By running new tests:
   ```
   $ build/sbt "test:testOnly *PlanParserSuite"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] toujours33 opened a new pull request, #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-18 Thread GitBox



toujours33 opened a new pull request, #38711:
URL: https://github.com/apache/spark/pull/38711

   ### What changes were proposed in this pull request?
   ExecutorAllocationManager only record count for speculative task, 
`stageAttemptToNumSpeculativeTasks` increment when speculative task submit, and 
only decrement when speculative task end.
   If task finished before speculative task start, the speculative task will 
never be scheduled, which will cause leak of 
`stageAttemptToNumSpeculativeTasks` and mislead the calculation of target 
executors.
   
   This PR fixes the issue by add task index in 
`SparkListenerSpeculativeTaskSubmitted` event, and record speculative task with 
task index, when task finished, the speculative task will also decrement.
   
   ### Why are the changes needed?
   To fix idle executors caused by pending speculative task from task that has 
been finished
   
   ### Does this PR introduce _any_ user-facing change?
   DeveloperApi `SparkListenerSpeculativeTaskSubmitted` add taskIndex with 
default value -1
   
   ### How was this patch tested?
   Add a comprehensive unit test.
   Pass the GA


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 116 matches

Mail list logo