[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132040956


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##
@@ -166,4 +166,14 @@ case class InMemoryTableScanExec(
   protected override def doExecuteColumnar(): RDD[ColumnarBatch] = {
 columnarInputRDD
   }
+
+  def isMaterialized: Boolean = relation.cacheBuilder.isCachedColumnBuffersLoaded
+
+  /**
+   * This method is only used by AQE, which executes the actual cached RDD without
+   * the filter and the row/columnar serialization.
+   */
+  def executeCache(): RDD[CachedBatch] = {

Review Comment:
   it doesn't execute anything and the current name is confusing.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132040751


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##
@@ -166,4 +166,14 @@ case class InMemoryTableScanExec(
   protected override def doExecuteColumnar(): RDD[ColumnarBatch] = {
 columnarInputRDD
   }
+
+  def isMaterialized: Boolean = relation.cacheBuilder.isCachedColumnBuffersLoaded
+
+  /**
+   * This method is only used by AQE, which executes the actual cached RDD without
+   * the filter and the row/columnar serialization.
+   */
+  def executeCache(): RDD[CachedBatch] = {

Review Comment:
   `baseCacheRDD`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132039721


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,19 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+TaskContext.get().addTaskCompletionListener[Unit](_ => {

Review Comment:
   we can register this listener before returning the wrapping iterator.
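
   A minimal sketch of the suggestion (my own illustration, not the PR's code): register the completion listener once, up front, instead of inside the iterator's methods. `onTaskDone` stands in for whatever stats/side effects the real code performs when the partition finishes.
   ```
   import org.apache.spark.TaskContext

   // Illustration only: the listener is registered before the wrapping iterator
   // is returned, as suggested above.
   def wrapWithCompletionListener[T](it: Iterator[T], onTaskDone: () => Unit): Iterator[T] = {
     TaskContext.get().addTaskCompletionListener[Unit](_ => onTaskDone())
     new Iterator[T] {
       override def hasNext: Boolean = it.hasNext
       override def next(): T = it.next()
     }
   }
   ```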






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132037463


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,14 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// Only the root `AdaptiveSparkPlanExec` of the main query that triggers 
this query execution
+// should update UI.

Review Comment:
   ```suggestion
   // need to do a final plan update for the UI.
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132037463


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,14 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// Only the root `AdaptiveSparkPlanExec` of the main query that triggers 
this query execution
+// should update UI.

Review Comment:
   ```suggestion
   // need to do a final plan update.
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132036220


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -520,6 +526,14 @@ case class AdaptiveSparkPlanExec(
   }
   }
 
+case i: InMemoryTableScanExec =>

Review Comment:
   question: if the table cache is already materialized (second access of the 
cache), do we still need to wrap it with `TableCacheQueryStage`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132036803


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -345,7 +350,7 @@ case class AdaptiveSparkPlanExec(
 // Subqueries that don't belong to any query stage of the main query will 
execute after the
 // last UI update in `getFinalPhysicalPlan`, so we need to update UI here 
again to make sure
 // the newly generated nodes of those subqueries are updated.
-if (!isSubquery && currentPhysicalPlan.exists(_.subqueries.nonEmpty)) {
+if (shouldUpdatePlan && currentPhysicalPlan.exists(_.subqueries.nonEmpty)) 
{

Review Comment:
   I think it's clearer to rename `shouldUpdatePlan` to `needFinalPlanUpdate`.






[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463391491

   @cloud-fan It seems https://github.com/apache/spark/pull/32283 originally wanted to fix an issue in k8s, and then @dongjoon-hyun limited it to the k8s env. But this can also work for the YARN env.





[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463389076

   The failed UT should not be related to this PR.





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132032012


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   addressed






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132027166


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   how about we put it this way: only the root `AdaptiveSparkPlanExec` of the 
main query that triggers this query execution should update UI.






[GitHub] [spark] thousandhu commented on pull request #40361: [SPARK_42742]access apiserver by pod env

2023-03-09 Thread via GitHub


thousandhu commented on PR #40361:
URL: https://github.com/apache/spark/pull/40361#issuecomment-1463380353

   I've enabled GitHub Actions in your forked repository. How do I rerun the build check that failed above?
   





[GitHub] [spark] cloud-fan commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463378857

   @dongjoon-hyun do you have more context about https://github.com/apache/spark/pull/33403? Why do we limit the SparkContext-stopping behavior to k8s only?





[GitHub] [spark] thousandhu opened a new pull request, #40361: [SPARK_42742]access apiserver by pod env

2023-03-09 Thread via GitHub


thousandhu opened a new pull request, #40361:
URL: https://github.com/apache/spark/pull/40361

   
   
   ### What changes were proposed in this pull request?
   When starting Spark on k8s, the driver pod uses spark.kubernetes.driver.master to get the apiserver address. This config defaults to https://kubernetes.default.svc/ and does not take the apiserver port into account.
   
   In our case the apiserver port is not 443, so the driver throws a ConnectException. As the k8s docs mention (https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#directly-accessing-the-rest-api), we can get the master URL from the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT_HTTPS environment variables inside the pod. So we add a new conf, spark.kubernetes.driver.master.from.pod.env, to let the driver get the master URL from the pod env in cluster mode on k8s.
   
   
   ### Why are the changes needed?
   Add a new conf spark.kubernetes.driver.master.from.pod.env to let the driver pod get the apiserver address automatically from the pod env instead of from spark.kubernetes.driver.master.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. When the user sets the new conf spark.kubernetes.driver.master.from.pod.env to true, the logic the driver uses to get the apiserver URL changes. In some cases this will help the user get the right apiserver URL.
   By default the conf spark.kubernetes.driver.master.from.pod.env is false, and the driver logic is unchanged.
   
   ### How was this patch tested?
   No new unit test; the apiserver is mocked in unit tests. We tested this feature in our k8s
cluster
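
   A minimal sketch of what the description proposes (my own illustration; the helper and its flag are hypothetical, merely mirroring the proposed spark.kubernetes.driver.master.from.pod.env conf, not the PR's code):
   ```
   // Hypothetical helper: when the flag is true, build the apiserver URL from the
   // in-pod service environment variables; otherwise use the configured master.
   def resolveK8sMaster(fromPodEnv: Boolean, configuredMaster: String): String = {
     if (fromPodEnv) {
       val host = sys.env("KUBERNETES_SERVICE_HOST")
       val port = sys.env("KUBERNETES_SERVICE_PORT_HTTPS")
       s"https://$host:$port"
     } else {
       configuredMaster // e.g. the default https://kubernetes.default.svc/
     }
   }
   ```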
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132007081


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -181,19 +181,35 @@ private case object OracleDialect extends JdbcDialect {
 if (limit > 0) s"WHERE rownum <= $limit" else ""
   }
 
+  override def getOffsetClause(offset: Integer): String = {
+// Oracle doesn't support OFFSET clause.
+// We can use rownum > n to skip some rows in the result set.
+// Note: rn is an alias of rownum.
+if (offset > 0) s"WHERE rn > $offset" else ""
+  }
+
   class OracleSQLQueryBuilder(dialect: JdbcDialect, options: JDBCOptions)
 extends JdbcSQLQueryBuilder(dialect, options) {
 
-// TODO[SPARK-42289]: DS V2 pushdown could let JDBC dialect decide to push 
down offset
 override def build(): String = {
   val selectStmt = s"SELECT $columnList FROM ${options.tableOrQuery} 
$tableSampleClause" +
 s" $whereClause $groupByClause $orderByClause"
-  if (limit > 0) {
-val limitClause = dialect.getLimitClause(limit)
-options.prepareQuery + s"SELECT tab.* FROM ($selectStmt) tab 
$limitClause"
+  val finalSelectStmt = if (limit > 0) {
+if (offset > 0) {
+  s"SELECT $columnList FROM (SELECT tab.*, rownum rn FROM 
($selectStmt) tab)" +

Review Comment:
   how about
   ```
   SELECT * FROM ($selectStmt) tab WHERE rownum > ...
   ```






[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132005993


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   Yes, but there are two query executions. The current execution ID maps to the outer query execution. The code workflow is:
   
   1. assign a new execution id for an action
   2. map the execution id to the outer query execution, so no execution id can map to the nested query execution 
https://github.com/apache/spark/blob/f8966e7eee1d7f2db7b376d557d5ff6658c80653/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L78
   3. find the outer query execution and update the plan
   
https://github.com/apache/spark/blob/f8966e7eee1d7f2db7b376d557d5ff6658c80653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L222-L226
   






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132005986


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -181,19 +181,35 @@ private case object OracleDialect extends JdbcDialect {
 if (limit > 0) s"WHERE rownum <= $limit" else ""
   }
 
+  override def getOffsetClause(offset: Integer): String = {
+// Oracle doesn't support OFFSET clause.
+// We can use rownum > n to skip some rows in the result set.
+// Note: rn is an alias of rownum.

Review Comment:
   nvm, I see the implementation.






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132005699


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -181,19 +181,35 @@ private case object OracleDialect extends JdbcDialect {
 if (limit > 0) s"WHERE rownum <= $limit" else ""
   }
 
+  override def getOffsetClause(offset: Integer): String = {
+// Oracle doesn't support OFFSET clause.
+// We can use rownum > n to skip some rows in the result set.
+// Note: rn is an alias of rownum.

Review Comment:
   every table in oracle has the `rn` column?






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132005206


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala:
##
@@ -291,4 +291,22 @@ private case object MySQLDialect extends JdbcDialect with 
SQLConfHelper {
   throw QueryExecutionErrors.unsupportedDropNamespaceRestrictError()
 }
   }
+
+  class MySQLSQLQueryBuilder(dialect: JdbcDialect, options: JDBCOptions)
+extends JdbcSQLQueryBuilder(dialect, options) {
+
+override def build(): String = {
+  if (limit < 1 && offset > 0) {
+val offsetClause = dialect.getOffsetClause(offset)
+options.prepareQuery +
+  s"SELECT $columnList FROM ${options.tableOrQuery} 
$tableSampleClause" +
+  s" $whereClause $groupByClause $orderByClause LIMIT 
18446744073709551610 $offsetClause"

Review Comment:
   what does this `LIMIT 18446744073709551610` mean?
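
   For context, stated as an assumption about MySQL rather than a fact from this thread: MySQL has no standalone OFFSET clause, so "skip n rows to the end" has to be written as LIMIT with a huge row count plus OFFSET. The offset-only query the builder above produces would presumably render roughly like this:
   ```
   // Illustrative only; the huge LIMIT constant plays the role of "no upper bound".
   val offset = 5
   val mysqlOffsetOnly = s"SELECT * FROM t LIMIT 18446744073709551610 OFFSET $offset"
   ```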






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132004161


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala:
##
@@ -410,6 +410,15 @@ private[v2] trait V2JDBCTest extends SharedSparkSession 
with DockerIntegrationFu
 assert(sorts.isEmpty)
   }
 
+  private def checkOffsetPushed(df: DataFrame, offset: Option[Int]): Unit = {

Review Comment:
   can we rename `limitPushed` to `checkLimitPushed` and follow the 
implementation here?






[GitHub] [spark] wangyum commented on a diff in pull request #40360: [SPARK-42741][SQL] Do not unwrap casts in binary comparison when literal is null

2023-03-09 Thread via GitHub


wangyum commented on code in PR #40360:
URL: https://github.com/apache/spark/pull/40360#discussion_r1132000965


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala:
##
@@ -192,7 +192,7 @@ class UnwrapCastInBinaryComparisonSuite extends PlanTest 
with ExpressionEvalHelp
 })
   }
 
-  test("unwrap casts when literal is null") {
+  test("SPARK-42741: Do not unwrap casts in binary comparison when literal is 
null") {

Review Comment:
   Expressions in this test are optimized by `NullPropagation`.






[GitHub] [spark] beliefer commented on pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


beliefer commented on PR #40359:
URL: https://github.com/apache/spark/pull/40359#issuecomment-1463340665

   ping @cloud-fan cc @sadikovi 





[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463340552

   > This seems to be a revert of #33403 as now we stop SparkContext in YARN 
environment as well. We should justify it in the PR description. This is not 
simply passing the exitCode. Please update the PR title as well.
   
   Done





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131119650


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##
@@ -166,4 +170,32 @@ case class InMemoryTableScanExec(
   protected override def doExecuteColumnar(): RDD[ColumnarBatch] = {
 columnarInputRDD
   }
+
+  // ==
+  // Methods for AQE
+  // ==
+
+  @transient
+  private lazy val future: FutureAction[Unit] = {
+val rdd = cachedPlan.execute()
+sparkContext.submitJob(
+  rdd,
+  (_: Iterator[InternalRow]) => (),
+  (0 until rdd.getNumPartitions).toSeq,
+  (_: Int, _: Unit) => (),
+  ()
+)
+  }
+
+  override def isMaterialized: Boolean = relation.isMaterialized || 
super.isMaterialized

Review Comment:
   it seems `relation.isMaterialized` is not accurate enough; removed it






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131995849


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   It's a bit hard for me to map this diagram to the code. `getExecutionId` 
returns the current execution ID, right?






[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131995398


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala:
##
@@ -584,6 +585,101 @@ final class DataFrameStatFunctions private[sql] 
(sparkSession: SparkSession, roo
 }
 CountMinSketch.readFrom(ds.head())
   }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param colName
+   *   name of the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param fpp
+   *   expected false positive probability of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): 
BloomFilter = {
+buildBloomFilter(Column(colName), expectedNumItems, -1L, fpp)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param col
+   *   the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param fpp
+   *   expected false positive probability of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): 
BloomFilter = {
+buildBloomFilter(col, expectedNumItems, -1L, fpp)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param colName
+   *   name of the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param numBits
+   *   expected number of bits of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): 
BloomFilter = {
+buildBloomFilter(Column(colName), expectedNumItems, numBits, Double.NaN)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param col
+   *   the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param numBits
+   *   expected number of bits of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): 
BloomFilter = {
+buildBloomFilter(col, expectedNumItems, numBits, Double.NaN)
+  }
+
+  private def buildBloomFilter(
+  col: Column,
+  expectedNumItems: Long,
+  numBits: Long,
+  fpp: Double): BloomFilter = {
+
+def optimalNumOfBits(n: Long, p: Double): Long =

Review Comment:
   Do you mean to pass all 3 parameters to the server side, and then do the 
`conversion(optimalNumOfBits)` on the server side?
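
   For reference, a sketch of the classic optimal-bits formula for a Bloom filter, m = -n * ln(p) / (ln 2)^2, which is presumably what `optimalNumOfBits` computes here; whether this conversion happens on the client or the server side is exactly what this thread is discussing, so treat the placement as an assumption:
   ```
   // Standard Bloom filter sizing: bits needed for n items at false positive rate p.
   def optimalNumOfBits(expectedNumItems: Long, fpp: Double): Long =
     (-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
   ```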






[GitHub] [spark] cloud-fan commented on pull request #40116: [SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1463337176

   I think the test is easy to fix. It wants to test the aggregate function 
result, but not the generated alias, so we just change the testing query to add 
alias explicitly.
   ```
   val avgDF = intervalData.select(
 avg($"year-month").as("a1"),
 avg($"year").as("a2"),
 ...
   ```





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131992558


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -250,3 +252,44 @@ case class BroadcastQueryStageExec(
 
   override def getRuntimeStatistics: Statistics = broadcast.runtimeStatistics
 }
+
+/**
+ * A table cache query stage whose child is a [[InMemoryTableScanExec]].
+ *
+ * @param id the query stage id.
+ * @param plan the underlying plan.
+ */
+case class TableCacheQueryStageExec(
+override val id: Int,
+override val plan: SparkPlan) extends QueryStageExec {
+
+  @transient val inMemoryTableScan = plan match {
+case i: InMemoryTableScanExec => i
+case _ =>
+  throw new IllegalStateException(s"wrong plan for in memory stage:\n 
${plan.treeString}")
+  }
+
+  @transient
+  private lazy val future: FutureAction[Unit] = {
+val rdd = inMemoryTableScan.executeCache()
+sparkContext.submitJob(
+  rdd,
+  (_: Iterator[CachedBatch]) => (),
+  (0 until rdd.getNumPartitions).toSeq,
+  (_: Int, _: Unit) => (),
+  ()
+)
+  }
+
+  override protected def doMaterialize(): Future[Any] = future
+
+  override def isMaterialized: Boolean = super.isMaterialized || 
inMemoryTableScan.isMaterialized
+
+  override def cancel(): Unit = {
+if (!isMaterialized) {

Review Comment:
   changed to a debug level logging






[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131990280


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,22 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+override def hasNext: Boolean = {

Review Comment:
   sgtm






[GitHub] [spark] cloud-fan commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side for yarn mode

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463331316

   This seems to be a revert of https://github.com/apache/spark/pull/33403 as 
now we stop SparkContext in YARN environment as well. We should justify it in 
the PR description. This is not simply passing the exitCode. Please update the 
PR title as well.





[GitHub] [spark] wangyum opened a new pull request, #40360: [SPARK-42741][SQL] Do not unwrap casts in binary comparison when literal is null

2023-03-09 Thread via GitHub


wangyum opened a new pull request, #40360:
URL: https://github.com/apache/spark/pull/40360

   ### What changes were proposed in this pull request?
   
   This PR makes `UnwrapCastInBinaryComparison` not to unwrap casts in binary 
comparison when literal is null.
   
   ### Why are the changes needed?
   
   To make the logic of `UnwrapCastInBinaryComparison` clearer. Null literals are already handled by `NullPropagation`:
   
https://github.com/apache/spark/blob/2de0d45887509fac8d5fc9448764a0e71f618797/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L823-L824
   
   
https://github.com/apache/spark/blob/2de0d45887509fac8d5fc9448764a0e71f618797/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L850-L851
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing unit tests.
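
   As an illustration (my own example based on the `NullPropagation` links above, not code from this PR), a comparison whose literal side is null never needs special handling in `UnwrapCastInBinaryComparison`, because `NullPropagation` already folds it:
   ```
   import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, EqualTo, Literal}
   import org.apache.spark.sql.types.{IntegerType, LongType}

   // A cast column compared against a null literal.
   val i = AttributeReference("i", IntegerType)()
   val cmp = EqualTo(Cast(i, LongType), Literal(null, LongType))
   // Per the links above, NullPropagation rewrites `cmp` to a null (boolean) literal,
   // so the unwrap-cast rule no longer needs a null-literal branch.
   ```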





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131987375


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   The execution id is assigned by `SQLExecution.withNewExecutionId`, registered in `SQLExecution.executionIdToQueryExecution`, and used for an action. The nested query execution itself is not an action, so it does not have an execution id.






[GitHub] [spark] cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support 
parameterized query in subquery and CTE
URL: https://github.com/apache/spark/pull/40333





[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463321220

   GA passes, let me merge it back.





[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131980520


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,22 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+override def hasNext: Boolean = {

Review Comment:
   shall we use `TaskContext.addTaskCompletionListener`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131976928


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,22 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+override def hasNext: Boolean = {

Review Comment:
   what if the caller side invokes `hasNext` multiple times after it returns 
`false`?






[GitHub] [spark] beliefer opened a new pull request, #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


beliefer opened a new pull request, #40359:
URL: https://github.com/apache/spark/pull/40359

   ### What changes were proposed in this pull request?
   Currently, the DS V2 pushdown framework pushes offset as `OFFSET n` by default and pushes it together with limit as `LIMIT m OFFSET n`. But some built-in dialects don't support this syntax, so when Spark pushes an offset down to these databases, they throw errors.
   
   
   ### Why are the changes needed?
   Fix the bug that pushed-down offset or paging is invalid for some built-in dialects.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'Yes'.
   The bug will be fixed.
   
   
   ### How was this patch tested?
   New test cases.
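
   A rough illustration of the default behavior described above (a sketch, not the framework's actual code): offset alone renders as "OFFSET n" and, combined with a limit, as "LIMIT m OFFSET n", which is the syntax some dialects (for example Oracle, per the dialect changes in this PR) do not accept.
   ```
   // Sketch of the default clause rendering for limit/offset pushdown.
   def defaultLimitOffsetClause(limit: Int, offset: Int): String = {
     val limitClause = if (limit > 0) s"LIMIT $limit" else ""
     val offsetClause = if (offset > 0) s"OFFSET $offset" else ""
     s"$limitClause $offsetClause".trim
   }
   ```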
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131975162


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -250,3 +252,44 @@ case class BroadcastQueryStageExec(
 
   override def getRuntimeStatistics: Statistics = broadcast.runtimeStatistics
 }
+
+/**
+ * A table cache query stage whose child is a [[InMemoryTableScanExec]].
+ *
+ * @param id the query stage id.
+ * @param plan the underlying plan.
+ */
+case class TableCacheQueryStageExec(
+override val id: Int,
+override val plan: SparkPlan) extends QueryStageExec {
+
+  @transient val inMemoryTableScan = plan match {
+case i: InMemoryTableScanExec => i
+case _ =>
+  throw new IllegalStateException(s"wrong plan for in memory stage:\n 
${plan.treeString}")
+  }
+
+  @transient
+  private lazy val future: FutureAction[Unit] = {
+val rdd = inMemoryTableScan.executeCache()
+sparkContext.submitJob(
+  rdd,
+  (_: Iterator[CachedBatch]) => (),
+  (0 until rdd.getNumPartitions).toSeq,
+  (_: Int, _: Unit) => (),
+  ()
+)
+  }
+
+  override protected def doMaterialize(): Future[Any] = future
+
+  override def isMaterialized: Boolean = super.isMaterialized || 
inMemoryTableScan.isMaterialized
+
+  override def cancel(): Unit = {
+if (!isMaterialized) {

Review Comment:
   shall we do nothing here? I don't think we ever need to cancel a table cache 
job. The cached data will be accessed sooner or later.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131974200


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   what does `no execution id` mean?






[GitHub] [spark] amaliujia commented on a diff in pull request #40358: [SPARK-42733][CONNECT][Followup] Write without path or table

2023-03-09 Thread via GitHub


amaliujia commented on code in PR #40358:
URL: https://github.com/apache/spark/pull/40358#discussion_r1131967870


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:
##
@@ -345,6 +347,37 @@ final class DataFrameWriter[T] private[sql] (ds: 
Dataset[T]) {
 })
   }
 
+  /**
+   * Saves the content of the `DataFrame` to an external database table via 
JDBC. In the case the
+   * table already exists in the external database, behavior of this function 
depends on the save
+   * mode, specified by the `mode` function (default to throwing an exception).
+   *
+   * Don't create too many partitions in parallel on a large cluster; 
otherwise Spark might crash
+   * your external database systems.
+   *
+   * JDBC-specific option and parameter documentation for storing tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string 
tag/value. Normally at least
+   *   a "user" and "password" property should be included. "batchsize" can be 
used to control the
+   *   number of rows per insert. "isolationLevel" can be one of "NONE", 
"READ_COMMITTED",
+   *   "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding 
to standard
+   *   transaction isolation levels defined by JDBC's Connection object, with 
default of
+   *   "READ_UNCOMMITTED".
+   * @since 1.4.0

Review Comment:
   since 3.4.0
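   For reference, a hedged usage sketch of the method documented above; the URL, table name, and credentials are placeholders, and `df` is assumed to be an existing `DataFrame`:
   ```scala
   import java.util.Properties

   val props = new Properties()
   props.setProperty("user", "sa")          // placeholder credentials
   props.setProperty("password", "secret")
   // Writes df to the external table via JDBC, honoring the configured save mode.
   df.write.mode("append").jdbc("jdbc:postgresql://localhost:5432/testdb", "public.events", props)
   ```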



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


ueshin commented on code in PR #40356:
URL: https://github.com/apache/spark/pull/40356#discussion_r1131958688


##
python/pyspark/sql/tests/test_datasources.py:
##
@@ -192,6 +193,23 @@ def test_ignore_column_of_all_nulls(self):
 finally:
 shutil.rmtree(path)
 
+def test_jdbc(self):
+db = f"memory:{uuid.uuid4()}"

Review Comment:
   I think it should work without any configs.
   Could you try to rebuild?
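   For illustration, a hedged Scala analogue of such a round-trip test, assuming the embedded Derby driver is on the classpath (the URL pattern and table name are assumptions, not taken from the PR):
   ```scala
   import java.util.Properties

   // An in-memory Derby database; ";create=true" creates it on first connect.
   val url = s"jdbc:derby:memory:${java.util.UUID.randomUUID()};create=true"
   spark.range(10).write.jdbc(url, "test_table", new Properties())
   val readBack = spark.read.jdbc(url, "test_table", new Properties())
   assert(readBack.count() == 10)
   ```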



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin closed pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-09 Thread via GitHub


ueshin closed pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement 
data type string parser
URL: https://github.com/apache/spark/pull/40276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-09 Thread via GitHub


ueshin commented on PR #40276:
URL: https://github.com/apache/spark/pull/40276#issuecomment-1463286761

   Close this in favor of #40260.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


dongjoon-hyun commented on PR #40357:
URL: https://github.com/apache/spark/pull/40357#issuecomment-1463260567

   Thank you, @xinrong-meng and @HyukjinKwon .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131939457


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   > but it's not available because the SubqueryAlias blocked it, this rule 
kept endlessly (re)appending the metadata column
   
   Makes sense. So this change is a safeguard.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131938647


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   yup



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-09 Thread via GitHub


zhenlineo commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1463236560

   @hvanhovell Want to keep this or shall we skip? It helps a bit when not 
knowing `build/sbt -Pconnect -Phive package` before running the IT. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-09 Thread via GitHub


zhenlineo commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1463235387

   > seems `SimpleSparkConnectService` startup failed, the error message is
   > 
   > ```
   > Error: Missing application resource.
   > 
   > Usage: spark-submit [options]  [app 
arguments]
   > Usage: spark-submit --kill [submission ID] --master [spark://...]
   > Usage: spark-submit --status [submission ID] --master [spark://...]
   > Usage: spark-submit run-example [options] example-class [example args]
   > 
   > Options:
   >   --master MASTER_URL spark://host:port, mesos://host:port, yarn,
   >   k8s://https://host:port, or local (Default: 
local[*]).
   >   --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
   >   on one of the worker machines inside the 
cluster ("cluster")
   >   (Default: client).
   >   --class CLASS_NAME  Your application's main class (for Java / 
Scala apps).
   >   --name NAME A name of your application.
   >   --jars JARS Comma-separated list of jars to include on 
the driver
   > ...
   > ```
   
   Yeah, this was caused by the bug we had in the scripts.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131931758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   I hit a weird endless loop with this while debugging this `SubqueryAlias` 
issue. Basically, if the plan root already has a metadata attribute (perhaps 
added manually by a query rewrite), but it's not available because the 
`SubqueryAlias` blocked it, this rule kept endlessly (re)appending the metadata 
column to the projections below the `SubqueryAlias`. Once the rule ran 100 
times (leaving 100 copies of `_metadata` in the `Project` output), the endless 
loop detector kicked in and killed it.
   
   I don't think filtering by `inputAttrs` helps, when the problem is what's 
already in the `output` we're appending to?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo opened a new pull request, #40358: [SPARK-42733][CONNECT][Followup] Write without path or table

2023-03-09 Thread via GitHub


zhenlineo opened a new pull request, #40358:
URL: https://github.com/apache/spark/pull/40358

   ### What changes were proposed in this pull request?
   Fixes `DataFrameWriter.save` to work without path or table parameter.
   Added support for the jdbc method in the writer, as it is one of the implementations that does not contain a path or table.
   
   ### Why are the changes needed?
   DataFrameWriter.save should work without the path parameter because some data sources, such as jdbc and noop, work without those parameters.
   This is the follow-up fix for the Scala client of https://github.com/apache/spark/pull/40356
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit and E2E test
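   For illustration, a hedged example of a save that needs neither a path nor a table, using the built-in "noop" data source (the DataFrame here is illustrative):
   ```scala
   // The noop sink discards the data, so no path or table is required.
   spark.range(10).write.format("noop").mode("overwrite").save()
   ```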
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131932553


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   Re the "only include" comment, do you mean something like this?
   ```scala
   val missingMetadata = p.metadataOutput
 .filter(a => requiredAttrIds.contains(a.exprId))
 .filterNot(a => p.projectList.exists(_.exprId == a.exprId))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131931758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   I hit a weird endless loop with this while debugging this `SubqueryAlias` 
issue. Basically, if the plan root already has a metadata attribute (perhaps 
added manually by a query rewrite), but it's not available because the 
`SubqueryAlias` blocked it, this rule kept endlessly (re)appending the metadata 
column to the projections below the `SubqueryAlias`. Once the rule ran 100 
times (leaving 100 copies of `_metadata` in the `Project` output), the endless 
loop detector kicked in and killed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131931758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   I hit a weird endless loop with this while debugging this `SubqueryAlias` 
issue. Basically, if the plan root already has a metadata attribute, but it's 
not available because the `SubqueryAlias` blocked it, this rule kept endlessly 
(re)appending the metadata column to the projections below the `SubqueryAlias`. 
Once the rule ran 100 times (leaving 100 copies of `_metadata` in the `Project` 
output), the endless loop detector kicked in and killed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-09 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1463230011

   > Hm, I just don't see the logic in that. It isn't how SQL works either, as 
far as I understand. Here's maybe another example, imagine a DataFrame defined 
by `SELECT 3 as id, 3 as ID`. Would you also say selecting "id" is unambiguous? 
and it makes sense to you if I change a 3 to a 4 that this query is no longer 
semantically valid?
   
   If it's valid as per the plan then yes. 
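   For illustration, a hedged sketch of the example under debate (assuming a case-insensitive resolver, both aliases refer to the same name):
   ```scala
   val df = spark.sql("SELECT 3 AS id, 3 AS ID")
   df.select("id")  // whether this should resolve or fail is exactly the point of contention
   ```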


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-09 Thread via GitHub


shrprasa commented on PR #40128:
URL: https://github.com/apache/spark/pull/40128#issuecomment-1463227858

   Gentle ping @holdenk


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-09 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1131921529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I see value in both depending on the use case. What about making it 
configurable? If we just switch to runtime checks everywhere, it will be a 
substantial behavior change. We can add a new SQL property and default to the 
existing INSERT behavior of throwing an exception during the analysis phase.
   
   By the way, I don't target 3.4 in this PR so we will have time to build a 
proper runtime checking framework. I think that would be a substantial effort 
as we need to cover inner fields, arrays, maps. There is no logic for that at 
the moment, if I am not mistaken.
   
   I do think consistency would be important. UPDATE and INSERT should behave 
in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-09 Thread via GitHub


beliefer commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1131922264


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##
@@ -154,46 +215,46 @@ object LiteralValueProtoConverter {
 }
 
 val elementType = array.getElementType
-if (elementType.hasShort) {
-  makeArrayData(v => v.getShort.toShort)
-} else if (elementType.hasInteger) {
-  makeArrayData(v => v.getInteger)
-} else if (elementType.hasLong) {
-  makeArrayData(v => v.getLong)
-} else if (elementType.hasDouble) {
-  makeArrayData(v => v.getDouble)
-} else if (elementType.hasByte) {
-  makeArrayData(v => v.getByte.toByte)
-} else if (elementType.hasFloat) {
-  makeArrayData(v => v.getFloat)
-} else if (elementType.hasBoolean) {
-  makeArrayData(v => v.getBoolean)
-} else if (elementType.hasString) {
-  makeArrayData(v => v.getString)
-} else if (elementType.hasBinary) {
-  makeArrayData(v => v.getBinary.toByteArray)
-} else if (elementType.hasDate) {
-  makeArrayData(v => DateTimeUtils.toJavaDate(v.getDate))
-} else if (elementType.hasTimestamp) {
-  makeArrayData(v => DateTimeUtils.toJavaTimestamp(v.getTimestamp))
-} else if (elementType.hasTimestampNtz) {
-  makeArrayData(v => 
DateTimeUtils.microsToLocalDateTime(v.getTimestampNtz))
-} else if (elementType.hasDayTimeInterval) {
-  makeArrayData(v => IntervalUtils.microsToDuration(v.getDayTimeInterval))
-} else if (elementType.hasYearMonthInterval) {
-  makeArrayData(v => IntervalUtils.monthsToPeriod(v.getYearMonthInterval))
-} else if (elementType.hasDecimal) {
-  makeArrayData(v => Decimal(v.getDecimal.getValue))
-} else if (elementType.hasCalendarInterval) {
-  makeArrayData(v => {
-val interval = v.getCalendarInterval
-new CalendarInterval(interval.getMonths, interval.getDays, 
interval.getMicroseconds)
-  })
-} else if (elementType.hasArray) {
-  makeArrayData(v => toArrayData(v.getArray))
-} else {
-  throw InvalidPlanInput(s"Unsupported Literal Type: $elementType)")
+makeArrayData(getConverter(elementType))
+  }
+
+  private def toMapData(map: proto.Expression.Literal.Map): Any = {
+def makeMapData[K, V](
+keyConverter: proto.Expression.Literal => K,
+valueConverter: proto.Expression.Literal => V)(implicit
+tagK: ClassTag[K],
+tagV: ClassTag[V]): mutable.Map[K, V] = {
+  val builder = mutable.HashMap.empty[K, V]
+  val keys = map.getKeysList.asScala
+  val values = map.getValuesList.asScala
+  builder.sizeHint(keys.size)
+  keys.zip(values).foreach { case (key, value) =>
+builder += ((keyConverter(key), valueConverter(value)))
+  }
+  builder
 }
+
+makeMapData(getConverter(map.getKeyType), getConverter(map.getValueType))
   }
 
+  private def toStructData(struct: proto.Expression.Literal.Struct): Any = {
+val elements = struct.getElementsList.asScala
+val dataTypes = 
struct.getStructType.getStruct.getFieldsList.asScala.map(_.getDataType)
+val structData = elements
+  .zip(dataTypes)
+  .map { case (element, dataType) =>
+getConverter(dataType)(element)
+  }
+  .toList
+
+structData match {
+  case List(a) => (a)
+  case List(a, b) => (a, b)
+  case List(a, b, c) => (a, b, c)
+  case List(a, b, c, d) => (a, b, c, d)
+  case List(a, b, c, d, e) => (a, b, c, d, e)
+  case List(a, b, c, d, e, f) => (a, b, c, d, e, f)

Review Comment:
   This code looks bloated, but I have no idea how to convert a List to a Tuple[N].
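   For illustration, a hedged alternative sketch that avoids the arity switch, assuming the callers could accept a generic `Row` instead of a `TupleN` (an assumption, not code from the PR); `structData` is the `Seq[Any]` produced by the converters above:
   ```scala
   import org.apache.spark.sql.Row

   // Build a generic row from the converted struct fields instead of a TupleN.
   val rowValue: Row = Row.fromSeq(structData)
   ```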



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-09 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1131921529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I see value in both depending on the use case. What about making it 
configurable? If we just switch to runtime checks everywhere, it will be a 
substantial behavior change. We can add a new SQL property and default to the 
existing INSERT behavior of throwing an exception during the analysis phase.
   
   By the way, I don't target 3.4 in this PR so we will have time to build a 
proper runtime checking framework. I think that would be a substantial effort 
as we need to cover inner fields. There is no logic for that at the moment, if 
I am not mistaken.
   
   I do think consistency would be important. UPDATE and INSERT should behave 
in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on a diff in pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


zhenlineo commented on code in PR #40356:
URL: https://github.com/apache/spark/pull/40356#discussion_r1131922084


##
python/pyspark/sql/tests/test_datasources.py:
##
@@ -192,6 +193,23 @@ def test_ignore_column_of_all_nulls(self):
 finally:
 shutil.rmtree(path)
 
+def test_jdbc(self):
+db = f"memory:{uuid.uuid4()}"

Review Comment:
   @ueshin What kind of config do I need to make this work? I got an error: Database xxx not found.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-09 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1131921529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I see value in both depending on the use case. What about making it 
configurable? If we just switch to runtime checks everywhere, it will be a 
substantial behavior change. We can add a new SQL property and default to the 
existing behavior of throwing an exception during the analysis phase.
   
   By the way, I don't target 3.4 in this PR so we will have time to build a 
proper runtime checking framework. I think that would be a substantial effort 
as we need to cover inner fields. There is no logic for that at the moment, if 
I am not mistaken.
   
   I do think consistency would be important. UPDATE and INSERT should behave 
in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-09 Thread via GitHub


beliefer commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1131919854


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:
##
@@ -2065,6 +2065,43 @@ class PlanGenerationTestSuite
   fn.lit(Array(new CalendarInterval(2, 20, 100L), new CalendarInterval(2, 
21, 200L
   }
 
+  test("function typedLit") {
+simple.select(
+  fn.typedLit(fn.col("id")),
+  fn.typedLit('id),
+  fn.typedLit(1),

Review Comment:
   Thank you for the reminder.
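   For reference, a hedged reminder of what the test exercises: `typedLit` keeps the Scala type information (so `Seq` and `Map` literals work), which plain `lit` may not preserve:
   ```scala
   import org.apache.spark.sql.{functions => fn}

   fn.lit(1)                   // a plain integer literal
   fn.typedLit(Seq(1, 2, 3))   // an array<int> literal
   fn.typedLit(Map("a" -> 1))  // a map<string,int> literal
   ```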



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-09 Thread via GitHub


wangyum commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1463196395

   I had a change like this before: https://github.com/apache/spark/pull/22778.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #40350: [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow`

2023-03-09 Thread via GitHub


xinrong-meng commented on PR #40350:
URL: https://github.com/apache/spark/pull/40350#issuecomment-1463193506

   Thanks @HyukjinKwon !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131904105


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala:
##
@@ -176,4 +176,31 @@ class DataFrameStatSuite extends RemoteSparkSession {
 assert(sketch.relativeError() === 0.001)
 assert(sketch.confidence() === 0.99 +- 5e-3)
   }
+
+  // This test only verifies some basic requirements, more correctness tests 
can be found in
+  // `BloomFilterSuite` in project spark-sketch.
+  test("Bloom filter") {

Review Comment:
   Let me add those cases
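   For reference, a hedged usage sketch of the API under test (column name and sizes are illustrative):
   ```scala
   val df = spark.range(1000).toDF("id")
   // Build a Bloom filter over "id" sized for roughly 1000 items with a 3% false-positive rate.
   val filter = df.stat.bloomFilter("id", 1000, 0.03)
   assert((0L until 1000L).forall(filter.mightContain))
   ```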



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131904105


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala:
##
@@ -176,4 +176,31 @@ class DataFrameStatSuite extends RemoteSparkSession {
 assert(sketch.relativeError() === 0.001)
 assert(sketch.confidence() === 0.99 +- 5e-3)
   }
+
+  // This test only verifies some basic requirements, more correctness tests 
can be found in
+  // `BloomFilterSuite` in project spark-sketch.
+  test("Bloom filter") {

Review Comment:
   Let me add them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng commented on PR #40357:
URL: https://github.com/apache/spark/pull/40357#issuecomment-1463187698

   Merged to master and branch-3.4, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng closed pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng closed pull request #40357: [SPARK-42739][BUILD] Ensure release 
tag to be pushed to release branch
URL: https://github.com/apache/spark/pull/40357


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on a diff in pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng commented on code in PR #40357:
URL: https://github.com/apache/spark/pull/40357#discussion_r1131900698


##
dev/create-release/release-tag.sh:
##
@@ -122,6 +122,12 @@ if ! is_dry_run; then
   git push origin $RELEASE_TAG
   if [[ $RELEASE_VERSION != *"preview"* ]]; then
 git push origin HEAD:$GIT_BRANCH

Review Comment:
   When `git push origin HEAD:$GIT_BRANCH` doesn't succeed, the PR proposes to 
call out the exact error instead of failing silently. 
   In addition, the reason why we have to `| grep origin` is that `git branch -r --contains tags/_d_tag` exits 0 even if `git push origin HEAD:$GIT_BRANCH` doesn't execute.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] chong0929 commented on a diff in pull request #40341: [SPARK-42715][SQL] Tips for Optimizing NegativeArraySizeException

2023-03-09 Thread via GitHub


chong0929 commented on code in PR #40341:
URL: https://github.com/apache/spark/pull/40341#discussion_r1131900349


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##
@@ -204,7 +204,12 @@ public void initBatch(
* by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch 
columns.
*/
   private boolean nextBatch() throws IOException {
-recordReader.nextBatch(wrap.batch());
+try {
+  recordReader.nextBatch(wrap.batch());
+} catch (NegativeArraySizeException e) {

Review Comment:
   Thanks for your ideas, they sound nice; I will get it done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131897415


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1073,6 +1074,12 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 3 =>
+val children = 
fun.getArgumentsList.asScala.toSeq.map(transformExpression)
+Some(
+  new BloomFilterAggregate(children.head, children(1), children(2))

Review Comment:
   > There is a small issue here. The aggregate requires the first input to be 
a Long. `DataFrameStatFunctions.bloomFilter` supports `Byte`, `Short`, `Int`, 
`Long`, and `String`. While we can simply add a cast to long for the first 4, 
string will be an issue. We need adapt the BloomFilterAggregate to make it 
fully compatible.
   
   So maybe adding a new protobuf message is a simpler way? `BloomFilterAggregate` is an internal function used by `InjectRuntimeFilter`, and `InjectRuntimeFilter` hashes both the input column and the might-contain values, so I think there is no String-type input in that scenario.
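   For illustration, a hedged sketch of the "cast to long" idea mentioned in the quoted comment (an assumption, not code from the PR); String inputs would still need changes in `BloomFilterAggregate` itself:
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
   import org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate
   import org.apache.spark.sql.types.LongType

   // Cast the (numeric) input to LongType before handing it to the aggregate.
   def bloomAgg(child: Expression, items: Expression, bits: Expression): Expression =
     new BloomFilterAggregate(Cast(child, LongType), items, bits).toAggregateExpression()
   ```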



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] chong0929 commented on a diff in pull request #40341: [SPARK-42715][SQL] Tips for Optimizing NegativeArraySizeException

2023-03-09 Thread via GitHub


chong0929 commented on code in PR #40341:
URL: https://github.com/apache/spark/pull/40341#discussion_r1131896943


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##
@@ -204,7 +204,12 @@ public void initBatch(
* by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch 
columns.
*/
   private boolean nextBatch() throws IOException {
-recordReader.nextBatch(wrap.batch());
+try {

Review Comment:
   Thoughtful, I will add a test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


HyukjinKwon commented on code in PR #40357:
URL: https://github.com/apache/spark/pull/40357#discussion_r1131893994


##
dev/create-release/release-tag.sh:
##
@@ -122,6 +122,12 @@ if ! is_dry_run; then
   git push origin $RELEASE_TAG
   if [[ $RELEASE_VERSION != *"preview"* ]]; then
 git push origin HEAD:$GIT_BRANCH

Review Comment:
   Hm, so to clarify, we set `set -e` on the top, meaning that the script will 
fail immediately if any command fails with non-zero exit.
   
   I assume that there's a case when `git push origin $RELEASE_TAG` and `git 
push origin HEAD:$GIT_BRANCH` are successfully executed but the tag still 
doesn't exist? If that's the case LGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131893211


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1073,6 +1074,12 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 3 =>
+val children = 
fun.getArgumentsList.asScala.toSeq.map(transformExpression)
+Some(
+  new BloomFilterAggregate(children.head, children(1), children(2))

Review Comment:
   > You will need to hash the input column.
   
   like `new BloomFilterAggregate(new XxHash64(Seq(children.head)), `
   
   But if we hash the input column, I think it will be hashed twice ... the 
existing test case 
   
   ```
   assert(0.until(1000).forall(filter1.mightContain))
   ```
   
   will not pass



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-09 Thread via GitHub


beliefer commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1131890754


##
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##
@@ -195,6 +197,17 @@ message Expression {
   DataType elementType = 1;
   repeated Literal element = 2;
 }
+
+message Map {
+  DataType keyType = 1;
+  DataType valueType = 2;
+  map map_data = 3;

Review Comment:
   Yeah. So we can support the key as any literal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463145544

   maybe there is a conflict right after my last commit, let me rebase


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40349: [SPARK-42725][CONNECT][PYTHON] Make LiteralExpression support array params

2023-03-09 Thread via GitHub


zhengruifeng commented on PR #40349:
URL: https://github.com/apache/spark/pull/40349#issuecomment-1463120914

   merged into master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40349: [SPARK-42725][CONNECT][PYTHON] Make LiteralExpression support array params

2023-03-09 Thread via GitHub


zhengruifeng closed pull request #40349: [SPARK-42725][CONNECT][PYTHON] Make 
LiteralExpression support array params
URL: https://github.com/apache/spark/pull/40349


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131870627


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -281,6 +281,53 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
 )
   }
 
+  metadataColumnsTest("metadata propagates through projections automatically",

Review Comment:
   This change is for the general metadata col framework, not file source metadata columns; we can add the tests in `MetadataColumnSuite`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131869082


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   do we hit a real issue with this? If the metadata col is already in the 
output, this code path should not be triggered as we have `val metaCols = 
getMetadataAttributes(node).filterNot(inputAttrs.contains)`
   
   What we can improve is to only include metadata columns that are included in 
`requiredAttrIds`.
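   For illustration, a hedged sketch of that refinement (variable names follow the quoted rule; an assumption, not code from the PR):
   ```scala
   // Only append the metadata columns that were actually requested and are not
   // already present in the project list.
   val requestedMetaCols = p.metadataOutput
     .filter(a => requiredAttrIds.contains(a.exprId))
     .filterNot(a => p.projectList.exists(_.exprId == a.exprId))
   p.copy(projectList = p.projectList ++ requestedMetaCols)
   ```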



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


HyukjinKwon commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463085558

   Seems like the compilation didn't pass. Let me just quickly revert this and reopen.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-09 Thread via GitHub


itholic commented on code in PR #40282:
URL: https://github.com/apache/spark/pull/40282#discussion_r1127483352


##
python/docs/source/development/errors.rst:
##
@@ -0,0 +1,92 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===
+Error conditions in PySpark
+===
+
+This is a list of common, named error conditions returned by PySpark which are 
defined at `error_classes.py 
`_.
+
+When writing PySpark errors, developers must use an error condition from the 
list. If an appropriate error condition is not available, add a new one into 
the list. For more information, please refer to `Contributing Error and 
Exception 
`_.
+
+++--+

Review Comment:
   Or maybe you want to organize all the error classes that exist in the JVM and Python on one page?
   
   IMHO, it is better to document them separately in each document because most 
error classes on the JVM side are SQL-related error classes including SQLSTATE, 
and Python error classes are error classes for Python-specific types and values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-09 Thread via GitHub


itholic commented on PR #40282:
URL: https://github.com/apache/spark/pull/40282#issuecomment-1463083039

   Documentation for the SQL side has been merged via https://github.com/apache/spark/pull/40336.
   
   Note that the Python side is simpler compared to the SQL side because we do not have SQLSTATE, and there is currently no main error class with sub-error classes. Also, the overall volume of errors is not as high as in the SQL documents.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1130968837


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)
+//   |
+//  InMemoryTableScanExec
+//   |
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 1, execution id 0)

Review Comment:
   They are two query executions. How can the outer AQE work through IMR if we replace the nested AQE with `CacheTableQueryStageExec`?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463076663

   This is a bug fix for a new feature in 3.4, so I won't call it a release 
blocker. I've set the fixed version to 3.4.0; if RC3 passes, I'll change it to 
3.4.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463075399

   The failed `BasicSchedulerIntegrationSuite` is not related to this PR. I'm 
merging this to master/3.4; thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support 
parameterized query in subquery and CTE
URL: https://github.com/apache/spark/pull/40333


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-09 Thread via GitHub


HyukjinKwon commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1131838714


##
docs/spark-connect-overview.md:
##
@@ -0,0 +1,244 @@
+---
+layout: global
+title: Spark Connect Overview
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+**Building client-side Spark applications**
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
+architecture that allows remote connectivity to Spark clusters using the
+DataFrame API and unresolved logical plans as the protocol. The separation
+between client and server allows Spark and its open ecosystem to be
+leveraged from everywhere. It can be embedded in modern data applications,
+in IDEs, Notebooks and programming languages.
+
+To get started, see [Quickstart: Spark 
Connect](api/python/getting_started/quickstart_connect.html).
+
+
+  
+
+
+# How Spark Connect works
+
+The Spark Connect client library is designed to simplify Spark application
+development. It is a thin API that can be embedded everywhere: in application
+servers, IDEs, notebooks, and programming languages. The Spark Connect API
+builds on Spark's DataFrame API using unresolved logical plans as a
+language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved
+logical query plans which are encoded using protocol buffers. These are sent
+to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server receives and
+translates unresolved logical plans into Spark's logical plan operators.
+This is similar to parsing a SQL query, where attributes and relations are
+parsed and an initial parse plan is built. From there, the standard Spark
+execution process kicks in, ensuring that Spark Connect leverages all of
+Spark's optimizations and enhancements. Results are streamed back to the
+client via gRPC as Apache Arrow-encoded row batches.
+
+
+  
+
+
+# Operational benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their
+own environment as they can run in their own processes. Users can define their
+own dependencies on the client and don't need to worry about potential 
conflicts
+with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded 
independently
+of applications, e.g. to benefit from performance improvements and security 
fixes.
+This means applications can be forward-compatible, as long as the server-side 
RPC
+definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive 
debugging
+during development directly from your favorite IDE. Similarly, applications can
+be monitored using the application's framework native metrics and logging 
libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark and 
Scala
+applications. We will walk through how to run an Apache Spark server with Spark
+Connect and connect to it from a client application using the Spark Connect 
client
+library.
+
+## Download and start Spark server with Spark Connect
+
+First, download Spark from the
+[Download Apache Spark](https://spark.apache.org/downloads.html) page. Spark 
Connect
+was introduced in Apache Spark version 3.4 so make sure you choose 3.4.0 or 
newer in
+the release drop down at the top of the page. Then choose your package type, 
typically
+“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.
+
+Now extract the Spark package you just downloaded on your computer, for 
example:
+
+{% highlight bash %}
+tar -xvf spark-3.4.0-bin-hadoop3.tgz
+{% endhighlight %}
+
+In a terminal window, now go to the `spark` folder in the location where you 
extracted
+Spark before and run the `start-connect-server.sh` script to start Spark 
server with
+Spark Connect, like in this example:
+
+{% highlight bash %}
+./sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.4.0
+{% endhighlight %}
+
+Note that we include a Spark Connect package 
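
   Once the server above is running, a minimal client-side sketch of connecting to it 
(assuming the PySpark 3.4 client with the Spark Connect dependencies installed):

   ```python
   from pyspark.sql import SparkSession

   # "sc://localhost" targets the default Spark Connect port (15002) on this machine.
   spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
   spark.range(5).show()
   ```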

[GitHub] [spark] HyukjinKwon closed pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-09 Thread via GitHub


HyukjinKwon closed pull request #40302: [SPARK-42686][CORE] Defer formatting 
for debug messages in TaskMemoryManager
URL: https://github.com/apache/spark/pull/40302


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-09 Thread via GitHub


HyukjinKwon commented on PR #40302:
URL: https://github.com/apache/spark/pull/40302#issuecomment-1463061643

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on pull request #40353: [SPARK-42732][PYSPARK][CONNECT] Support spark connect session getActiveSession method

2023-03-09 Thread via GitHub


WeichenXu123 commented on PR #40353:
URL: https://github.com/apache/spark/pull/40353#issuecomment-1463060514

   > @WeichenXu123 in what case won't Spark Connect ML have access to the 
session?
   
   For some APIs, like `estimator.fit(dataset)` and `model.transform(dataset)`, we 
can get the session from the input Spark DataFrame. In other cases, e.g. getting a 
model attribute, there is no input DataFrame, so we need `getActiveSession` to get 
the session and then send the attribute request to the server side.
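   
   A rough sketch of the two access patterns being described (function names and 
bodies are illustrative, not the actual Spark Connect ML code):
   
   ```python
   from pyspark.sql import DataFrame, SparkSession

   def transform(dataset: DataFrame) -> DataFrame:
       # A DataFrame is in hand, so the session is reachable from it directly.
       session = dataset.sparkSession
       return dataset  # placeholder for the real transform request sent via `session`

   def read_model_attribute(name: str):
       # No DataFrame is involved, so the client falls back to the active session.
       session = SparkSession.getActiveSession()
       assert session is not None, "an active session is required to reach the server"
       return session.conf.get("spark.app.name")  # placeholder for the real attribute request
   ```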


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #40350: [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow`

2023-03-09 Thread via GitHub


HyukjinKwon closed pull request #40350: [SPARK-42726][CONNECT][PYTHON] 
Implement `DataFrame.mapInArrow`
URL: https://github.com/apache/spark/pull/40350


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40350: [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow`

2023-03-09 Thread via GitHub


HyukjinKwon commented on PR #40350:
URL: https://github.com/apache/spark/pull/40350#issuecomment-1463058011

   The test failure seems unrelated.
   
   Merged to master and branch-3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng closed pull request #40357: [WIP][SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng closed pull request #40357: [WIP][SPARK-42739][BUILD] Ensure 
release tag to be pushed to release branch
URL: https://github.com/apache/spark/pull/40357


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


zhengruifeng commented on PR #40356:
URL: https://github.com/apache/spark/pull/40356#issuecomment-1463045799

   merged to master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


zhengruifeng closed pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix 
DataFrameWriter.save to work without path parameter
URL: https://github.com/apache/spark/pull/40356


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng opened a new pull request, #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng opened a new pull request, #40357:
URL: https://github.com/apache/spark/pull/40357

   ### What changes were proposed in this pull request?
   In the release script, add a check to ensure the release tag is pushed to the 
release branch.
   
   
   ### Why are the changes needed?
   To ensure the success of an RC cut; otherwise, release conductors have to 
check that manually.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Manual test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale

2023-03-09 Thread via GitHub


github-actions[bot] commented on PR #38739:
URL: https://github.com/apache/spark/pull/38739#issuecomment-1463026755

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1131780862


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -244,6 +245,89 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   parameters = Map("fieldName" -> "`file_name`", "fields" -> "`id`, 
`university`"))
   }
 
+  metadataColumnsTest("df metadataColumn - schema conflict",
+schemaWithNameConflicts) { (df, f0, f1) =>
+// the user data has the schema: name, age, _metadata.id, 
_metadata.university
+
+// get the real metadata column (whose name should have been adjusted)
+val metadataColumn = df.metadataColumn("_metadata")
+assert(metadataColumn.expr.asInstanceOf[NamedExpression].name == 
"__metadata")
+
+// select user data
+checkAnswer(
+  df.select("name", "age", "_METADATA", "_metadata")
+.withColumn("file_name", metadataColumn.getField("file_name")),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), Row(12345L, "uom"), 
f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), Row(54321L, "ucb"), 
f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("df metadataColumn - no schema conflict",
+schema) { (df, f0, f1) =>
+// get the real metadata column (whose name should _NOT_ have been 
adjusted)
+val metadataColumn = df.metadataColumn("_metadata")
+assert(metadataColumn.expr.asInstanceOf[NamedExpression].name == 
"_metadata")
+
+// select user data
+checkAnswer(
+  df.select("name", "age")
+.withColumn("file_name", metadataColumn.getField("file_name")),
+  Seq(
+Row("jack", 24, f0(METADATA_FILE_NAME)),
+Row("lily", 31, f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("df metadataColumn - column not found", schema) { (df, 
f0, f1) =>
+// Not a column at all
+checkError(
+  exception = intercept[AnalysisException] {
+df.withMetadataColumn("foo")
+  },
+  errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
+  parameters = Map("objectName" -> "`foo`", "proposal" -> "`_metadata`"))
+
+// Name exists, but does not reference a metadata column
+checkError(
+  exception = intercept[AnalysisException] {
+df.withMetadataColumn("name")
+  },
+  errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
+  parameters = Map("objectName" -> "`name`", "proposal" -> "`_metadata`"))
+  }
+
+  metadataColumnsTest("metadata name conflict resolved with leading 
underscores - one",
+schemaWithNameConflicts) { (df, f0, f1) =>
+// the user data has the schema: name, age, _metadata.id, 
_metadata.university
+
+checkAnswer(
+  df.select("name", "age", "_metadata", "__metadata.file_name"),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("metadata name conflict resolved with leading 
underscores - several",
+new StructType()
+  .add(schema("name").copy(name = "_metadata"))
+  .add(schema("age").copy(name = "__metadata"))
+  .add(schema("info").copy(name = "___metadata"))) { (df, f0, f1) =>
+// the user data has the schema: _metadata, __metadata, ___metadata.id, 
___metadata.university
+
+checkAnswer(
+  df.select("_metadata", "__metadata", "___metadata", 
"metadata.file_name"),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), f1(METADATA_FILE_NAME))
+  )
+)
+  }
+

Review Comment:
   Went ahead and added a test case.
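   
   For context, a minimal sketch of how the hidden `_metadata` column is addressed 
today when there is no naming conflict (assuming a file-based source; the path is 
illustrative):
   
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # The hidden file-source metadata column is addressable by its logical name
   # when no user column conflicts with it.
   df = spark.read.parquet("/tmp/people")  # illustrative path
   df.select("_metadata.file_name", "_metadata.file_size").show()
   ```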



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-03-09 Thread via GitHub


amaliujia commented on code in PR #40315:
URL: https://github.com/apache/spark/pull/40315#discussion_r1131780390


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -737,12 +737,19 @@ class SparkSession private(
   // scalastyle:on
 
   /**
-   * Stop the underlying `SparkContext`.
+   * Stop the underlying `SparkContext` with default exit code 0.
*
* @since 2.0.0
*/
-  def stop(): Unit = {
-sparkContext.stop()
+  def stop(): Unit = stop(0)

Review Comment:
   I think it is better not to change this line. This line builds on 
`sparkContext.stop()`. Even though the underlying implementation is 
`SparkContext.stop(0)`, we'd better not assume that it will always be.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1131780602


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2714,6 +2726,17 @@ class Dataset[T] private[sql](
*/
   def withColumn(colName: String, col: Column): DataFrame = 
withColumns(Seq(colName), Seq(col))
 
+  /**
+   * Returns a new Dataset by selecting a metadata column with the given 
logical name.
+   *
+   * A metadata column can be accessed this way even if the underlying data 
source defines a data
+   * column with a conflicting name.
+   *
+   * @group untypedrel
+   * @since 4.0.0
+   */
+  def withMetadataColumn(colName: String): DataFrame = withColumn(colName, 
metadataColumn(colName))

Review Comment:
   Removed `withMetadataColumn` method for now. Nothing stops us from adding it 
in the future if it's needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


