[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132040956


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##
@@ -166,4 +166,14 @@ case class InMemoryTableScanExec(
   protected override def doExecuteColumnar(): RDD[ColumnarBatch] = {
 columnarInputRDD
   }
+
+  def isMaterialized: Boolean = relation.cacheBuilder.isCachedColumnBuffersLoaded
+
+  /**
+   * This method is only used by AQE, which executes the actual cached RDD without
+   * the filter and the row/columnar serialization.
+   */
+  def executeCache(): RDD[CachedBatch] = {

Review Comment:
   it doesn't execute anything and the current name is confusing.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132040751


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##
@@ -166,4 +166,14 @@ case class InMemoryTableScanExec(
   protected override def doExecuteColumnar(): RDD[ColumnarBatch] = {
 columnarInputRDD
   }
+
+  def isMaterialized: Boolean = relation.cacheBuilder.isCachedColumnBuffersLoaded
+
+  /**
+   * This method is only used by AQE, which executes the actual cached RDD without
+   * the filter and the row/columnar serialization.
+   */
+  def executeCache(): RDD[CachedBatch] = {

Review Comment:
   `baseCacheRDD`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132039721


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,19 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+TaskContext.get().addTaskCompletionListener[Unit](_ => {

Review Comment:
   we can register this listener before returning the wrapping iterator.
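
   A minimal sketch of the suggestion (my own illustration, not the PR's code): register the completion listener once, up front, instead of inside the iterator's methods. `onTaskDone` stands in for whatever stats/side effects the real code performs when the partition finishes.
   ```
   import org.apache.spark.TaskContext

   // Illustration only: the listener is registered before the wrapping iterator
   // is returned, as suggested above.
   def wrapWithCompletionListener[T](it: Iterator[T], onTaskDone: () => Unit): Iterator[T] = {
     TaskContext.get().addTaskCompletionListener[Unit](_ => onTaskDone())
     new Iterator[T] {
       override def hasNext: Boolean = it.hasNext
       override def next(): T = it.next()
     }
   }
   ```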






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132037463


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,14 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// Only the root `AdaptiveSparkPlanExec` of the main query that triggers 
this query execution
+// should update UI.

Review Comment:
   ```suggestion
   // need to do a final plan update for the UI.
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132037463


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,14 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// Only the root `AdaptiveSparkPlanExec` of the main query that triggers 
this query execution
+// should update UI.

Review Comment:
   ```suggestion
   // need to do a final plan update.
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132036220


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -520,6 +526,14 @@ case class AdaptiveSparkPlanExec(
   }
   }
 
+case i: InMemoryTableScanExec =>

Review Comment:
   question: if the table cache is already materialized (second access of the 
cache), do we still need to wrap it with `TableCacheQueryStage`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132036803


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -345,7 +350,7 @@ case class AdaptiveSparkPlanExec(
 // Subqueries that don't belong to any query stage of the main query will 
execute after the
 // last UI update in `getFinalPhysicalPlan`, so we need to update UI here 
again to make sure
 // the newly generated nodes of those subqueries are updated.
-if (!isSubquery && currentPhysicalPlan.exists(_.subqueries.nonEmpty)) {
+if (shouldUpdatePlan && currentPhysicalPlan.exists(_.subqueries.nonEmpty)) 
{

Review Comment:
   I think it's clearer to rename `shouldUpdatePlan` to `needFinalPlanUpdate`.






[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463391491

   @cloud-fan It seems https://github.com/apache/spark/pull/32283 originally wanted to fix an issue in k8s, and then @dongjoon-hyun limited it to the k8s env. But this can also work for the YARN env.





[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463389076

   The failed UT should not be related to this PR.





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132032012


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   addressed






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132027166


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   how about we put it this way: only the root `AdaptiveSparkPlanExec` of the 
main query that triggers this query execution should update UI.






[GitHub] [spark] thousandhu commented on pull request #40361: [SPARK_42742]access apiserver by pod env

2023-03-09 Thread via GitHub


thousandhu commented on PR #40361:
URL: https://github.com/apache/spark/pull/40361#issuecomment-1463380353

   I've enabled GitHub Actions in your forked repository. How do I rerun the build check that failed above?
   





[GitHub] [spark] cloud-fan commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463378857

   @dongjoon-hyun do you have more context about https://github.com/apache/spark/pull/33403? Why do we limit the SparkContext-stopping behavior to k8s only?





[GitHub] [spark] thousandhu opened a new pull request, #40361: [SPARK_42742]access apiserver by pod env

2023-03-09 Thread via GitHub


thousandhu opened a new pull request, #40361:
URL: https://github.com/apache/spark/pull/40361

   
   
   ### What changes were proposed in this pull request?
   When starting Spark on k8s, the driver pod uses spark.kubernetes.driver.master to get the apiserver address. This config defaults to https://kubernetes.default.svc/ and does not take the apiserver port into account.
   
   In our case the apiserver port is not 443, so the driver throws a ConnectException. As the k8s docs mention (https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#directly-accessing-the-rest-api), we can get the master URL from the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT_HTTPS environment variables inside the pod. So we add a new conf, spark.kubernetes.driver.master.from.pod.env, to let the driver get the master URL from the pod env in cluster mode on k8s.
   
   
   ### Why are the changes needed?
   Add a new conf spark.kubernetes.driver.master.from.pod.env to let the driver pod get the apiserver address automatically from the pod env instead of from spark.kubernetes.driver.master.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. When the user sets the new conf spark.kubernetes.driver.master.from.pod.env to true, the logic the driver uses to get the apiserver URL changes. In some cases this will help the user get the right apiserver URL.
   By default the conf spark.kubernetes.driver.master.from.pod.env is false, and the driver logic is unchanged.
   
   ### How was this patch tested?
   No new unit test; the apiserver is mocked in unit tests. We tested this feature in our k8s
cluster
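
   A minimal sketch of what the description proposes (my own illustration; the helper and its flag are hypothetical, merely mirroring the proposed spark.kubernetes.driver.master.from.pod.env conf, not the PR's code):
   ```
   // Hypothetical helper: when the flag is true, build the apiserver URL from the
   // in-pod service environment variables; otherwise use the configured master.
   def resolveK8sMaster(fromPodEnv: Boolean, configuredMaster: String): String = {
     if (fromPodEnv) {
       val host = sys.env("KUBERNETES_SERVICE_HOST")
       val port = sys.env("KUBERNETES_SERVICE_PORT_HTTPS")
       s"https://$host:$port"
     } else {
       configuredMaster // e.g. the default https://kubernetes.default.svc/
     }
   }
   ```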
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132007081


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -181,19 +181,35 @@ private case object OracleDialect extends JdbcDialect {
 if (limit > 0) s"WHERE rownum <= $limit" else ""
   }
 
+  override def getOffsetClause(offset: Integer): String = {
+// Oracle doesn't support OFFSET clause.
+// We can use rownum > n to skip some rows in the result set.
+// Note: rn is an alias of rownum.
+if (offset > 0) s"WHERE rn > $offset" else ""
+  }
+
   class OracleSQLQueryBuilder(dialect: JdbcDialect, options: JDBCOptions)
 extends JdbcSQLQueryBuilder(dialect, options) {
 
-// TODO[SPARK-42289]: DS V2 pushdown could let JDBC dialect decide to push 
down offset
 override def build(): String = {
   val selectStmt = s"SELECT $columnList FROM ${options.tableOrQuery} 
$tableSampleClause" +
 s" $whereClause $groupByClause $orderByClause"
-  if (limit > 0) {
-val limitClause = dialect.getLimitClause(limit)
-options.prepareQuery + s"SELECT tab.* FROM ($selectStmt) tab 
$limitClause"
+  val finalSelectStmt = if (limit > 0) {
+if (offset > 0) {
+  s"SELECT $columnList FROM (SELECT tab.*, rownum rn FROM 
($selectStmt) tab)" +

Review Comment:
   how about
   ```
   SELECT * FROM ($selectStmt) tab WHERE rownum > ...
   ```






[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1132005993


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   Yes, but there are two query executions. The current execution ID maps to the outer query execution. The code workflow is:
   
   1. assign a new execution id for an action
   2. map the execution id to the outer query execution, so no execution id can map to the nested query execution 
https://github.com/apache/spark/blob/f8966e7eee1d7f2db7b376d557d5ff6658c80653/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L78
   3. find the outer query execution and update the plan
   
https://github.com/apache/spark/blob/f8966e7eee1d7f2db7b376d557d5ff6658c80653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L222-L226
   






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132005986


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -181,19 +181,35 @@ private case object OracleDialect extends JdbcDialect {
 if (limit > 0) s"WHERE rownum <= $limit" else ""
   }
 
+  override def getOffsetClause(offset: Integer): String = {
+// Oracle doesn't support OFFSET clause.
+// We can use rownum > n to skip some rows in the result set.
+// Note: rn is an alias of rownum.

Review Comment:
   nvm, I see the implementation.






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132005699


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -181,19 +181,35 @@ private case object OracleDialect extends JdbcDialect {
 if (limit > 0) s"WHERE rownum <= $limit" else ""
   }
 
+  override def getOffsetClause(offset: Integer): String = {
+// Oracle doesn't support OFFSET clause.
+// We can use rownum > n to skip some rows in the result set.
+// Note: rn is an alias of rownum.

Review Comment:
   every table in oracle has the `rn` column?






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132005206


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala:
##
@@ -291,4 +291,22 @@ private case object MySQLDialect extends JdbcDialect with 
SQLConfHelper {
   throw QueryExecutionErrors.unsupportedDropNamespaceRestrictError()
 }
   }
+
+  class MySQLSQLQueryBuilder(dialect: JdbcDialect, options: JDBCOptions)
+extends JdbcSQLQueryBuilder(dialect, options) {
+
+override def build(): String = {
+  if (limit < 1 && offset > 0) {
+val offsetClause = dialect.getOffsetClause(offset)
+options.prepareQuery +
+  s"SELECT $columnList FROM ${options.tableOrQuery} 
$tableSampleClause" +
+  s" $whereClause $groupByClause $orderByClause LIMIT 
18446744073709551610 $offsetClause"

Review Comment:
   what does this `LIMIT 18446744073709551610` mean?
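
   For context, stated as an assumption about MySQL rather than a fact from this thread: MySQL has no standalone OFFSET clause, so "skip n rows to the end" has to be written as LIMIT with a huge row count plus OFFSET. The offset-only query the builder above produces would presumably render roughly like this:
   ```
   // Illustrative only; the huge LIMIT constant plays the role of "no upper bound".
   val offset = 5
   val mysqlOffsetOnly = s"SELECT * FROM t LIMIT 18446744073709551610 OFFSET $offset"
   ```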






[GitHub] [spark] cloud-fan commented on a diff in pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40359:
URL: https://github.com/apache/spark/pull/40359#discussion_r1132004161


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala:
##
@@ -410,6 +410,15 @@ private[v2] trait V2JDBCTest extends SharedSparkSession 
with DockerIntegrationFu
 assert(sorts.isEmpty)
   }
 
+  private def checkOffsetPushed(df: DataFrame, offset: Option[Int]): Unit = {

Review Comment:
   can we rename `limitPushed` to `checkLimitPushed` and follow the 
implementation here?






[GitHub] [spark] wangyum commented on a diff in pull request #40360: [SPARK-42741][SQL] Do not unwrap casts in binary comparison when literal is null

2023-03-09 Thread via GitHub


wangyum commented on code in PR #40360:
URL: https://github.com/apache/spark/pull/40360#discussion_r1132000965


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala:
##
@@ -192,7 +192,7 @@ class UnwrapCastInBinaryComparisonSuite extends PlanTest 
with ExpressionEvalHelp
 })
   }
 
-  test("unwrap casts when literal is null") {
+  test("SPARK-42741: Do not unwrap casts in binary comparison when literal is 
null") {

Review Comment:
   Expressions in this test are optimized by `NullPropagation`.






[GitHub] [spark] beliefer commented on pull request #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


beliefer commented on PR #40359:
URL: https://github.com/apache/spark/pull/40359#issuecomment-1463340665

   ping @cloud-fan cc @sadikovi 





[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-03-09 Thread via GitHub


AngersZh commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463340552

   > This seems to be a revert of #33403 as now we stop SparkContext in YARN 
environment as well. We should justify it in the PR description. This is not 
simply passing the exitCode. Please update the PR title as well.
   
   Done





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131119650


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##
@@ -166,4 +170,32 @@ case class InMemoryTableScanExec(
   protected override def doExecuteColumnar(): RDD[ColumnarBatch] = {
 columnarInputRDD
   }
+
+  // ==
+  // Methods for AQE
+  // ==
+
+  @transient
+  private lazy val future: FutureAction[Unit] = {
+val rdd = cachedPlan.execute()
+sparkContext.submitJob(
+  rdd,
+  (_: Iterator[InternalRow]) => (),
+  (0 until rdd.getNumPartitions).toSeq,
+  (_: Int, _: Unit) => (),
+  ()
+)
+  }
+
+  override def isMaterialized: Boolean = relation.isMaterialized || 
super.isMaterialized

Review Comment:
   it seems `relation.isMaterialized` is not accurate enough; removed it






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131995849


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   It's a bit hard for me to map this diagram to the code. `getExecutionId` 
returns the current execution ID, right?






[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131995398


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala:
##
@@ -584,6 +585,101 @@ final class DataFrameStatFunctions private[sql] 
(sparkSession: SparkSession, roo
 }
 CountMinSketch.readFrom(ds.head())
   }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param colName
+   *   name of the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param fpp
+   *   expected false positive probability of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): 
BloomFilter = {
+buildBloomFilter(Column(colName), expectedNumItems, -1L, fpp)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param col
+   *   the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param fpp
+   *   expected false positive probability of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): 
BloomFilter = {
+buildBloomFilter(col, expectedNumItems, -1L, fpp)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param colName
+   *   name of the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param numBits
+   *   expected number of bits of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): 
BloomFilter = {
+buildBloomFilter(Column(colName), expectedNumItems, numBits, Double.NaN)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param col
+   *   the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param numBits
+   *   expected number of bits of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): 
BloomFilter = {
+buildBloomFilter(col, expectedNumItems, numBits, Double.NaN)
+  }
+
+  private def buildBloomFilter(
+  col: Column,
+  expectedNumItems: Long,
+  numBits: Long,
+  fpp: Double): BloomFilter = {
+
+def optimalNumOfBits(n: Long, p: Double): Long =

Review Comment:
   Do you mean to pass all 3 parameters to the server side, and then do the 
`conversion(optimalNumOfBits)` on the server side?
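
   For reference, a sketch of the classic optimal-bits formula for a Bloom filter, m = -n * ln(p) / (ln 2)^2, which is presumably what `optimalNumOfBits` computes here; whether this conversion happens on the client or the server side is exactly what this thread is discussing, so treat the placement as an assumption:
   ```
   // Standard Bloom filter sizing: bits needed for n items at false positive rate p.
   def optimalNumOfBits(expectedNumItems: Long, fpp: Double): Long =
     (-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
   ```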






[GitHub] [spark] cloud-fan commented on pull request #40116: [SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1463337176

   I think the test is easy to fix. It wants to test the aggregate function 
result, but not the generated alias, so we just change the testing query to add 
alias explicitly.
   ```
   val avgDF = intervalData.select(
 avg($"year-month").as("a1"),
 avg($"year").as("a2"),
 ...
   ```





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131992558


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -250,3 +252,44 @@ case class BroadcastQueryStageExec(
 
   override def getRuntimeStatistics: Statistics = broadcast.runtimeStatistics
 }
+
+/**
+ * A table cache query stage whose child is a [[InMemoryTableScanExec]].
+ *
+ * @param id the query stage id.
+ * @param plan the underlying plan.
+ */
+case class TableCacheQueryStageExec(
+override val id: Int,
+override val plan: SparkPlan) extends QueryStageExec {
+
+  @transient val inMemoryTableScan = plan match {
+case i: InMemoryTableScanExec => i
+case _ =>
+  throw new IllegalStateException(s"wrong plan for in memory stage:\n 
${plan.treeString}")
+  }
+
+  @transient
+  private lazy val future: FutureAction[Unit] = {
+val rdd = inMemoryTableScan.executeCache()
+sparkContext.submitJob(
+  rdd,
+  (_: Iterator[CachedBatch]) => (),
+  (0 until rdd.getNumPartitions).toSeq,
+  (_: Int, _: Unit) => (),
+  ()
+)
+  }
+
+  override protected def doMaterialize(): Future[Any] = future
+
+  override def isMaterialized: Boolean = super.isMaterialized || 
inMemoryTableScan.isMaterialized
+
+  override def cancel(): Unit = {
+if (!isMaterialized) {

Review Comment:
   changed to a debug level logging






[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131990280


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,22 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+override def hasNext: Boolean = {

Review Comment:
   sgtm






[GitHub] [spark] cloud-fan commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side for yarn mode

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1463331316

   This seems to be a revert of https://github.com/apache/spark/pull/33403 as 
now we stop SparkContext in YARN environment as well. We should justify it in 
the PR description. This is not simply passing the exitCode. Please update the 
PR title as well.





[GitHub] [spark] wangyum opened a new pull request, #40360: [SPARK-42741][SQL] Do not unwrap casts in binary comparison when literal is null

2023-03-09 Thread via GitHub


wangyum opened a new pull request, #40360:
URL: https://github.com/apache/spark/pull/40360

   ### What changes were proposed in this pull request?
   
   This PR makes `UnwrapCastInBinaryComparison` not to unwrap casts in binary 
comparison when literal is null.
   
   ### Why are the changes needed?
   
   To make the logic of `UnwrapCastInBinaryComparison` clearer. Null literals are already handled by `NullPropagation`:
   
https://github.com/apache/spark/blob/2de0d45887509fac8d5fc9448764a0e71f618797/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L823-L824
   
   
https://github.com/apache/spark/blob/2de0d45887509fac8d5fc9448764a0e71f618797/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L850-L851
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing unit tests.
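
   As an illustration (my own example based on the `NullPropagation` links above, not code from this PR), a comparison whose literal side is null never needs special handling in `UnwrapCastInBinaryComparison`, because `NullPropagation` already folds it:
   ```
   import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, EqualTo, Literal}
   import org.apache.spark.sql.types.{IntegerType, LongType}

   // A cast column compared against a null literal.
   val i = AttributeReference("i", IntegerType)()
   val cmp = EqualTo(Cast(i, LongType), Literal(null, LongType))
   // Per the links above, NullPropagation rewrites `cmp` to a null (boolean) literal,
   // so the unwrap-cast rule no longer needs a null-literal branch.
   ```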





[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131987375


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   The execution id is assigned by `SQLExecution.withNewExecutionId`, registered in `SQLExecution.executionIdToQueryExecution`, and used for an action. The nested query execution itself is not an action, so it does not have an execution id.






[GitHub] [spark] cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support 
parameterized query in subquery and CTE
URL: https://github.com/apache/spark/pull/40333





[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463321220

   GA passes, let me merge it back.





[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131980520


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,22 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+override def hasNext: Boolean = {

Review Comment:
   shall we use `TaskContext.addTaskCompletionListener`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131976928


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:
##
@@ -275,10 +272,22 @@ case class CachedRDDBuilder(
 storageLevel,
 cachedPlan.conf)
 }
-val cached = cb.map { batch =>
-  sizeInBytesStats.add(batch.sizeInBytes)
-  rowCountStats.add(batch.numRows)
-  batch
+val cached = cb.mapPartitionsInternal { it =>
+  new Iterator[CachedBatch] {
+override def hasNext: Boolean = {

Review Comment:
   what if the caller side invokes `hasNext` multiple times after it returns 
`false`?






[GitHub] [spark] beliefer opened a new pull request, #40359: [SPARK-42740][SQL] Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-03-09 Thread via GitHub


beliefer opened a new pull request, #40359:
URL: https://github.com/apache/spark/pull/40359

   ### What changes were proposed in this pull request?
   Currently, the DS V2 pushdown framework pushes offset as `OFFSET n` by default and pushes it together with limit as `LIMIT m OFFSET n`. But some built-in dialects don't support this syntax, so when Spark pushes an offset down to these databases, they throw errors.
   
   
   ### Why are the changes needed?
   Fix the bug that pushed-down offset or paging is invalid for some built-in dialects.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'Yes'.
   The bug will be fixed.
   
   
   ### How was this patch tested?
   New test cases.
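
   A rough illustration of the default behavior described above (a sketch, not the framework's actual code): offset alone renders as "OFFSET n" and, combined with a limit, as "LIMIT m OFFSET n", which is the syntax some dialects (for example Oracle, per the dialect changes in this PR) do not accept.
   ```
   // Sketch of the default clause rendering for limit/offset pushdown.
   def defaultLimitOffsetClause(limit: Int, offset: Int): String = {
     val limitClause = if (limit > 0) s"LIMIT $limit" else ""
     val offsetClause = if (offset > 0) s"OFFSET $offset" else ""
     s"$limitClause $offsetClause".trim
   }
   ```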
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131975162


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -250,3 +252,44 @@ case class BroadcastQueryStageExec(
 
   override def getRuntimeStatistics: Statistics = broadcast.runtimeStatistics
 }
+
+/**
+ * A table cache query stage whose child is a [[InMemoryTableScanExec]].
+ *
+ * @param id the query stage id.
+ * @param plan the underlying plan.
+ */
+case class TableCacheQueryStageExec(
+override val id: Int,
+override val plan: SparkPlan) extends QueryStageExec {
+
+  @transient val inMemoryTableScan = plan match {
+case i: InMemoryTableScanExec => i
+case _ =>
+  throw new IllegalStateException(s"wrong plan for in memory stage:\n 
${plan.treeString}")
+  }
+
+  @transient
+  private lazy val future: FutureAction[Unit] = {
+val rdd = inMemoryTableScan.executeCache()
+sparkContext.submitJob(
+  rdd,
+  (_: Iterator[CachedBatch]) => (),
+  (0 until rdd.getNumPartitions).toSeq,
+  (_: Int, _: Unit) => (),
+  ()
+)
+  }
+
+  override protected def doMaterialize(): Future[Any] = future
+
+  override def isMaterialized: Boolean = super.isMaterialized || 
inMemoryTableScan.isMaterialized
+
+  override def cancel(): Unit = {
+if (!isMaterialized) {

Review Comment:
   shall we do nothing here? I don't think we ever need to cancel a table cache 
job. The cached data will be accessed sooner or later.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1131974200


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)

Review Comment:
   what does `no execution id` mean?






[GitHub] [spark] amaliujia commented on a diff in pull request #40358: [SPARK-42733][CONNECT][Followup] Write without path or table

2023-03-09 Thread via GitHub


amaliujia commented on code in PR #40358:
URL: https://github.com/apache/spark/pull/40358#discussion_r1131967870


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:
##
@@ -345,6 +347,37 @@ final class DataFrameWriter[T] private[sql] (ds: 
Dataset[T]) {
 })
   }
 
+  /**
+   * Saves the content of the `DataFrame` to an external database table via 
JDBC. In the case the
+   * table already exists in the external database, behavior of this function 
depends on the save
+   * mode, specified by the `mode` function (default to throwing an exception).
+   *
+   * Don't create too many partitions in parallel on a large cluster; 
otherwise Spark might crash
+   * your external database systems.
+   *
+   * JDBC-specific option and parameter documentation for storing tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string 
tag/value. Normally at least
+   *   a "user" and "password" property should be included. "batchsize" can be 
used to control the
+   *   number of rows per insert. "isolationLevel" can be one of "NONE", 
"READ_COMMITTED",
+   *   "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding 
to standard
+   *   transaction isolation levels defined by JDBC's Connection object, with 
default of
+   *   "READ_UNCOMMITTED".
+   * @since 1.4.0

Review Comment:
   since 3.4.0
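   For reference, a hedged usage sketch of the method documented above; the URL, table name, and credentials are placeholders, and `df` is assumed to be an existing `DataFrame`:
   ```scala
   import java.util.Properties

   val props = new Properties()
   props.setProperty("user", "sa")          // placeholder credentials
   props.setProperty("password", "secret")
   // Writes df to the external table via JDBC, honoring the configured save mode.
   df.write.mode("append").jdbc("jdbc:postgresql://localhost:5432/testdb", "public.events", props)
   ```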



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


ueshin commented on code in PR #40356:
URL: https://github.com/apache/spark/pull/40356#discussion_r1131958688


##
python/pyspark/sql/tests/test_datasources.py:
##
@@ -192,6 +193,23 @@ def test_ignore_column_of_all_nulls(self):
 finally:
 shutil.rmtree(path)
 
+def test_jdbc(self):
+db = f"memory:{uuid.uuid4()}"

Review Comment:
   I think it should work without any configs.
   Could you try to rebuild?
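   For illustration, a hedged Scala analogue of such a round-trip test, assuming the embedded Derby driver is on the classpath (the URL pattern and table name are assumptions, not taken from the PR):
   ```scala
   import java.util.Properties

   // An in-memory Derby database; ";create=true" creates it on first connect.
   val url = s"jdbc:derby:memory:${java.util.UUID.randomUUID()};create=true"
   spark.range(10).write.jdbc(url, "test_table", new Properties())
   val readBack = spark.read.jdbc(url, "test_table", new Properties())
   assert(readBack.count() == 10)
   ```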



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin closed pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-09 Thread via GitHub


ueshin closed pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement 
data type string parser
URL: https://github.com/apache/spark/pull/40276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-09 Thread via GitHub


ueshin commented on PR #40276:
URL: https://github.com/apache/spark/pull/40276#issuecomment-1463286761

   Close this in favor of #40260.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


dongjoon-hyun commented on PR #40357:
URL: https://github.com/apache/spark/pull/40357#issuecomment-1463260567

   Thank you, @xinrong-meng and @HyukjinKwon .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131939457


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   > but it's not available because the SubqueryAlias blocked it, this rule 
kept endlessly (re)appending the metadata column
   
   Makes sense. So this change is a safeguard.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131938647


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   yup



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-09 Thread via GitHub


zhenlineo commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1463236560

   @hvanhovell Want to keep this or shall we skip? It helps a bit when not 
knowing `build/sbt -Pconnect -Phive package` before running the IT. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-09 Thread via GitHub


zhenlineo commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1463235387

   > seems `SimpleSparkConnectService` startup failed, the error message is
   > 
   > ```
   > Error: Missing application resource.
   > 
   > Usage: spark-submit [options]  [app 
arguments]
   > Usage: spark-submit --kill [submission ID] --master [spark://...]
   > Usage: spark-submit --status [submission ID] --master [spark://...]
   > Usage: spark-submit run-example [options] example-class [example args]
   > 
   > Options:
   >   --master MASTER_URL spark://host:port, mesos://host:port, yarn,
   >   k8s://https://host:port, or local (Default: 
local[*]).
   >   --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
   >   on one of the worker machines inside the 
cluster ("cluster")
   >   (Default: client).
   >   --class CLASS_NAME  Your application's main class (for Java / 
Scala apps).
   >   --name NAME A name of your application.
   >   --jars JARS Comma-separated list of jars to include on 
the driver
   > ...
   > ```
   
   Yeah, this was caused by the bug we had in the scripts.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131931758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   I hit a weird endless loop with this while debugging this `SubqueryAlias` 
issue. Basically, if the plan root already has a metadata attribute (perhaps 
added manually by a query rewrite), but it's not available because the 
`SubqueryAlias` blocked it, this rule kept endlessly (re)appending the metadata 
column to the projections below the `SubqueryAlias`. Once the rule ran 100 
times (leaving 100 copies of `_metadata` in the `Project` output), the endless 
loop detector kicked in and killed it.
   
   I don't think filtering by `inputAttrs` helps, when the problem is what's 
already in the `output` we're appending to?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo opened a new pull request, #40358: [SPARK-42733][CONNECT][Followup] Write without path or table

2023-03-09 Thread via GitHub


zhenlineo opened a new pull request, #40358:
URL: https://github.com/apache/spark/pull/40358

   ### What changes were proposed in this pull request?
   Fixes `DataFrameWriter.save` to work without path or table parameter.
   Added support for the jdbc method in the writer, as it is one of the implementations that does not contain a path or table.
   
   ### Why are the changes needed?
   DataFrameWriter.save should work without the path parameter because some data sources, such as jdbc and noop, work without those parameters.
   This is the follow-up fix for the Scala client of https://github.com/apache/spark/pull/40356
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit and E2E test
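   For illustration, a hedged example of a save that needs neither a path nor a table, using the built-in "noop" data source (the DataFrame here is illustrative):
   ```scala
   // The noop sink discards the data, so no path or table is required.
   spark.range(10).write.format("noop").mode("overwrite").save()
   ```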
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131932553


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   Re the "only include" comment, do you mean something like this?
   ```scala
   val missingMetadata = p.metadataOutput
 .filter(a => requiredAttrIds.contains(a.exprId))
 .filterNot(a => p.projectList.exists(_.exprId == a.exprId))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131931758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   I hit a weird endless loop with this while debugging this `SubqueryAlias` 
issue. Basically, if the plan root already has a metadata attribute (perhaps 
added manually by a query rewrite), but it's not available because the 
`SubqueryAlias` blocked it, this rule kept endlessly (re)appending the metadata 
column to the projections below the `SubqueryAlias`. Once the rule ran 100 
times (leaving 100 copies of `_metadata` in the `Project` output), the endless 
loop detector kicked in and killed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131931758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   I hit a weird endless loop with this while debugging this `SubqueryAlias` 
issue. Basically, if the plan root already has a metadata attribute, but it's 
not available because the `SubqueryAlias` blocked it, this rule kept endlessly 
(re)appending the metadata column to the projections below the `SubqueryAlias`. 
Once the rule ran 100 times (leaving 100 copies of `_metadata` in the `Project` 
output), the endless loop detector kicked in and killed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-09 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1463230011

   > Hm, I just don't see the logic in that. It isn't how SQL works either, as 
far as I understand. Here's maybe another example, imagine a DataFrame defined 
by `SELECT 3 as id, 3 as ID`. Would you also say selecting "id" is unambiguous? 
and it makes sense to you if I change a 3 to a 4 that this query is no longer 
semantically valid?
   
   If it's valid as per the plan then yes. 
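   For illustration, a hedged sketch of the example under debate (assuming a case-insensitive resolver, both aliases refer to the same name):
   ```scala
   val df = spark.sql("SELECT 3 AS id, 3 AS ID")
   df.select("id")  // whether this should resolve or fail is exactly the point of contention
   ```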


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-09 Thread via GitHub


shrprasa commented on PR #40128:
URL: https://github.com/apache/spark/pull/40128#issuecomment-1463227858

   Gentle ping @holdenk


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-09 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1131921529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I see value in both depending on the use case. What about making it 
configurable? If we just switch to runtime checks everywhere, it will be a 
substantial behavior change. We can add a new SQL property and default to the 
existing INSERT behavior of throwing an exception during the analysis phase.
   
   By the way, I don't target 3.4 in this PR so we will have time to build a 
proper runtime checking framework. I think that would be a substantial effort 
as we need to cover inner fields, arrays, maps. There is no logic for that at 
the moment, if I am not mistaken.
   
   I do think consistency would be important. UPDATE and INSERT should behave 
in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-09 Thread via GitHub


beliefer commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1131922264


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##
@@ -154,46 +215,46 @@ object LiteralValueProtoConverter {
 }
 
 val elementType = array.getElementType
-if (elementType.hasShort) {
-  makeArrayData(v => v.getShort.toShort)
-} else if (elementType.hasInteger) {
-  makeArrayData(v => v.getInteger)
-} else if (elementType.hasLong) {
-  makeArrayData(v => v.getLong)
-} else if (elementType.hasDouble) {
-  makeArrayData(v => v.getDouble)
-} else if (elementType.hasByte) {
-  makeArrayData(v => v.getByte.toByte)
-} else if (elementType.hasFloat) {
-  makeArrayData(v => v.getFloat)
-} else if (elementType.hasBoolean) {
-  makeArrayData(v => v.getBoolean)
-} else if (elementType.hasString) {
-  makeArrayData(v => v.getString)
-} else if (elementType.hasBinary) {
-  makeArrayData(v => v.getBinary.toByteArray)
-} else if (elementType.hasDate) {
-  makeArrayData(v => DateTimeUtils.toJavaDate(v.getDate))
-} else if (elementType.hasTimestamp) {
-  makeArrayData(v => DateTimeUtils.toJavaTimestamp(v.getTimestamp))
-} else if (elementType.hasTimestampNtz) {
-  makeArrayData(v => 
DateTimeUtils.microsToLocalDateTime(v.getTimestampNtz))
-} else if (elementType.hasDayTimeInterval) {
-  makeArrayData(v => IntervalUtils.microsToDuration(v.getDayTimeInterval))
-} else if (elementType.hasYearMonthInterval) {
-  makeArrayData(v => IntervalUtils.monthsToPeriod(v.getYearMonthInterval))
-} else if (elementType.hasDecimal) {
-  makeArrayData(v => Decimal(v.getDecimal.getValue))
-} else if (elementType.hasCalendarInterval) {
-  makeArrayData(v => {
-val interval = v.getCalendarInterval
-new CalendarInterval(interval.getMonths, interval.getDays, 
interval.getMicroseconds)
-  })
-} else if (elementType.hasArray) {
-  makeArrayData(v => toArrayData(v.getArray))
-} else {
-  throw InvalidPlanInput(s"Unsupported Literal Type: $elementType)")
+makeArrayData(getConverter(elementType))
+  }
+
+  private def toMapData(map: proto.Expression.Literal.Map): Any = {
+def makeMapData[K, V](
+keyConverter: proto.Expression.Literal => K,
+valueConverter: proto.Expression.Literal => V)(implicit
+tagK: ClassTag[K],
+tagV: ClassTag[V]): mutable.Map[K, V] = {
+  val builder = mutable.HashMap.empty[K, V]
+  val keys = map.getKeysList.asScala
+  val values = map.getValuesList.asScala
+  builder.sizeHint(keys.size)
+  keys.zip(values).foreach { case (key, value) =>
+builder += ((keyConverter(key), valueConverter(value)))
+  }
+  builder
 }
+
+makeMapData(getConverter(map.getKeyType), getConverter(map.getValueType))
   }
 
+  private def toStructData(struct: proto.Expression.Literal.Struct): Any = {
+val elements = struct.getElementsList.asScala
+val dataTypes = 
struct.getStructType.getStruct.getFieldsList.asScala.map(_.getDataType)
+val structData = elements
+  .zip(dataTypes)
+  .map { case (element, dataType) =>
+getConverter(dataType)(element)
+  }
+  .toList
+
+structData match {
+  case List(a) => (a)
+  case List(a, b) => (a, b)
+  case List(a, b, c) => (a, b, c)
+  case List(a, b, c, d) => (a, b, c, d)
+  case List(a, b, c, d, e) => (a, b, c, d, e)
+  case List(a, b, c, d, e, f) => (a, b, c, d, e, f)

Review Comment:
   This code looks bloated, but I have no idea how to convert a List to a Tuple[N].
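   For illustration, a hedged alternative sketch that avoids the arity switch, assuming the callers could accept a generic `Row` instead of a `TupleN` (an assumption, not code from the PR); `structData` is the `Seq[Any]` produced by the converters above:
   ```scala
   import org.apache.spark.sql.Row

   // Build a generic row from the converted struct fields instead of a TupleN.
   val rowValue: Row = Row.fromSeq(structData)
   ```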



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-09 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1131921529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I see value in both depending on the use case. What about making it 
configurable? If we just switch to runtime checks everywhere, it will be a 
substantial behavior change. We can add a new SQL property and default to the 
existing INSERT behavior of throwing an exception during the analysis phase.
   
   By the way, I don't target 3.4 in this PR so we will have time to build a 
proper runtime checking framework. I think that would be a substantial effort 
as we need to cover inner fields. There is no logic for that at the moment, if 
I am not mistaken.
   
   I do think consistency would be important. UPDATE and INSERT should behave 
in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on a diff in pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


zhenlineo commented on code in PR #40356:
URL: https://github.com/apache/spark/pull/40356#discussion_r1131922084


##
python/pyspark/sql/tests/test_datasources.py:
##
@@ -192,6 +193,23 @@ def test_ignore_column_of_all_nulls(self):
 finally:
 shutil.rmtree(path)
 
+def test_jdbc(self):
+db = f"memory:{uuid.uuid4()}"

Review Comment:
   @ueshin What kind of config do I need to make this work? I got an error: Database xxx not found.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-09 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1131921529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I see value in both depending on the use case. What about making it 
configurable? If we just switch to runtime checks everywhere, it will be a 
substantial behavior change. We can add a new SQL property and default to the 
existing behavior of throwing an exception during the analysis phase.
   
   By the way, I don't target 3.4 in this PR so we will have time to build a 
proper runtime checking framework. I think that would be a substantial effort 
as we need to cover inner fields. There is no logic for that at the moment, if 
I am not mistaken.
   
   I do think consistency would be important. UPDATE and INSERT should behave 
in the same way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-09 Thread via GitHub


beliefer commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1131919854


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:
##
@@ -2065,6 +2065,43 @@ class PlanGenerationTestSuite
   fn.lit(Array(new CalendarInterval(2, 20, 100L), new CalendarInterval(2, 
21, 200L
   }
 
+  test("function typedLit") {
+simple.select(
+  fn.typedLit(fn.col("id")),
+  fn.typedLit('id),
+  fn.typedLit(1),

Review Comment:
   Thank you for the reminder.
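   For reference, a hedged reminder of what the test exercises: `typedLit` keeps the Scala type information (so `Seq` and `Map` literals work), which plain `lit` may not preserve:
   ```scala
   import org.apache.spark.sql.{functions => fn}

   fn.lit(1)                   // a plain integer literal
   fn.typedLit(Seq(1, 2, 3))   // an array<int> literal
   fn.typedLit(Map("a" -> 1))  // a map<string,int> literal
   ```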



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-09 Thread via GitHub


wangyum commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1463196395

   I had a change like this before: https://github.com/apache/spark/pull/22778.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #40350: [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow`

2023-03-09 Thread via GitHub


xinrong-meng commented on PR #40350:
URL: https://github.com/apache/spark/pull/40350#issuecomment-1463193506

   Thanks @HyukjinKwon !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131904105


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala:
##
@@ -176,4 +176,31 @@ class DataFrameStatSuite extends RemoteSparkSession {
 assert(sketch.relativeError() === 0.001)
 assert(sketch.confidence() === 0.99 +- 5e-3)
   }
+
+  // This test only verifies some basic requirements, more correctness tests 
can be found in
+  // `BloomFilterSuite` in project spark-sketch.
+  test("Bloom filter") {

Review Comment:
   Let me add those cases
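   For reference, a hedged usage sketch of the API under test (column name and sizes are illustrative):
   ```scala
   val df = spark.range(1000).toDF("id")
   // Build a Bloom filter over "id" sized for roughly 1000 items with a 3% false-positive rate.
   val filter = df.stat.bloomFilter("id", 1000, 0.03)
   assert((0L until 1000L).forall(filter.mightContain))
   ```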



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131904105


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala:
##
@@ -176,4 +176,31 @@ class DataFrameStatSuite extends RemoteSparkSession {
 assert(sketch.relativeError() === 0.001)
 assert(sketch.confidence() === 0.99 +- 5e-3)
   }
+
+  // This test only verifies some basic requirements, more correctness tests 
can be found in
+  // `BloomFilterSuite` in project spark-sketch.
+  test("Bloom filter") {

Review Comment:
   Let me add them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng commented on PR #40357:
URL: https://github.com/apache/spark/pull/40357#issuecomment-1463187698

   Merged to master and branch-3.4, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng closed pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng closed pull request #40357: [SPARK-42739][BUILD] Ensure release 
tag to be pushed to release branch
URL: https://github.com/apache/spark/pull/40357


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on a diff in pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng commented on code in PR #40357:
URL: https://github.com/apache/spark/pull/40357#discussion_r1131900698


##
dev/create-release/release-tag.sh:
##
@@ -122,6 +122,12 @@ if ! is_dry_run; then
   git push origin $RELEASE_TAG
   if [[ $RELEASE_VERSION != *"preview"* ]]; then
 git push origin HEAD:$GIT_BRANCH

Review Comment:
   When `git push origin HEAD:$GIT_BRANCH` doesn't succeed, the PR proposes to 
call out the exact error instead of failing silently. 
   In addition, the reason why we have to `| grep origin` is that `git branch -r --contains tags/_d_tag` exits 0 even if `git push origin HEAD:$GIT_BRANCH` doesn't execute.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] chong0929 commented on a diff in pull request #40341: [SPARK-42715][SQL] Tips for Optimizing NegativeArraySizeException

2023-03-09 Thread via GitHub


chong0929 commented on code in PR #40341:
URL: https://github.com/apache/spark/pull/40341#discussion_r1131900349


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##
@@ -204,7 +204,12 @@ public void initBatch(
* by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch 
columns.
*/
   private boolean nextBatch() throws IOException {
-recordReader.nextBatch(wrap.batch());
+try {
+  recordReader.nextBatch(wrap.batch());
+} catch (NegativeArraySizeException e) {

Review Comment:
   Thanks for your ideas, they sound nice; I will get it done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131897415


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1073,6 +1074,12 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 3 =>
+val children = 
fun.getArgumentsList.asScala.toSeq.map(transformExpression)
+Some(
+  new BloomFilterAggregate(children.head, children(1), children(2))

Review Comment:
   > There is a small issue here. The aggregate requires the first input to be 
a Long. `DataFrameStatFunctions.bloomFilter` supports `Byte`, `Short`, `Int`, 
`Long`, and `String`. While we can simply add a cast to long for the first 4, 
string will be an issue. We need adapt the BloomFilterAggregate to make it 
fully compatible.
   
   So maybe adding a new protobuf message is a simpler way? `BloomFilterAggregate` is an internal function used by `InjectRuntimeFilter`, and `InjectRuntimeFilter` hashes both the input column and the might-contain values, so I think there is no String-type input in that scenario.
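   For illustration, a hedged sketch of the "cast to long" idea mentioned in the quoted comment (an assumption, not code from the PR); String inputs would still need changes in `BloomFilterAggregate` itself:
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
   import org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate
   import org.apache.spark.sql.types.LongType

   // Cast the (numeric) input to LongType before handing it to the aggregate.
   def bloomAgg(child: Expression, items: Expression, bits: Expression): Expression =
     new BloomFilterAggregate(Cast(child, LongType), items, bits).toAggregateExpression()
   ```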



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] chong0929 commented on a diff in pull request #40341: [SPARK-42715][SQL] Tips for Optimizing NegativeArraySizeException

2023-03-09 Thread via GitHub


chong0929 commented on code in PR #40341:
URL: https://github.com/apache/spark/pull/40341#discussion_r1131896943


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##
@@ -204,7 +204,12 @@ public void initBatch(
* by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch 
columns.
*/
   private boolean nextBatch() throws IOException {
-recordReader.nextBatch(wrap.batch());
+try {

Review Comment:
   Thoughtful, I will add a test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


HyukjinKwon commented on code in PR #40357:
URL: https://github.com/apache/spark/pull/40357#discussion_r1131893994


##
dev/create-release/release-tag.sh:
##
@@ -122,6 +122,12 @@ if ! is_dry_run; then
   git push origin $RELEASE_TAG
   if [[ $RELEASE_VERSION != *"preview"* ]]; then
 git push origin HEAD:$GIT_BRANCH

Review Comment:
   Hm, so to clarify, we set `set -e` on the top, meaning that the script will 
fail immediately if any command fails with non-zero exit.
   
   I assume that there's a case when `git push origin $RELEASE_TAG` and `git 
push origin HEAD:$GIT_BRANCH` are successfully executed but the tag still 
doesn't exist? If that's the case LGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [WIP][SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-09 Thread via GitHub


LuciferYang commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1131893211


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1073,6 +1074,12 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 3 =>
+val children = 
fun.getArgumentsList.asScala.toSeq.map(transformExpression)
+Some(
+  new BloomFilterAggregate(children.head, children(1), children(2))

Review Comment:
   > You will need to hash the input column.
   
   like `new BloomFilterAggregate(new XxHash64(Seq(children.head)), `
   
   But if we hash the input column, I think it will be hashed twice ... the 
existing test case 
   
   ```
   assert(0.until(1000).forall(filter1.mightContain))
   ```
   
   will not pass



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-09 Thread via GitHub


beliefer commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1131890754


##
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##
@@ -195,6 +197,17 @@ message Expression {
   DataType elementType = 1;
   repeated Literal element = 2;
 }
+
+message Map {
+  DataType keyType = 1;
+  DataType valueType = 2;
+  map map_data = 3;

Review Comment:
   Yeah. So we can support the key as any literal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463145544

   maybe there is a conflict right after my last commit, let me rebase


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40349: [SPARK-42725][CONNECT][PYTHON] Make LiteralExpression support array params

2023-03-09 Thread via GitHub


zhengruifeng commented on PR #40349:
URL: https://github.com/apache/spark/pull/40349#issuecomment-1463120914

   merged into master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40349: [SPARK-42725][CONNECT][PYTHON] Make LiteralExpression support array params

2023-03-09 Thread via GitHub


zhengruifeng closed pull request #40349: [SPARK-42725][CONNECT][PYTHON] Make 
LiteralExpression support array params
URL: https://github.com/apache/spark/pull/40349


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131870627


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -281,6 +281,53 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
 )
   }
 
+  metadataColumnsTest("metadata propagates through projections automatically",

Review Comment:
   This change is for the general metadata col framework, not file source metadata columns; we can add the tests in `MetadataColumnSuite`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-09 Thread via GitHub


cloud-fan commented on code in PR #40321:
URL: https://github.com/apache/spark/pull/40321#discussion_r1131869082


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1033,9 +1033,12 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 requiredAttrIds.contains(a.exprId)) =>
 s.withMetadataColumns()
   case p: Project if p.metadataOutput.exists(a => 
requiredAttrIds.contains(a.exprId)) =>
+// Inject the requested metadata columns into the project's output, if 
not already present.

Review Comment:
   do we hit a real issue with this? If the metadata col is already in the 
output, this code path should not be triggered as we have `val metaCols = 
getMetadataAttributes(node).filterNot(inputAttrs.contains)`
   
   What we can improve is to only include metadata columns that are included in 
`requiredAttrIds`.
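   For illustration, a hedged sketch of that refinement (variable names follow the quoted rule; an assumption, not code from the PR):
   ```scala
   // Only append the metadata columns that were actually requested and are not
   // already present in the project list.
   val requestedMetaCols = p.metadataOutput
     .filter(a => requiredAttrIds.contains(a.exprId))
     .filterNot(a => p.projectList.exists(_.exprId == a.exprId))
   p.copy(projectList = p.projectList ++ requestedMetaCols)
   ```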



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


HyukjinKwon commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463085558

   Seems like the compilation didn't pass. Let me just quickly revert this and reopen.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-09 Thread via GitHub


itholic commented on code in PR #40282:
URL: https://github.com/apache/spark/pull/40282#discussion_r1127483352


##
python/docs/source/development/errors.rst:
##
@@ -0,0 +1,92 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===
+Error conditions in PySpark
+===
+
+This is a list of common, named error conditions returned by PySpark which are 
defined at `error_classes.py 
`_.
+
+When writing PySpark errors, developers must use an error condition from the 
list. If an appropriate error condition is not available, add a new one into 
the list. For more information, please refer to `Contributing Error and 
Exception 
`_.
+
+++--+

Review Comment:
   Or maybe you want to organize all the error classes that exist in the JVM and Python on one page?
   
   IMHO, it is better to document them separately in each document because most 
error classes on the JVM side are SQL-related error classes including SQLSTATE, 
and Python error classes are error classes for Python-specific types and values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-09 Thread via GitHub


itholic commented on PR #40282:
URL: https://github.com/apache/spark/pull/40282#issuecomment-1463083039

   Documentation for the SQL side has been merged via https://github.com/apache/spark/pull/40336.
   
   Note that the Python side is simpler compared to the SQL side because we do not have SQLSTATE, and there is currently no main error class with sub-error classes. Also, the overall volume of errors is not as high as in the SQL documents.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #39624: [SPARK-42101][SQL] Make AQE support InMemoryTableScanExec

2023-03-09 Thread via GitHub


ulysses-you commented on code in PR #39624:
URL: https://github.com/apache/spark/pull/39624#discussion_r1130968837


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -220,10 +221,28 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def getExecutionId: Option[Long] = {
-// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
-// belongs to another (parent) query, and we should not call update UI in 
this query.
 
Option(context.session.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY))
-  .map(_.toLong).filter(SQLExecution.getQueryExecution(_) eq context.qe)
+  .map(_.toLong)
+  }
+
+  private lazy val shouldUpdatePlan: Boolean = {
+// If the `QueryExecution` does not match the current execution ID, it 
means the execution ID
+// belongs to another (parent) query, and we should call update metrics 
instead of plan in
+// this query. For example:
+//
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 0, no execution id)
+//   |
+//  InMemoryTableScanExec
+//   |
+//  ...
+//   |
+//  AdaptiveSparkPlanExec (query execution 1, execution id 0)

Review Comment:
   They are two query executions. How can the outer AQE work through IMR if we replace the nested AQE with `CacheTableQueryStageExec`?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463076663

   This is a bug fix for a new feature in 3.4, so I won't call it a release 
blocker. I've set the fixed version to 3.4.0; if RC3 passes, I'll change it to 
3.4.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan commented on PR #40333:
URL: https://github.com/apache/spark/pull/40333#issuecomment-1463075399

   The failed `BasicSchedulerIntegrationSuite` is not related to this PR. I'm 
merging this to master/3.4; thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE

2023-03-09 Thread via GitHub


cloud-fan closed pull request #40333: [SPARK-42702][SPARK-42623][SQL] Support 
parameterized query in subquery and CTE
URL: https://github.com/apache/spark/pull/40333


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-09 Thread via GitHub


HyukjinKwon commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1131838714


##
docs/spark-connect-overview.md:
##
@@ -0,0 +1,244 @@
+---
+layout: global
+title: Spark Connect Overview
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+**Building client-side Spark applications**
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
+architecture that allows remote connectivity to Spark clusters using the
+DataFrame API and unresolved logical plans as the protocol. The separation
+between client and server allows Spark and its open ecosystem to be
+leveraged from everywhere. It can be embedded in modern data applications,
+in IDEs, Notebooks and programming languages.
+
+To get started, see [Quickstart: Spark 
Connect](api/python/getting_started/quickstart_connect.html).
+
+
+  
+
+
+# How Spark Connect works
+
+The Spark Connect client library is designed to simplify Spark application
+development. It is a thin API that can be embedded everywhere: in application
+servers, IDEs, notebooks, and programming languages. The Spark Connect API
+builds on Spark's DataFrame API using unresolved logical plans as a
+language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved
+logical query plans which are encoded using protocol buffers. These are sent
+to the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server receives and
+translates unresolved logical plans into Spark's logical plan operators.
+This is similar to parsing a SQL query, where attributes and relations are
+parsed and an initial parse plan is built. From there, the standard Spark
+execution process kicks in, ensuring that Spark Connect leverages all of
+Spark's optimizations and enhancements. Results are streamed back to the
+client via gRPC as Apache Arrow-encoded row batches.
+
+
+  
+
+
+# Operational benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact their
+own environment as they can run in their own processes. Users can define their
+own dependencies on the client and don't need to worry about potential 
conflicts
+with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded 
independently
+of applications, e.g. to benefit from performance improvements and security 
fixes.
+This means applications can be forward-compatible, as long as the server-side 
RPC
+definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive 
debugging
+during development directly from your favorite IDE. Similarly, applications can
+be monitored using the application's framework native metrics and logging 
libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark and 
Scala
+applications. We will walk through how to run an Apache Spark server with Spark
+Connect and connect to it from a client application using the Spark Connect 
client
+library.
+
+## Download and start Spark server with Spark Connect
+
+First, download Spark from the
+[Download Apache Spark](https://spark.apache.org/downloads.html) page. Spark 
Connect
+was introduced in Apache Spark version 3.4 so make sure you choose 3.4.0 or 
newer in
+the release drop down at the top of the page. Then choose your package type, 
typically
+“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.
+
+Now extract the Spark package you just downloaded on your computer, for 
example:
+
+{% highlight bash %}
+tar -xvf spark-3.4.0-bin-hadoop3.tgz
+{% endhighlight %}
+
+In a terminal window, now go to the `spark` folder in the location where you 
extracted
+Spark before and run the `start-connect-server.sh` script to start Spark 
server with
+Spark Connect, like in this example:
+
+{% highlight bash %}
+./sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.4.0
+{% endhighlight %}
+
+Note that we include a Spark Connect package 
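
   Once the server above is running, a minimal client-side sketch of connecting to it 
(assuming the PySpark 3.4 client with the Spark Connect dependencies installed):

   ```python
   from pyspark.sql import SparkSession

   # "sc://localhost" targets the default Spark Connect port (15002) on this machine.
   spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
   spark.range(5).show()
   ```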

[GitHub] [spark] HyukjinKwon closed pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-09 Thread via GitHub


HyukjinKwon closed pull request #40302: [SPARK-42686][CORE] Defer formatting 
for debug messages in TaskMemoryManager
URL: https://github.com/apache/spark/pull/40302


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-09 Thread via GitHub


HyukjinKwon commented on PR #40302:
URL: https://github.com/apache/spark/pull/40302#issuecomment-1463061643

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on pull request #40353: [SPARK-42732][PYSPARK][CONNECT] Support spark connect session getActiveSession method

2023-03-09 Thread via GitHub


WeichenXu123 commented on PR #40353:
URL: https://github.com/apache/spark/pull/40353#issuecomment-1463060514

   > @WeichenXu123 in what case won't Spark Connect ML have access to the 
session?
   
   For some APIs, like `estimator.fit(dataset)` and `model.transform(dataset)`, we 
can get the session from the input Spark DataFrame. In other cases, e.g. getting a 
model attribute, there is no input DataFrame, so we need `getActiveSession` to get 
the session and then send the attribute request to the server side.
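   
   A rough sketch of the two access patterns being described (function names and 
bodies are illustrative, not the actual Spark Connect ML code):
   
   ```python
   from pyspark.sql import DataFrame, SparkSession

   def transform(dataset: DataFrame) -> DataFrame:
       # A DataFrame is in hand, so the session is reachable from it directly.
       session = dataset.sparkSession
       return dataset  # placeholder for the real transform request sent via `session`

   def read_model_attribute(name: str):
       # No DataFrame is involved, so the client falls back to the active session.
       session = SparkSession.getActiveSession()
       assert session is not None, "an active session is required to reach the server"
       return session.conf.get("spark.app.name")  # placeholder for the real attribute request
   ```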


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #40350: [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow`

2023-03-09 Thread via GitHub


HyukjinKwon closed pull request #40350: [SPARK-42726][CONNECT][PYTHON] 
Implement `DataFrame.mapInArrow`
URL: https://github.com/apache/spark/pull/40350


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40350: [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow`

2023-03-09 Thread via GitHub


HyukjinKwon commented on PR #40350:
URL: https://github.com/apache/spark/pull/40350#issuecomment-1463058011

   The test failure seems unrelated.
   
   Merged to master and branch-3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng closed pull request #40357: [WIP][SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng closed pull request #40357: [WIP][SPARK-42739][BUILD] Ensure 
release tag to be pushed to release branch
URL: https://github.com/apache/spark/pull/40357


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


zhengruifeng commented on PR #40356:
URL: https://github.com/apache/spark/pull/40356#issuecomment-1463045799

   merged to master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter

2023-03-09 Thread via GitHub


zhengruifeng closed pull request #40356: [SPARK-42733][CONNECT][PYTHON] Fix 
DataFrameWriter.save to work without path parameter
URL: https://github.com/apache/spark/pull/40356


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng opened a new pull request, #40357: [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch

2023-03-09 Thread via GitHub


xinrong-meng opened a new pull request, #40357:
URL: https://github.com/apache/spark/pull/40357

   ### What changes were proposed in this pull request?
   In the release script, add a check to ensure the release tag is pushed to the 
release branch.
   
   
   ### Why are the changes needed?
   To ensure the success of an RC cut; otherwise, release conductors have to 
check that manually.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Manual test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale

2023-03-09 Thread via GitHub


github-actions[bot] commented on PR #38739:
URL: https://github.com/apache/spark/pull/38739#issuecomment-1463026755

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1131780862


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -244,6 +245,89 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   parameters = Map("fieldName" -> "`file_name`", "fields" -> "`id`, 
`university`"))
   }
 
+  metadataColumnsTest("df metadataColumn - schema conflict",
+schemaWithNameConflicts) { (df, f0, f1) =>
+// the user data has the schema: name, age, _metadata.id, 
_metadata.university
+
+// get the real metadata column (whose name should have been adjusted)
+val metadataColumn = df.metadataColumn("_metadata")
+assert(metadataColumn.expr.asInstanceOf[NamedExpression].name == 
"__metadata")
+
+// select user data
+checkAnswer(
+  df.select("name", "age", "_METADATA", "_metadata")
+.withColumn("file_name", metadataColumn.getField("file_name")),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), Row(12345L, "uom"), 
f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), Row(54321L, "ucb"), 
f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("df metadataColumn - no schema conflict",
+schema) { (df, f0, f1) =>
+// get the real metadata column (whose name should _NOT_ have been 
adjusted)
+val metadataColumn = df.metadataColumn("_metadata")
+assert(metadataColumn.expr.asInstanceOf[NamedExpression].name == 
"_metadata")
+
+// select user data
+checkAnswer(
+  df.select("name", "age")
+.withColumn("file_name", metadataColumn.getField("file_name")),
+  Seq(
+Row("jack", 24, f0(METADATA_FILE_NAME)),
+Row("lily", 31, f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("df metadataColumn - column not found", schema) { (df, 
f0, f1) =>
+// Not a column at all
+checkError(
+  exception = intercept[AnalysisException] {
+df.withMetadataColumn("foo")
+  },
+  errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
+  parameters = Map("objectName" -> "`foo`", "proposal" -> "`_metadata`"))
+
+// Name exists, but does not reference a metadata column
+checkError(
+  exception = intercept[AnalysisException] {
+df.withMetadataColumn("name")
+  },
+  errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
+  parameters = Map("objectName" -> "`name`", "proposal" -> "`_metadata`"))
+  }
+
+  metadataColumnsTest("metadata name conflict resolved with leading 
underscores - one",
+schemaWithNameConflicts) { (df, f0, f1) =>
+// the user data has the schema: name, age, _metadata.id, 
_metadata.university
+
+checkAnswer(
+  df.select("name", "age", "_metadata", "__metadata.file_name"),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("metadata name conflict resolved with leading 
underscores - several",
+new StructType()
+  .add(schema("name").copy(name = "_metadata"))
+  .add(schema("age").copy(name = "__metadata"))
+  .add(schema("info").copy(name = "___metadata"))) { (df, f0, f1) =>
+// the user data has the schema: _metadata, __metadata, ___metadata.id, 
___metadata.university
+
+checkAnswer(
+  df.select("_metadata", "__metadata", "___metadata", 
"metadata.file_name"),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), f1(METADATA_FILE_NAME))
+  )
+)
+  }
+

Review Comment:
   Went ahead and added a test case.
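   
   For context, a minimal sketch of how the hidden `_metadata` column is addressed 
today when there is no naming conflict (assuming a file-based source; the path is 
illustrative):
   
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # The hidden file-source metadata column is addressable by its logical name
   # when no user column conflicts with it.
   df = spark.read.parquet("/tmp/people")  # illustrative path
   df.select("_metadata.file_name", "_metadata.file_size").show()
   ```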



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-03-09 Thread via GitHub


amaliujia commented on code in PR #40315:
URL: https://github.com/apache/spark/pull/40315#discussion_r1131780390


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -737,12 +737,19 @@ class SparkSession private(
   // scalastyle:on
 
   /**
-   * Stop the underlying `SparkContext`.
+   * Stop the underlying `SparkContext` with default exit code 0.
*
* @since 2.0.0
*/
-  def stop(): Unit = {
-sparkContext.stop()
+  def stop(): Unit = stop(0)

Review Comment:
   I think it is better not to change this line. This line builds on 
`sparkContext.stop()`. Even though the underlying implementation is 
`SparkContext.stop(0)`, we'd better not assume that it will always be.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-09 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1131780602


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2714,6 +2726,17 @@ class Dataset[T] private[sql](
*/
   def withColumn(colName: String, col: Column): DataFrame = 
withColumns(Seq(colName), Seq(col))
 
+  /**
+   * Returns a new Dataset by selecting a metadata column with the given 
logical name.
+   *
+   * A metadata column can be accessed this way even if the underlying data 
source defines a data
+   * column with a conflicting name.
+   *
+   * @group untypedrel
+   * @since 4.0.0
+   */
+  def withMetadataColumn(colName: String): DataFrame = withColumn(colName, 
metadataColumn(colName))

Review Comment:
   Removed `withMetadataColumn` method for now. Nothing stops us from adding it 
in the future if it's needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


