[GitHub] [spark] allisonwang-db commented on a change in pull request #32787: [SPARK-35618][SQL] Resolve star expressions in subqueries using outer query plans

2021-06-07 Thread GitBox


allisonwang-db commented on a change in pull request #32787:
URL: https://github.com/apache/spark/pull/32787#discussion_r647164105



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
##
@@ -791,4 +791,28 @@ class AnalysisErrorSuite extends AnalysisTest {
       assertAnalysisError(plan, s"Correlated column is not allowed in predicate ($msg)" :: Nil)
     }
   }
+
+  test("SPARK-35618: Resolve star expressions in subquery") {

Review comment:
   Yes, currently only `Filter` can host outer references for correlated 
subqueries, and star expansion only happens when the node is either a `Project` 
or an `Aggregate` (`buildExpandedProjectList`). Lateral subquery examples make 
this clearer: 
   ```sql
   -- t: [a, b]
   SELECT * FROM t, LATERAL (SELECT t.*)  -- t.* should be resolved as t.a, t.b
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-06-07 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-856505385


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43975/
   





[GitHub] [spark] SparkQA commented on pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


SparkQA commented on pull request #32769:
URL: https://github.com/apache/spark/pull/32769#issuecomment-856505160


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43974/
   





[GitHub] [spark] HeartSaVioR commented on pull request #32653: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-06-07 Thread GitBox


HeartSaVioR commented on pull request #32653:
URL: https://github.com/apache/spark/pull/32653#issuecomment-856504425


   retest this, please





[GitHub] [spark] HeartSaVioR commented on a change in pull request #32653: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-06-07 Thread GitBox


HeartSaVioR commented on a change in pull request #32653:
URL: https://github.com/apache/spark/pull/32653#discussion_r647159789



##
File path: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala
##
@@ -95,15 +114,65 @@ private[kafka010] class KafkaMicroBatchStream(
   override def latestOffset(start: Offset, readLimit: ReadLimit): Offset = {
     val startPartitionOffsets = start.asInstanceOf[KafkaSourceOffset].partitionToOffsets
     latestPartitionOffsets = kafkaOffsetReader.fetchLatestOffsets(Some(startPartitionOffsets))
-    endPartitionOffsets = KafkaSourceOffset(readLimit match {
-      case rows: ReadMaxRows =>
-        rateLimit(rows.maxRows(), startPartitionOffsets, latestPartitionOffsets)
-      case _: ReadAllAvailable =>
-        latestPartitionOffsets
-    })
+
+    val limits: Seq[ReadLimit] = readLimit match {
+      case rows: CompositeReadLimit => rows.getReadLimits
+      case rows => Seq(rows)
+    }
+
+    val offsets = if (limits.exists(_.isInstanceOf[ReadAllAvailable])) {
+      // ReadAllAvailable has the highest priority
+      latestPartitionOffsets
+    } else {
+      val lowerLimit = limits.find(_.isInstanceOf[ReadMinRows]).map(_.asInstanceOf[ReadMinRows])
+      val upperLimit = limits.find(_.isInstanceOf[ReadMaxRows]).map(_.asInstanceOf[ReadMaxRows])
+
+      lowerLimit.flatMap { limit =>
+        // checking if we need to skip batch based on minOffsetPerTrigger criteria
+        val skipBatch = delayBatch(
+          limit.minRows, latestPartitionOffsets, startPartitionOffsets, limit.maxTriggerDelayMs)
+        if (skipBatch) {
+          logDebug(
+            s"Delaying batch as number of records available is less than minOffsetsPerTrigger")
+          Some(startPartitionOffsets)
+        } else {
+          None
+        }
+      }.orElse {
+        // checking if we need to adjust a range of offsets based on maxOffsetPerTrigger criteria
+        upperLimit.map { limit =>
+          rateLimit(limit.maxRows(), startPartitionOffsets, latestPartitionOffsets)
+        }
+      }.getOrElse(latestPartitionOffsets)
+    }
+
+    endPartitionOffsets = KafkaSourceOffset(offsets)
     endPartitionOffsets
   }
 
+  /** Checks if we need to skip this trigger based on minOffsetsPerTrigger & maxTriggerDelay */
+  private def delayBatch(
+      minLimit: Long,
+      latestOffsets: Map[TopicPartition, Long],
+      currentOffsets: Map[TopicPartition, Long],
+      maxTriggerDelayMs: Long): Boolean = {
+    // Checking first if the maxbatchDelay time has passed

Review comment:
   nit: It won't hurt if we only call `System.currentTimeMillis()` once and 
reuse it.
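
For illustration, a minimal sketch of reading the clock once and reusing the 
value. The body of `delayBatch` is not shown in this hunk, so `DelayChecker`, 
`lastTriggerMillis`, and the injected `clock` below are hypothetical names 
chosen for the sketch, not the PR's code:

```scala
// Hypothetical stand-in for the delay check; not the code from the PR.
final class DelayChecker(clock: () => Long) {
  // Time of the last trigger that was allowed to run (illustrative state).
  private var lastTriggerMillis: Long = 0L

  /** Returns true if the batch should be delayed. */
  def delayBatch(minRows: Long, availableRows: Long, maxTriggerDelayMs: Long): Boolean = {
    // Read the clock once; the same value is used for the comparison and the update.
    val now = clock()
    val delayExpired = now - lastTriggerMillis >= maxTriggerDelayMs
    if (delayExpired || availableRows >= minRows) {
      // Either the maximum delay has passed or enough rows are available: run the batch.
      lastTriggerMillis = now
      false
    } else {
      true
    }
  }
}
```

In production code the clock would be `() => System.currentTimeMillis()`; 
passing it in as a function also makes the check easy to test with a fake time.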







[GitHub] [spark] HyukjinKwon commented on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


HyukjinKwon commented on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856501314


   retest this please





[GitHub] [spark] SparkQA commented on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


SparkQA commented on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856498062


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43973/
   





[GitHub] [spark] SparkQA commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


SparkQA commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856496277


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43971/
   





[GitHub] [spark] SparkQA commented on pull request #32815: [SPARK-35675][SQL] EnsureRequirements remove shuffle should respect PartitioningCollection

2021-06-07 Thread GitBox


SparkQA commented on pull request #32815:
URL: https://github.com/apache/spark/pull/32815#issuecomment-856495276


   **[Test build #139457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139457/testReport)** for PR 32815 at commit [`fa56cf7`](https://github.com/apache/spark/commit/fa56cf7223f1ea7f4342e9335e91b80182097be9).





[GitHub] [spark] maropu commented on a change in pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


maropu commented on a change in pull request #32769:
URL: https://github.com/apache/spark/pull/32769#discussion_r647153040



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala
##
@@ -42,9 +42,27 @@ case class ExpandExec(
   override lazy val metrics = Map(
     "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
 
-  // The GroupExpressions can output data with arbitrary partitioning, so set it
-  // as UNKNOWN partitioning
-  override def outputPartitioning: Partitioning = UnknownPartitioning(0)
+  /**
+   * The Expand is commonly introduced by the RewriteDistinctAggregates optimizer rule.
+   * In that case there can be several attributes that are kept as they are by the Expand.
+   * If the child's output is partitioned by those attributes, then so will be
+   * the output of the Expand.
+   * In general case the Expand can output data with arbitrary partitioning, so set it
+   * as UNKNOWN partitioning.
+   */
+  override def outputPartitioning: Partitioning = {
+    val stableAttrs = ExpressionSet(output.zipWithIndex.filter {
+      case (attr, i) => projections.forall(_(i).semanticEquals(attr))
+    }.map(_._1))
+
+    child.outputPartitioning match {

Review comment:
   Ah, I see.







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32788: [SPARK-35602][SS] Update state schema to be able to accept long length JSON

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32788:
URL: https://github.com/apache/spark/pull/32788#issuecomment-856493899


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43972/
   





[GitHub] [spark] SparkQA commented on pull request #32788: [SPARK-35602][SS] Update state schema to be able to accept long length JSON

2021-06-07 Thread GitBox


SparkQA commented on pull request #32788:
URL: https://github.com/apache/spark/pull/32788#issuecomment-856493864


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43972/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32788: [SPARK-35602][SS] Update state schema to be able to accept long length JSON

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32788:
URL: https://github.com/apache/spark/pull/32788#issuecomment-856493899


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43972/
   





[GitHub] [spark] ulysses-you opened a new pull request #32815: [SPARK-35675][SQL] EnsureRequirements remove shuffle should respect PartitioningCollection

2021-06-07 Thread GitBox


ulysses-you opened a new pull request #32815:
URL: https://github.com/apache/spark/pull/32815


   
   
   ### What changes were proposed in this pull request?
   
   Handle `PartitioningCollection` in `EnsureRequirements` when removing redundant shuffles.
   
   ### Why are the changes needed?
   
   Currently, `EnsureRequirements` removes a redundant shuffle only if the child 
has a semantically equal `HashPartitioning`.
   This check can be extended to `PartitioningCollection`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the query plan might change.
   
   ### How was this patch tested?
   
   Add test.
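
As a rough illustration of the idea (a toy model, not Spark's `Partitioning` 
classes or the actual rule): a child whose output partitioning is a collection 
already provides the required hash partitioning if any member of the 
collection matches. `HashPart`, `PartCollection`, and `ShuffleCheck.provides` 
are simplified names assumed for this sketch, and plain equality stands in for 
semantic equality:

```scala
// Toy model of partitionings; these are not Spark's classes.
sealed trait Part
final case class HashPart(keys: Seq[String], numPartitions: Int) extends Part
final case class PartCollection(partitionings: Seq[Part]) extends Part

object ShuffleCheck {
  /** True if `child` already provides `required`, looking inside collections too. */
  def provides(child: Part, required: HashPart): Boolean = child match {
    // A single hash partitioning must match the required one directly.
    case h: HashPart => h == required
    // A collection provides the requirement if any of its members does.
    case PartCollection(ps) => ps.exists(provides(_, required))
  }
}
```

With this model, a shuffle on top of a child reporting 
`PartCollection(Seq(HashPart(Seq("b"), 10), HashPart(Seq("a"), 10)))` would be 
redundant when the requirement is `HashPart(Seq("a"), 10)`.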





[GitHub] [spark] tanelk commented on a change in pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


tanelk commented on a change in pull request #32769:
URL: https://github.com/apache/spark/pull/32769#discussion_r647151531



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala
##
@@ -42,9 +42,27 @@ case class ExpandExec(
   override lazy val metrics = Map(
     "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
 
-  // The GroupExpressions can output data with arbitrary partitioning, so set it
-  // as UNKNOWN partitioning
-  override def outputPartitioning: Partitioning = UnknownPartitioning(0)
+  /**
+   * The Expand is commonly introduced by the RewriteDistinctAggregates optimizer rule.
+   * In that case there can be several attributes that are kept as they are by the Expand.
+   * If the child's output is partitioned by those attributes, then so will be
+   * the output of the Expand.
+   * In general case the Expand can output data with arbitrary partitioning, so set it
+   * as UNKNOWN partitioning.
+   */
+  override def outputPartitioning: Partitioning = {
+    val stableAttrs = ExpressionSet(output.zipWithIndex.filter {
+      case (attr, i) => projections.forall(_(i).semanticEquals(attr))
+    }.map(_._1))
+
+    child.outputPartitioning match {

Review comment:
   The `ProjectExec` that was inserted after the join managed to simplify it 
to `HashPartitioning` in the cases I could think of.
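
For readers following the hunk, the `stableAttrs` computation can be modelled 
on plain strings. This is only a sketch: `StableAttrs` is a hypothetical 
helper, and string equality stands in for `semanticEquals` on Catalyst 
expressions:

```scala
// Toy model: column expressions are plain strings rather than Catalyst expressions.
object StableAttrs {
  /** Output columns that every projection passes through unchanged at the same position. */
  def stableAttrs(output: Seq[String], projections: Seq[Seq[String]]): Set[String] =
    output.zipWithIndex.collect {
      // Keep the attribute only if all projections emit it as-is at index i.
      case (attr, i) if projections.forall(_(i) == attr) => attr
    }.toSet
}
```

For an output of `k, a, gid` and projections `(k, a, 0)` and `(k, null, 1)`, 
only `k` is stable, so only a child partitioning on `k` could carry over.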







[GitHub] [spark] eejbyfeldt commented on a change in pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


eejbyfeldt commented on a change in pull request #32783:
URL: https://github.com/apache/spark/pull/32783#discussion_r647149051



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -1181,21 +1176,25 @@ case class CatalystToExternalMap private(
     newMapBuilderMethod.invoke(moduleField).asInstanceOf[Builder[AnyRef, AnyRef]]
   }
 
+  private def keyValueIterator(md: MapData): Iterator[AnyRef] = {
+    val keyArray = md.keyArray()
+    val valueArray = md.valueArray()
+    val row = new GenericInternalRow(1)
+    0.until(md.numElements()).iterator.map { i =>
+      row.update(0, keyArray.get(i, inputMapType.keyType))
+      val key = keyLambdaFunction.eval(row)
+      row.update(0, valueArray.get(i, inputMapType.valueType))
+      val value = valueLambdaFunction.eval(row)
+      Tuple2(key, value)
+    }
+  }
+
   override def eval(input: InternalRow): Any = {
     val result = inputData.eval(input).asInstanceOf[MapData]
     if (result != null) {
       val builder = newMapBuilder()
       builder.sizeHint(result.numElements())
-      val keyArray = result.keyArray()
-      val valueArray = result.valueArray()
-      var i = 0
-      while (i < result.numElements()) {

Review comment:
   I am not sure whether it does or not, but I updated the PR to use a 
`while` loop instead, to be sure. The style is now also closer to what was 
there before.
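
A minimal sketch of the `while`-loop style, not the PR's code: plain arrays 
stand in for `MapData`, and `MapFromArrays` is a hypothetical helper. The loop 
feeds a pre-sized builder directly, without an intermediate iterator:

```scala
object MapFromArrays {
  /** Builds an immutable Map from parallel key/value arrays with an explicit while loop. */
  def build(keys: Array[Int], values: Array[String]): Map[Int, String] = {
    require(keys.length == values.length, "keys and values must have the same length")
    val builder = Map.newBuilder[Int, String]
    // Tell the builder how many entries to expect.
    builder.sizeHint(keys.length)
    var i = 0
    while (i < keys.length) {
      builder += (keys(i) -> values(i))
      i += 1
    }
    builder.result()
  }
}
```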







[GitHub] [spark] maropu commented on a change in pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


maropu commented on a change in pull request #32783:
URL: https://github.com/apache/spark/pull/32783#discussion_r647150106



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -1181,21 +1176,25 @@ case class CatalystToExternalMap private(
     newMapBuilderMethod.invoke(moduleField).asInstanceOf[Builder[AnyRef, AnyRef]]
   }
 
+  private def keyValueIterator(md: MapData): Iterator[AnyRef] = {
+    val keyArray = md.keyArray()
+    val valueArray = md.valueArray()
+    val row = new GenericInternalRow(1)
+    0.until(md.numElements()).iterator.map { i =>
+      row.update(0, keyArray.get(i, inputMapType.keyType))
+      val key = keyLambdaFunction.eval(row)
+      row.update(0, valueArray.get(i, inputMapType.valueType))
+      val value = valueLambdaFunction.eval(row)
+      Tuple2(key, value)
+    }
+  }
+
   override def eval(input: InternalRow): Any = {
     val result = inputData.eval(input).asInstanceOf[MapData]
     if (result != null) {
       val builder = newMapBuilder()
       builder.sizeHint(result.numElements())
-      val keyArray = result.keyArray()
-      val valueArray = result.valueArray()
-      var i = 0
-      while (i < result.numElements()) {

Review comment:
   The latest one looks fine. Thanks ;)







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32783:
URL: https://github.com/apache/spark/pull/32783#issuecomment-855167204


   Can one of the admins verify this patch?





[GitHub] [spark] SparkQA commented on pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


SparkQA commented on pull request #32783:
URL: https://github.com/apache/spark/pull/32783#issuecomment-856490236


   **[Test build #139456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139456/testReport)** for PR 32783 at commit [`9cc2484`](https://github.com/apache/spark/commit/9cc2484600374022bb76a95039e22a8c232a4700).





[GitHub] [spark] SparkQA commented on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


SparkQA commented on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856485101


   **[Test build #139455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139455/testReport)** for PR 32805 at commit [`70b86fc`](https://github.com/apache/spark/commit/70b86fc2277ca2246e526b57db091853be166d98).





[GitHub] [spark] SparkQA commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


SparkQA commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856485015


   **[Test build #139454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139454/testReport)** for PR 32807 at commit [`8e53b88`](https://github.com/apache/spark/commit/8e53b88de26915038a5aa10dddece692eb33efad).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856483830


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139443/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856483830


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139443/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32795: [SPARK-35588][PYTHON][DOCS] Update quickstart.ipynb to use pyspark.pandas

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32795:
URL: https://github.com/apache/spark/pull/32795#issuecomment-856483016


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139453/
   





[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-06-07 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-856483304


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43975/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856483018


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43966/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


SparkQA removed a comment on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856427751


   **[Test build #139443 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139443/testReport)** for PR 32812 at commit [`7d30f36`](https://github.com/apache/spark/commit/7d30f368a9c0543b7dad698b13f481902a83534e).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32513: [SPARK-35378][SQL] Eagerly execute commands in QueryExecution instead of caller sides

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32513:
URL: https://github.com/apache/spark/pull/32513#issuecomment-856483017


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139440/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856483018


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43966/
   





[GitHub] [spark] SparkQA commented on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


SparkQA commented on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856482909


   **[Test build #139443 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139443/testReport)**
 for PR 32812 at commit 
[`7d30f36`](https://github.com/apache/spark/commit/7d30f368a9c0543b7dad698b13f481902a83534e).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins commented on pull request #32795: [SPARK-35588][PYTHON][DOCS] Update quickstart.ipynb to use pyspark.pandas

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32795:
URL: https://github.com/apache/spark/pull/32795#issuecomment-856483016


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139453/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32513: [SPARK-35378][SQL] Eagerly execute commands in QueryExecution instead of caller sides

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32513:
URL: https://github.com/apache/spark/pull/32513#issuecomment-856483017


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139440/
   





[GitHub] [spark] SparkQA commented on pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


SparkQA commented on pull request #32769:
URL: https://github.com/apache/spark/pull/32769#issuecomment-856482578


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43974/
   





[GitHub] [spark] 137alpha commented on pull request #32813: [SPARK-34591][MLLIB] Disable decision tree pruning

2021-06-07 Thread GitBox


137alpha commented on pull request #32813:
URL: https://github.com/apache/spark/pull/32813#issuecomment-856481729


   @srowen @asolimando @sethah Tagging you as you were heavy contributors 
to the original pull request that this bugfix undoes 
(https://github.com/apache/spark/pull/20632). I would be deeply grateful for 
your input and attention here.





[GitHub] [spark] allisonwang-db commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-06-07 Thread GitBox


allisonwang-db commented on a change in pull request #32303:
URL: https://github.com/apache/spark/pull/32303#discussion_r647142502



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##
@@ -871,7 +871,13 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
   override def visitFromClause(ctx: FromClauseContext): LogicalPlan = withOrigin(ctx) {
     val from = ctx.relation.asScala.foldLeft(null: LogicalPlan) { (left, relation) =>
       val right = plan(relation.relationPrimary)
-      val join = right.optionalMap(left)(Join(_, _, Inner, None, JoinHint.NONE))
+      val join = right.optionalMap(left) { (left, right) =>
+        if (relation.LATERAL != null) {
+          LateralJoin(left, LateralSubquery(right), Inner, None)

Review comment:
   Actually, does it make sense to have join hints for lateral joins? A 
lateral join is essentially a nested loop join. Ideally, the evaluation logic 
would be: for each row on the left side, plug the outer query attribute values 
into the outer references and evaluate the subquery. So it should only be 
planned as a (correlated) nested loop join. But since Spark doesn't support 
such execution, it first decorrelates the subquery and then rewrites the 
lateral join as a normal join. That rewrite is an implementation detail, so it 
doesn't seem to make sense to add join hints here.
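
   The "for each row in the left, evaluate the subquery" semantics described 
above can be sketched with plain Scala collections. This is only an 
illustrative in-memory model, not Spark's implementation: `Row`, `subquery`, 
and `lateralJoin` are hypothetical names, and the subquery is modeled as a 
function of the current outer row.

```scala
// Illustrative model only; none of these names come from Spark.
object LateralJoinSketch {
  // A row of the left relation t: [a, b].
  final case class Row(a: Int, b: Int)

  // The lateral subquery, modeled as a function of the outer row:
  // it returns a + b for outer rows with a > 1, and no rows otherwise.
  def subquery(outer: Row): Seq[Int] =
    if (outer.a > 1) Seq(outer.a + outer.b) else Seq.empty

  // Correlated nested-loop evaluation of an inner lateral join:
  // for each left row, evaluate the subquery with that row's values bound,
  // then pair the left row with every row the subquery returns.
  def lateralJoin[L, R](left: Seq[L])(sub: L => Seq[R]): Seq[(L, R)] =
    left.flatMap(l => sub(l).map(r => (l, r)))

  val t: Seq[Row] = Seq(Row(1, 10), Row(2, 20), Row(3, 30))

  // Row(1, 10) produces no subquery rows, so it drops out of the inner join.
  val result: Seq[(Row, Int)] = lateralJoin(t)(subquery)
}
```

   In this model the right side is re-evaluated per left row, so there is no 
independent right relation for a join strategy hint to apply to.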







[GitHub] [spark] sarutak commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


sarutak commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856476375


   retest this please.





[GitHub] [spark] SparkQA commented on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


SparkQA commented on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856475690


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43973/
   





[GitHub] [spark] SparkQA commented on pull request #32814: [SPARK-35664][SQL] Support java.time.LocalDateTime as an external type of TimestampWithoutTZ type

2021-06-07 Thread GitBox


SparkQA commented on pull request #32814:
URL: https://github.com/apache/spark/pull/32814#issuecomment-856474722


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43970/
   





[GitHub] [spark] SparkQA commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


SparkQA commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856474653


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43971/
   





[GitHub] [spark] SparkQA commented on pull request #32788: [SPARK-35602][SS] Update state schema to be able to accept long length JSON

2021-06-07 Thread GitBox


SparkQA commented on pull request #32788:
URL: https://github.com/apache/spark/pull/32788#issuecomment-856472535


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43972/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32513: [SPARK-35378][SQL] Eagerly execute commands in QueryExecution instead of caller sides

2021-06-07 Thread GitBox


SparkQA removed a comment on pull request #32513:
URL: https://github.com/apache/spark/pull/32513#issuecomment-856389959


   **[Test build #139440 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139440/testReport)**
 for PR 32513 at commit 
[`83d2710`](https://github.com/apache/spark/commit/83d27106a3e286547550c77f274ccdc5c4226391).





[GitHub] [spark] SparkQA removed a comment on pull request #32795: [SPARK-35588][PYTHON][DOCS] Update quickstart.ipynb to use pyspark.pandas

2021-06-07 Thread GitBox


SparkQA removed a comment on pull request #32795:
URL: https://github.com/apache/spark/pull/32795#issuecomment-856455317


   **[Test build #139453 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139453/testReport)**
 for PR 32795 at commit 
[`0bea115`](https://github.com/apache/spark/commit/0bea115293fa128ea2a36889e4721438181a0465).





[GitHub] [spark] SparkQA commented on pull request #32795: [SPARK-35588][PYTHON][DOCS] Update quickstart.ipynb to use pyspark.pandas

2021-06-07 Thread GitBox


SparkQA commented on pull request #32795:
URL: https://github.com/apache/spark/pull/32795#issuecomment-856469089


   **[Test build #139453 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139453/testReport)**
 for PR 32795 at commit 
[`0bea115`](https://github.com/apache/spark/commit/0bea115293fa128ea2a36889e4721438181a0465).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32513: [SPARK-35378][SQL] Eagerly execute commands in QueryExecution instead of caller sides

2021-06-07 Thread GitBox


SparkQA commented on pull request #32513:
URL: https://github.com/apache/spark/pull/32513#issuecomment-856469129


   **[Test build #139440 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139440/testReport)**
 for PR 32513 at commit 
[`83d2710`](https://github.com/apache/spark/commit/83d27106a3e286547550c77f274ccdc5c4226391).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] maropu commented on a change in pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


maropu commented on a change in pull request #32769:
URL: https://github.com/apache/spark/pull/32769#discussion_r647131837



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala
##
@@ -42,9 +42,27 @@ case class ExpandExec(
   override lazy val metrics = Map(
     "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
 
-  // The GroupExpressions can output data with arbitrary partitioning, so set it
-  // as UNKNOWN partitioning
-  override def outputPartitioning: Partitioning = UnknownPartitioning(0)
+  /**
+   * The Expand is commonly introduced by the RewriteDistinctAggregates optimizer rule.
+   * In that case there can be several attributes that are kept as they are by the Expand.
+   * If the child's output is partitioned by those attributes, then so will be
+   * the output of the Expand.
+   * In general case the Expand can output data with arbitrary partitioning, so set it
+   * as UNKNOWN partitioning.
+   */
+  override def outputPartitioning: Partitioning = {
+    val stableAttrs = ExpressionSet(output.zipWithIndex.filter {
+      case (attr, i) => projections.forall(_(i).semanticEquals(attr))
+    }.map(_._1))
+
+    child.outputPartitioning match {

Review comment:
   Can't we use the shuffled join path for tests?  
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledJoin.scala#L48-L49
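
   The `stableAttrs` computation in the hunk above can be sketched outside 
Spark as follows. This is a simplified model under stated assumptions: 
expressions are plain strings, string equality stands in for 
`semanticEquals`, and the column names and projections are made up for 
illustration.

```scala
// Simplified model of the diff's stableAttrs logic; not Spark code.
object StableAttrsSketch {
  // An output attribute is "stable" if every projection passes it through
  // unchanged at the same position.
  def stableAttrs(output: Seq[String], projections: Seq[Seq[String]]): Set[String] =
    output.zipWithIndex.collect {
      case (attr, i) if projections.forall(p => p(i) == attr) => attr
    }.toSet

  // Hypothetical Expand with two projections: only "k" is kept as-is by both,
  // while "x" and "y" are nulled out in one projection each and "gid" is a literal.
  val output: Seq[String] = Seq("k", "x", "y", "gid")
  val projections: Seq[Seq[String]] = Seq(
    Seq("k", "x", "null", "1"),
    Seq("k", "null", "y", "2"))

  val stable: Set[String] = stableAttrs(output, projections)
}
```

   If the child is partitioned only by attributes in `stable`, every copy of a 
row emitted by the projections stays in the partition its key already 
selected, which is the property the new `outputPartitioning` relies on.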







[GitHub] [spark] maropu commented on a change in pull request #32659: [SPARK-22639][SQL] Support aggregate cbo stats estimation if the group by clause involves substring

2021-06-07 Thread GitBox


maropu commented on a change in pull request #32659:
URL: https://github.com/apache/spark/pull/32659#discussion_r647121080



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
##
@@ -80,6 +80,54 @@ object EstimationUtils {
     expressions.collect {
       case alias @ Alias(attr: Attribute, _) if attributeStats.contains(attr) =>
         alias.toAttribute -> attributeStats(attr)
+      case alias @ Alias(expn: Expression, _) if isExpressionStatsExist(expn, attributeStats) =>
+        getExpressionStats(alias.toAttribute, expn, attributeStats)
+    }
+  }
+
+  // Support for substring expressions.
+  // TODO: Support for more expressions like Multiply.
+  private def isExpressionStatsExist(
+      expn: Expression,

Review comment:
   `expn` -> `expr`

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
##
@@ -80,6 +80,54 @@ object EstimationUtils {
     expressions.collect {
       case alias @ Alias(attr: Attribute, _) if attributeStats.contains(attr) =>
         alias.toAttribute -> attributeStats(attr)
+      case alias @ Alias(expn: Expression, _) if isExpressionStatsExist(expn, attributeStats) =>
+        getExpressionStats(alias.toAttribute, expn, attributeStats)
+    }
+  }
+
+  // Support for substring expressions.
+  // TODO: Support for more expressions like Multiply.

Review comment:
   Why do we need to handle individual expressions here? For aggregate stat 
estimation, can't we just use upper-bound stat values from a child plan in 
`AggregateEstimation`?

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
##
@@ -80,6 +80,54 @@ object EstimationUtils {
     expressions.collect {
       case alias @ Alias(attr: Attribute, _) if attributeStats.contains(attr) =>
         alias.toAttribute -> attributeStats(attr)
+      case alias @ Alias(expn: Expression, _) if isExpressionStatsExist(expn, attributeStats) =>

Review comment:
   Why did you update this method instead of `AggregateEstimation`? 
`Project` also uses this method. Is this change related to projections?
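
   The upper-bound idea raised in the comments above rests on a simple fact: 
applying a deterministic function such as a substring to a column cannot 
increase its number of distinct values. A minimal sketch with plain Scala 
collections (the data and the `ndv` helper are hypothetical, and this is not 
how Spark stores column statistics):

```scala
// Illustrative only: shows that ndv(substring(col)) <= ndv(col).
object NdvUpperBoundSketch {
  // Number of distinct values in a column.
  def ndv[A](values: Seq[A]): Int = values.distinct.size

  val col: Seq[String] = Seq("apple", "apricot", "banana", "blueberry", "cherry")

  // Group-by key derived from the column: its first character.
  val prefixes: Seq[String] = col.map(_.substring(0, 1))

  val colNdv: Int = ndv(col)
  val prefixNdv: Int = ndv(prefixes)
}
```

   Under this assumption the child column's distinct count is always a safe, 
if loose, upper bound for the derived group-by key.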







[GitHub] [spark] SparkQA commented on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


SparkQA commented on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856463079


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43966/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856458336


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43967/
   





[GitHub] [spark] SparkQA commented on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


SparkQA commented on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856458323


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43967/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856458336


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43967/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856456977


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43968/
   





[GitHub] [spark] SparkQA commented on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


SparkQA commented on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856456954


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43968/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856456977


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43968/
   





[GitHub] [spark] SparkQA commented on pull request #32795: [SPARK-35588][PYTHON][DOCS] Update quickstart.ipynb to use pyspark.pandas

2021-06-07 Thread GitBox


SparkQA commented on pull request #32795:
URL: https://github.com/apache/spark/pull/32795#issuecomment-856455317


   **[Test build #139453 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139453/testReport)**
 for PR 32795 at commit 
[`0bea115`](https://github.com/apache/spark/commit/0bea115293fa128ea2a36889e4721438181a0465).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856454817


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139444/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856454817


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139444/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


SparkQA removed a comment on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856427834


   **[Test build #139444 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139444/testReport)**
 for PR 32805 at commit 
[`8fedbd5`](https://github.com/apache/spark/commit/8fedbd5ebe7807a6bd9b774da5d93d6e3924f67a).





[GitHub] [spark] SparkQA commented on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


SparkQA commented on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856454449


   **[Test build #139444 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139444/testReport)**
 for PR 32805 at commit 
[`8fedbd5`](https://github.com/apache/spark/commit/8fedbd5ebe7807a6bd9b774da5d93d6e3924f67a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856454137


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139448/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


SparkQA removed a comment on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856451502


   **[Test build #139448 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139448/testReport)**
 for PR 32807 at commit 
[`8e53b88`](https://github.com/apache/spark/commit/8e53b88de26915038a5aa10dddece692eb33efad).





[GitHub] [spark] SparkQA commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


SparkQA commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856454109


   **[Test build #139448 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139448/testReport)**
 for PR 32807 at commit 
[`8e53b88`](https://github.com/apache/spark/commit/8e53b88de26915038a5aa10dddece692eb33efad).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856454137


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139448/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32795: [SPARK-35588][PYTHON][DOCS] Update quickstart.ipynb to use pyspark.pandas

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32795:
URL: https://github.com/apache/spark/pull/32795#issuecomment-856433282


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43965/
   





[GitHub] [spark] yaooqinn commented on pull request #32791: [SPARK-34290][SQL][FOLLOWUP] Cleanup truncate table not supported for V2Table error

2021-06-07 Thread GitBox


yaooqinn commented on pull request #32791:
URL: https://github.com/apache/spark/pull/32791#issuecomment-856452387


   thanks, merged to master 





[GitHub] [spark] yaooqinn closed pull request #32791: [SPARK-34290][SQL][FOLLOWUP] Cleanup truncate table not supported for V2Table error

2021-06-07 Thread GitBox


yaooqinn closed pull request #32791:
URL: https://github.com/apache/spark/pull/32791


   





[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-06-07 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-856451925


   **[Test build #139452 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139452/testReport)**
 for PR 32303 at commit 
[`d646720`](https://github.com/apache/spark/commit/d646720edf977ea50ac8f273eea75b045915038e).





[GitHub] [spark] SparkQA commented on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


SparkQA commented on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856451650


   **[Test build #139450 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139450/testReport)**
 for PR 32786 at commit 
[`ee147ec`](https://github.com/apache/spark/commit/ee147ec2657a1229ea235d71b7d508cae5a83a08).





[GitHub] [spark] SparkQA commented on pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


SparkQA commented on pull request #32769:
URL: https://github.com/apache/spark/pull/32769#issuecomment-856451723


   **[Test build #139451 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139451/testReport)**
 for PR 32769 at commit 
[`b3475f4`](https://github.com/apache/spark/commit/b3475f4f42cbe5a513fa6496e2d5c5d8b156b350).





[GitHub] [spark] SparkQA commented on pull request #32788: [SPARK-35602][SS] Update state schema to be able to accept long length JSON

2021-06-07 Thread GitBox


SparkQA commented on pull request #32788:
URL: https://github.com/apache/spark/pull/32788#issuecomment-856451620


   **[Test build #139449 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139449/testReport)**
 for PR 32788 at commit 
[`d660573`](https://github.com/apache/spark/commit/d66057348750553e438cc48faa962f545bfe2ca9).





[GitHub] [spark] SparkQA commented on pull request #32814: [SPARK-35664][SQL] Support java.time.LocalDateTime as an external type of TimestampWithoutTZ type

2021-06-07 Thread GitBox


SparkQA commented on pull request #32814:
URL: https://github.com/apache/spark/pull/32814#issuecomment-856451528


   **[Test build #139447 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139447/testReport)**
 for PR 32814 at commit 
[`1101f55`](https://github.com/apache/spark/commit/1101f5550f7dd032fec39a211cb7e4bc345e565b).





[GitHub] [spark] SparkQA commented on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


SparkQA commented on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856451502


   **[Test build #139448 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139448/testReport)**
 for PR 32807 at commit 
[`8e53b88`](https://github.com/apache/spark/commit/8e53b88de26915038a5aa10dddece692eb33efad).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32785: [SPARK-35601][PYTHON] Support arithmetic operations against bool literals

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32785:
URL: https://github.com/apache/spark/pull/32785#issuecomment-856450035


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43963/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856450037


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139445/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856450034


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43969/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856450037


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139445/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856450034


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43969/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32785: [SPARK-35601][PYTHON] Support arithmetic operations against bool literals

2021-06-07 Thread GitBox


AmplabJenkins commented on pull request #32785:
URL: https://github.com/apache/spark/pull/32785#issuecomment-856450035


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43963/
   





[GitHub] [spark] maropu commented on a change in pull request #32787: [SPARK-35618][SQL] Resolve star expressions in subqueries using outer query plans

2021-06-07 Thread GitBox


maropu commented on a change in pull request #32787:
URL: https://github.com/apache/spark/pull/32787#discussion_r647115051



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
##
@@ -791,4 +791,28 @@ class AnalysisErrorSuite extends AnalysisTest {
   assertAnalysisError(plan, s"Correlated column is not allowed in 
predicate ($msg)" :: Nil)
 }
   }
+
+  test("SPARK-35618: Resolve star expressions in subquery") {

Review comment:
   I read the PR description and thought this PR was meant to accept new query 
patterns, but it only has negative test cases?







[GitHub] [spark] SparkQA commented on pull request #32812: [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout

2021-06-07 Thread GitBox


SparkQA commented on pull request #32812:
URL: https://github.com/apache/spark/pull/32812#issuecomment-856445369


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43966/
   





[GitHub] [spark] q2w commented on a change in pull request #32766: [SPARK-35627][CORE] Decommission executors in batches to not overload network bandwidth

2021-06-07 Thread GitBox


q2w commented on a change in pull request #32766:
URL: https://github.com/apache/spark/pull/32766#discussion_r647114128



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##
@@ -519,10 +558,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 
scheduler.sc.env.blockManager.master.decommissionBlockManagers(executorsToDecommission)
 
 if (!triggeredByExecutor) {
-  executorsToDecommission.foreach { executorId =>
-logInfo(s"Notify executor $executorId to decommissioning.")
-executorDataMap(executorId).executorEndpoint.send(DecommissionExecutor)
-  }

Review comment:
   No, I haven't seen this in a public cloud. We ran into this issue in a 
private cloud that had a longer timeout for forceful node removal. That was the 
motivation for this PR: to give the user some control over the decommissioning 
process.
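
   To make the batching idea in the PR title concrete, here is a minimal sketch 
in plain Scala. It is not the code from this PR: `DecommissionBatching`, 
`batchSize`, `intervalMs` and `sendNotice` are hypothetical names chosen for 
illustration, and the real change would live in CoarseGrainedSchedulerBackend 
and use the executor RPC endpoints instead of a callback.

```scala
// Hypothetical sketch only; names and parameters are assumptions, not Spark APIs.
object DecommissionBatching {

  /** Splits executor IDs into batches of at most `batchSize` elements. */
  def toBatches(executorIds: Seq[String], batchSize: Int): List[Seq[String]] = {
    require(batchSize > 0, "batchSize must be positive")
    executorIds.grouped(batchSize).toList
  }

  /** Notifies one batch at a time, pausing `intervalMs` between batches. */
  def decommissionInBatches(
      executorIds: Seq[String],
      batchSize: Int,
      intervalMs: Long)(sendNotice: String => Unit): Unit = {
    val batches = toBatches(executorIds, batchSize)
    batches.zipWithIndex.foreach { case (batch, i) =>
      batch.foreach(sendNotice)
      // Pause only between batches, not after the last one.
      if (i < batches.size - 1) Thread.sleep(intervalMs)
    }
  }
}
```

   With a batch size of 2, five executors would be notified in three rounds, so 
block migration traffic from at most two executors starts at the same time.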







[GitHub] [spark] allisonwang-db commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-06-07 Thread GitBox


allisonwang-db commented on a change in pull request #32303:
URL: https://github.com/apache/spark/pull/32303#discussion_r647113231



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##
@@ -871,7 +871,13 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with 
SQLConfHelper with Logg
   override def visitFromClause(ctx: FromClauseContext): LogicalPlan = 
withOrigin(ctx) {
 val from = ctx.relation.asScala.foldLeft(null: LogicalPlan) { (left, 
relation) =>
   val right = plan(relation.relationPrimary)
-  val join = right.optionalMap(left)(Join(_, _, Inner, None, 
JoinHint.NONE))
+  val join = right.optionalMap(left) { (left, right) =>
+if (relation.LATERAL != null) {
+  LateralJoin(left, LateralSubquery(right), Inner, None)

Review comment:
   Good point. I will add it.







[GitHub] [spark] SparkQA commented on pull request #32805: [SPARK-35666][ML] gemv skip array shape checking

2021-06-07 Thread GitBox


SparkQA commented on pull request #32805:
URL: https://github.com/apache/spark/pull/32805#issuecomment-856442577


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43967/
   





[GitHub] [spark] allisonwang-db commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-06-07 Thread GitBox


allisonwang-db commented on a change in pull request #32303:
URL: https://github.com/apache/spark/pull/32303#discussion_r647112288



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
##
@@ -315,19 +314,28 @@ object PullupCorrelatedPredicates extends 
Rule[LogicalPlan] with PredicateHelper
   case ListQuery(sub, children, exprId, childOutputs, conditions) if 
children.nonEmpty =>
 val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans)
 ListQuery(newPlan, children, exprId, childOutputs, 
getJoinCondition(newCond, conditions))
+  case LateralSubquery(sub, children, exprId, conditions) if 
children.nonEmpty =>
+val (newPlan, newCond) = decorrelate(sub, outerPlans)
+LateralSubquery(newPlan, children, exprId, getJoinCondition(newCond, 
conditions))
 }
   }
 
   /**
* Pull up the correlated predicates and rewrite all subqueries in an 
operator tree..
*/
   def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithPruning(
-_.containsAnyPattern(SCALAR_SUBQUERY, EXISTS_SUBQUERY, LIST_SUBQUERY)) {
+_.containsPattern(PLAN_EXPRESSION)) {
 case f @ Filter(_, a: Aggregate) =>
   rewriteSubQueries(f, Seq(a, a.child))
-// Only a few unary nodes (Project/Filter/Aggregate) can contain 
subqueries.
+// Only a few unary nodes (Project/Filter/Aggregate/LateralJoin) can 
contain subqueries.
 case q: UnaryNode =>
-  rewriteSubQueries(q, q.children)
+  val newPlan = rewriteSubQueries(q, q.children)
+  // Preserve the original output of the node.
+  if (newPlan.output != q.output) {

Review comment:
   Nice!







[GitHub] [spark] SparkQA commented on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


SparkQA commented on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856441509


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43968/
   





[GitHub] [spark] 137alpha commented on pull request #32813: [SPARK-34591][MLLIB] Disable decision tree pruning

2021-06-07 Thread GitBox


137alpha commented on pull request #32813:
URL: https://github.com/apache/spark/pull/32813#issuecomment-856441296


   Hello, I am the author of the Jira ticket 
https://issues.apache.org/jira/browse/SPARK-34591. 
   
   In my view, the behaviour described in the ticket is a serious problem - it 
makes the DecisionTreeClassifier and the RandomForestClassifier seriously 
unreliable for probability estimation problems for Spark 2.4.0 and all later 
versions.
   
   Additionally, the original implementation of the feature did not update the 
Spark ML documentation to describe this non-standard modification to the tree 
algorithm. The only way I could trace the behaviour (given that it was in 
conflict with the Spark documentation) was to examine every Jira ticket 
referenced in the release notes after Spark 2.3.0 (where I knew this problem 
did not exist) to identify ones that might be responsible.
   
   In my own experience, I have three clients who have been directly affected 
by this issue.
   
   The Jira ticket gives a minimal example with "maximally worst" behaviour - a 
tree that is pruned (outside the user's control) so that there are no splits at 
all.
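
   A small self-contained sketch of why this kind of pruning hurts probability 
estimates. This is a toy model in plain Scala, not Spark's implementation: it 
assumes the pruning rule is "merge sibling leaves that predict the same class", 
and `PruningSketch`, `Leaf`, `Split` and the class counts are made up for 
illustration.

```scala
// Toy model only; assumes pruning merges sibling leaves with the same predicted class.
object PruningSketch {
  sealed trait Node { def counts: Vector[Double] }

  /** A leaf holding per-class training counts. */
  final case class Leaf(counts: Vector[Double]) extends Node

  /** An internal node; its counts are the element-wise sum of its children. */
  final case class Split(left: Node, right: Node) extends Node {
    def counts: Vector[Double] =
      left.counts.zip(right.counts).map { case (a, b) => a + b }
  }

  /** Index of the majority class. */
  def predictedClass(n: Node): Int = n.counts.indexOf(n.counts.max)

  /** Class probabilities estimated from the counts. */
  def probabilities(n: Node): Vector[Double] = {
    val total = n.counts.sum
    n.counts.map(_ / total)
  }

  /** Bottom-up: collapse a split whose two leaves predict the same class. */
  def prune(n: Node): Node = n match {
    case Split(l, r) =>
      (prune(l), prune(r)) match {
        case (pl: Leaf, pr: Leaf) if predictedClass(pl) == predictedClass(pr) =>
          Leaf(Split(pl, pr).counts)
        case (pl, pr) => Split(pl, pr)
      }
    case leaf: Leaf => leaf
  }
}
```

   For example, two leaves with counts (40, 60) and (10, 90) both predict class 
1, but with probabilities 0.6 and 0.9. After `prune` they become a single leaf 
with counts (50, 150), so every row gets probability 0.75 and the split that 
separated the two groups is gone.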





[GitHub] [spark] cloud-fan commented on a change in pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


cloud-fan commented on a change in pull request #32807:
URL: https://github.com/apache/spark/pull/32807#discussion_r647111792



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
##
@@ -699,20 +699,25 @@ abstract class PushableColumnBase {
 
   def unapply(e: Expression): Option[String] = {
 import 
org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
-def helper(e: Expression): Option[Seq[String]] = e match {
-  case a: Attribute =>
-// Attribute that contains dot "." in name is supported only when
-// nested predicate pushdown is enabled.
-if (nestedPredicatePushdownEnabled || !a.name.contains(".")) {
-  Some(Seq(a.name))
-} else {
-  None
-}
-  case s: GetStructField if nestedPredicatePushdownEnabled =>
-helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
-  case _ => None
+if (nestedPredicatePushdownEnabled) {

Review comment:
   cc @dbtsai @viirya 







[GitHub] [spark] cloud-fan commented on a change in pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


cloud-fan commented on a change in pull request #32807:
URL: https://github.com/apache/spark/pull/32807#discussion_r647111703



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
##
@@ -699,20 +699,25 @@ abstract class PushableColumnBase {
 
   def unapply(e: Expression): Option[String] = {
 import 
org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
-def helper(e: Expression): Option[Seq[String]] = e match {
-  case a: Attribute =>
-// Attribute that contains dot "." in name is supported only when
-// nested predicate pushdown is enabled.
-if (nestedPredicatePushdownEnabled || !a.name.contains(".")) {
-  Some(Seq(a.name))
-} else {
-  None
-}
-  case s: GetStructField if nestedPredicatePushdownEnabled =>
-helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
-  case _ => None
+if (nestedPredicatePushdownEnabled) {

Review comment:
   Note that:
   1. nestedPredicatePushdownEnabled is always true for DS v2 (by default)
   2. nestedPredicatePushdownEnabled is never true for DS v1
   3. For file sources, nestedPredicatePushdownEnabled is true only for Parquet 
and ORC (by default)
   
   After changing the quoting logic:
   1. DS v1 is not affected
   2. File sources are built in, so we are fine
   3. DS v2 will be affected if the column name contains special chars.
   
   Personally, I think the new quoting behavior is better (closer to ANSI SQL), 
and most v2 implementations won't be affected as they already need to deal with 
quoted names.
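
   To illustrate what "quoted names" means for a v2 source, here is a minimal 
sketch in plain Scala. It is an assumption about the general shape of the 
quoting rule, not the exact logic in this PR: `QuotingSketch` and its methods 
are hypothetical, and the rule used here (quote any part that is not purely 
letters, digits and underscores) may differ from what Spark actually does.

```scala
// Hypothetical sketch; the quoting rule is an assumption, not Spark's exact logic.
object QuotingSketch {
  // A name part is left unquoted only if it is letters, digits and underscores.
  private val Simple = "[a-zA-Z0-9_]+".r

  /** Wraps a name part in backticks when needed, doubling embedded backticks. */
  def quoteIfNeeded(part: String): String =
    if (Simple.pattern.matcher(part).matches()) part
    else "`" + part.replace("`", "``") + "`"

  /** Joins a multi-part column reference into one dotted, quoted string. */
  def quoted(parts: Seq[String]): String =
    parts.map(quoteIfNeeded).mkString(".")
}
```

   Under this rule the nested field `Seq("a", "b")` is pushed down as `a.b`, 
while a top-level column literally named `a.b` is pushed down with backticks 
around it, so the data source can tell the two apart.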







[GitHub] [spark] tanelk commented on a change in pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


tanelk commented on a change in pull request #32769:
URL: https://github.com/apache/spark/pull/32769#discussion_r647111719



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala
##
@@ -42,9 +42,27 @@ case class ExpandExec(
   override lazy val metrics = Map(
 "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output 
rows"))
 
-  // The GroupExpressions can output data with arbitrary partitioning, so set 
it
-  // as UNKNOWN partitioning
-  override def outputPartitioning: Partitioning = UnknownPartitioning(0)
+  /**
+   * The Expand is commonly introduced by the RewriteDistinctAggregates 
optimizer rule.
+   * In that case there can be several attributes that are kept as they are by 
the Expand.
+   * If the child's output is partitioned by those attributes, then so will be
+   * the output of the Expand.
+   * In general case the Expand can output data with arbitrary partitioning, 
so set it
+   * as UNKNOWN partitioning.
+   */
+  override def outputPartitioning: Partitioning = {
+val stableAttrs = ExpressionSet(output.zipWithIndex.filter {
+  case (attr, i) => projections.forall(_(i).semanticEquals(attr))
+}.map(_._1))
+
+child.outputPartitioning match {

Review comment:
   Added that case, but I was not able to construct a test case for it.
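
   For readers following the diff above, the "stable attributes" idea can be 
sketched without Spark. This is a simplified model, not the PR's code: 
expressions are represented as plain strings, string equality stands in for 
`semanticEquals`, and `StableAttrsSketch` and `outputPartitionKeys` are 
hypothetical names.

```scala
// Simplified model; strings stand in for Catalyst expressions (an assumption).
object StableAttrsSketch {

  /** Output columns that every projection passes through unchanged. */
  def stableAttrs(output: Seq[String], projections: Seq[Seq[String]]): Set[String] =
    output.zipWithIndex.collect {
      case (attr, i) if projections.forall(p => p(i) == attr) => attr
    }.toSet

  /**
   * Keeps the child's hash-partitioning keys only if all of them are stable;
   * None models the fallback to unknown partitioning.
   */
  def outputPartitionKeys(childKeys: Seq[String], stable: Set[String]): Option[Seq[String]] =
    if (childKeys.nonEmpty && childKeys.forall(stable.contains)) Some(childKeys) else None
}
```

   With output (k, a, gid) and projections (k, a, 1) and (k, null, 2), only `k` 
is stable, so a child partitioned by `k` keeps its partitioning while a child 
partitioned by `a` does not.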







[GitHub] [spark] maropu commented on pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


maropu commented on pull request #32783:
URL: https://github.com/apache/spark/pull/32783#issuecomment-856440404


   cc: @viirya 





[GitHub] [spark] maropu commented on pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


maropu commented on pull request #32783:
URL: https://github.com/apache/spark/pull/32783#issuecomment-856440290


   Nice catch, @eejbyfeldt, and thank you for your contribution.





[GitHub] [spark] allisonwang-db commented on pull request #32787: [SPARK-35618][SQL] Resolve star expressions in subqueries using outer query plans

2021-06-07 Thread GitBox


allisonwang-db commented on pull request #32787:
URL: https://github.com/apache/spark/pull/32787#issuecomment-856440257


   cc @cloud-fan 





[GitHub] [spark] maropu commented on a change in pull request #32783: [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values

2021-06-07 Thread GitBox


maropu commented on a change in pull request #32783:
URL: https://github.com/apache/spark/pull/32783#discussion_r647110803



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -1181,21 +1176,25 @@ case class CatalystToExternalMap private(
 newMapBuilderMethod.invoke(moduleField).asInstanceOf[Builder[AnyRef, AnyRef]]
   }
 
+  private def keyValueIterator(md: MapData): Iterator[AnyRef] = {
+val keyArray = md.keyArray()
+val valueArray = md.valueArray()
+val row = new GenericInternalRow(1)
+0.until(md.numElements()).iterator.map { i =>
+  row.update(0, keyArray.get(i, inputMapType.keyType))
+  val key = keyLambdaFunction.eval(row)
+  row.update(0, valueArray.get(i, inputMapType.valueType))
+  val value = valueLambdaFunction.eval(row)
+  Tuple2(key, value)
+}
+  }
+
   override def eval(input: InternalRow): Any = {
 val result = inputData.eval(input).asInstanceOf[MapData]
 if (result != null) {
   val builder = newMapBuilder()
   builder.sizeHint(result.numElements())
-  val keyArray = result.keyArray()
-  val valueArray = result.valueArray()
-  var i = 0
-  while (i < result.numElements()) {

Review comment:
   We tend to use `while` for perf-intensive code. Are we sure the proposed 
code does not cause any perf overhead? 
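
To make the question concrete, here is a minimal sketch of the two loop shapes being compared, written against plain arrays rather than Spark's `MapData`. `LoopStyleSketch`, `buildWithWhile`, and `buildWithIterator` are hypothetical names used only for illustration.

```scala
object LoopStyleSketch {
  // Index-based while loop: elements are appended directly, with no closure
  // invoked per element and no intermediate Iterator.
  def buildWithWhile(keys: Array[Int], values: Array[String]): Map[Int, String] = {
    val builder = Map.newBuilder[Int, String]
    builder.sizeHint(keys.length)
    var i = 0
    while (i < keys.length) {
      builder += (keys(i) -> values(i))
      i += 1
    }
    builder.result()
  }

  // Iterator-based version: same result, but each element goes through a
  // closure passed to map before reaching the builder.
  def buildWithIterator(keys: Array[Int], values: Array[String]): Map[Int, String] = {
    val builder = Map.newBuilder[Int, String]
    builder.sizeHint(keys.length)
    builder ++= keys.indices.iterator.map(i => keys(i) -> values(i))
    builder.result()
  }
}
```

Both versions build the same map and both allocate one tuple per entry; whether the extra indirection in the iterator version is measurable would need a benchmark on the actual code path.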







[GitHub] [spark] gengliangwang opened a new pull request #32814: [SPARK-35664][SQL] Support java.time.LocalDateTime as an external type of TimestampWithoutTZ type

2021-06-07 Thread GitBox


gengliangwang opened a new pull request #32814:
URL: https://github.com/apache/spark/pull/32814


   
   
   ### What changes were proposed in this pull request?
   
   In this PR, I propose to extend the Spark SQL API to accept 
java.time.LocalDateTime as an external type of the recently added Catalyst type 
TimestampWithoutTZ. The Java class java.time.LocalDateTime has semantics 
similar to the ANSI SQL timestamp without time zone type, and it is the most 
suitable external type for TimestampWithoutTZType. In more detail:
   
   * Added TimestampWithoutTZConverter, which converts java.time.LocalDateTime 
instances to/from the internal representation of the Catalyst type 
TimestampWithoutTZType (a Long). The TimestampWithoutTZConverter object 
uses new methods of DateTimeUtils:
 * localDateTimeToMicros() converts the input date time to the total length 
in microseconds. 
 * microsToLocalDateTime() obtains a java.time.LocalDateTime from a 
microsecond count.
   * Supported the new type TimestampWithoutTZType in RowEncoder via the methods 
createDeserializerForLocalDateTime() and createSerializerForLocalDateTime().
   * Extended the Literal API to construct literals from 
java.time.LocalDateTime instances.
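
The two DateTimeUtils methods named above can be pictured with a short sketch. The method names come from the description; the bodies below are an assumption (the local date-time is interpreted as if it were at UTC and counted in microseconds since 1970-01-01T00:00:00), not the code from the PR, and `LocalDateTimeMicrosSketch` is a hypothetical container object.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object LocalDateTimeMicrosSketch {
  private val MicrosPerSecond = 1000000L
  private val NanosPerMicro = 1000L

  // Assumed conversion: whole seconds since the epoch (read at UTC) scaled to
  // microseconds, plus the sub-second part truncated to microseconds.
  def localDateTimeToMicros(ldt: LocalDateTime): Long = {
    val seconds = ldt.toEpochSecond(ZoneOffset.UTC)
    Math.addExact(Math.multiplyExact(seconds, MicrosPerSecond), ldt.getNano / NanosPerMicro)
  }

  // Inverse of the above; floorDiv/floorMod keep the sub-second part
  // non-negative for values before the epoch.
  def microsToLocalDateTime(micros: Long): LocalDateTime = {
    val seconds = Math.floorDiv(micros, MicrosPerSecond)
    val nanos = Math.floorMod(micros, MicrosPerSecond) * NanosPerMicro
    LocalDateTime.ofEpochSecond(seconds, nanos.toInt, ZoneOffset.UTC)
  }
}
```

With these definitions, a value with microsecond precision round-trips unchanged, while anything finer than a microsecond is truncated on the way in.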
   
   ### Why are the changes needed?
   
   To allow users to parallelize java.time.LocalDateTime collections and 
construct timestamp without time zone columns, and also to collect such columns 
back to the driver side.
   
   ### Does this PR introduce _any_ user-facing change?
   
   The PR extends existing functionality. So, users can parallelize instances 
of the java.time.LocalDateTime class and collect them back.
   
   ### How was this patch tested?
   
   New unit tests





[GitHub] [spark] SparkQA commented on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


SparkQA commented on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856439833


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43969/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


SparkQA removed a comment on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856427821


   **[Test build #139445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139445/testReport)** for PR 32804 at commit [`ec9e127`](https://github.com/apache/spark/commit/ec9e127caa335ae1512714b61f2a1a8e5e67392a).





[GitHub] [spark] SparkQA commented on pull request #32804: [SPARK-26867][YARN] Spark Support of YARN Placement Constraint

2021-06-07 Thread GitBox


SparkQA commented on pull request #32804:
URL: https://github.com/apache/spark/pull/32804#issuecomment-856436747


   **[Test build #139445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139445/testReport)** for PR 32804 at commit [`ec9e127`](https://github.com/apache/spark/commit/ec9e127caa335ae1512714b61f2a1a8e5e67392a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] tanelk commented on a change in pull request #32769: [SPARK-35630][SQL] ExpandExec should not introduce unnecessary exchanges

2021-06-07 Thread GitBox


tanelk commented on a change in pull request #32769:
URL: https://github.com/apache/spark/pull/32769#discussion_r647107606



##
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##
@@ -4003,6 +4003,56 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
 }
 checkAnswer(sql(s"select /*+ REPARTITION(3, a) */ a b from values('123') t(a)"), Row("123"))
   }
+
+  test("SPARK-35630: ExpandExec should not introduce unnecessary exchanges") {
+withTable("test_table") {
+  spark.range(11)
+.withColumn("group1", $"id" % 2)
+.withColumn("group2", $"id" % 4)
+.withColumn("a", $"id" % 3)
+.withColumn("b", $"id" % 6)
+.write.saveAsTable("test_table")

Review comment:
   I simplified it even a bit further







[GitHub] [spark] sarutak commented on pull request #32786: [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions.

2021-06-07 Thread GitBox


sarutak commented on pull request #32786:
URL: https://github.com/apache/spark/pull/32786#issuecomment-856436191


   retest this please.





[GitHub] [spark] sarutak edited a comment on pull request #32807: [SPARK-35669][SQL] Fix special char in CSV header with filter pushdown

2021-06-07 Thread GitBox


sarutak edited a comment on pull request #32807:
URL: https://github.com/apache/spark/pull/32807#issuecomment-856411972


   #31964 is mostly about the display format of queries, so we can revert it 
safely, and I don't mind. But even if we revert it, the potential problem is 
still present, isn't it?
   This problem happens with ```col("`a``b`")``` even before #31964.
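
The column reference in the line above names a column whose raw name contains a backtick. A minimal sketch of that escaping convention follows, assuming identifiers are delimited by backticks and an embedded backtick is written twice; `BacktickQuotingSketch`, `quote`, and `unquote` are hypothetical helpers, not Spark APIs.

```scala
object BacktickQuotingSketch {
  // Wraps a raw column name in backticks, doubling any embedded backtick.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // Reverses quote() for a well-formed quoted identifier.
  def unquote(quoted: String): String = {
    require(
      quoted.length >= 2 && quoted.startsWith("`") && quoted.endsWith("`"),
      "expected an identifier wrapped in backticks")
    quoted.substring(1, quoted.length - 1).replace("``", "`")
  }
}
```

Any code path that rebuilds a quoted name from a raw one (or the reverse) has to apply the same doubling rule consistently, which is where names containing special characters tend to go wrong.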
   
   




