[GitHub] [spark] AmplabJenkins removed a comment on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-647291121










[GitHub] [spark] AmplabJenkins commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-647291121










[GitHub] [spark] SparkQA commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-06-21 Thread GitBox


SparkQA commented on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-647290651


   **[Test build #124352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124352/testReport)** for PR 28840 at commit [`5d5fe71`](https://github.com/apache/spark/commit/5d5fe71f61d596062545a7e9cc0efd023840307f).






[GitHub] [spark] agrawaldevesh commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-21 Thread GitBox


agrawaldevesh commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r443318618



##
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##
@@ -157,7 +158,8 @@ class BlockManagerMasterEndpoint(
   context.reply(true)
 
 case DecommissionBlockManagers(executorIds) =>

Review comment:
   I wanted to confirm my understanding that this PR does not do anything about shuffle files written by executors that have since finished? That would be the External Shuffle Service's responsibility (if enabled), right? And that piece of the puzzle isn't implemented yet?
   
   My question stems from how we handle this scenario (consider the standalone scheduler mode here):
   - Executor wrote some shuffle data and has the external shuffle service enabled.
   - Executor finished normally and nothing is running on the worker.
   - DecommissionExecutor is not even called because the executor does not exist.
   - Thus DecommissionBlockManager isn't called and no migrations take place.
   
   Therefore fetch failures would happen for other tasks that try to read this executor's shuffle data (via the External Shuffle Service). I believe fixing this is not in the scope of this PR.
   
   Can you please check this understanding? Thanks.
   

##
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##
@@ -489,6 +491,24 @@ class BlockManagerMasterEndpoint(
   storageLevel: StorageLevel,
   memSize: Long,
   diskSize: Long): Boolean = {
+logInfo(s"Updating block info on master ${blockId} for ${blockManagerId}")
+
+if (blockId.isInternalShuffle) {
+  blockId match {
+case ShuffleIndexBlockId(shuffleId, mapId, _) =>
+  // Don't update the map output on just the index block
+  logDebug(s"Received shuffle index block update for ${shuffleId} 
${mapId}, ignoring.")
+  return true
+case ShuffleDataBlockId(shuffleId: Int, mapId: Long, reduceId: Int) =>
+  logInfo(s"Received shuffle data block update for ${shuffleId} 
${mapId}, updating.")

Review comment:
   Should this be a logDebug ?

##
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala
##
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import java.util.concurrent.ExecutorService
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+import scala.util.control.NonFatal
+
+import org.apache.spark._
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config
+import org.apache.spark.shuffle.{MigratableResolver, ShuffleBlockInfo}
+import org.apache.spark.storage.BlockManagerMessages.ReplicateBlock
+import org.apache.spark.util.ThreadUtils
+
+/**
+ * Class to handle block manager decommissioning retries.
+ * It creates a Thread to retry offloading all RDD cache and Shuffle blocks
+ */
+private[storage] class BlockManagerDecommissioner(
+  conf: SparkConf,
+  bm: BlockManager) extends Logging {
+
+  private val maxReplicationFailuresForDecommission =
+    conf.get(config.STORAGE_DECOMMISSION_MAX_REPLICATION_FAILURE_PER_BLOCK)
+
+  /**
+   * This runnable consumes any shuffle blocks in the queue for migration. It is part of a
+   * producer/consumer pattern, where the main migration loop periodically updates the queue of
+   * blocks to be migrated. On migration failure, the current thread reinserts the block for
+   * another thread to consume. Each thread migrates blocks to a different executor, to
+   * distribute the blocks as quickly as possible without overwhelming any particular executor.
+   *
+   * There is no preference for which peer a given block is migrated to.
+   * This is notably different from the RDD cache block migration (further down in this file),
+   * which uses the existing priority mechanism for determining where to replicate blocks.
+   * Generally speaking cache blocks are less impactful as they normally represent narrow
+   * transformations and we normally have less cache 

[GitHub] [spark] cloud-fan commented on pull request #28780: [SPARK-31952][SQL]Fix incorrect memory spill metric when doing Aggregate

2020-06-21 Thread GitBox


cloud-fan commented on pull request #28780:
URL: https://github.com/apache/spark/pull/28780#issuecomment-647288933


   > number of in-memory bytes spilled
   
   what's the difference between it and `spill (disk)`? IIUC `spill` means we 
dump data from memory to disk.
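
   For readers unfamiliar with the two counters: they live on Spark's `TaskMetrics` as `memoryBytesSpilled` and `diskBytesSpilled`, and the usual reading is that the former is the in-memory size of the data that got spilled while the latter is its serialized size once written to disk. A minimal sketch, not from this PR, of where both can be observed (assumes an active `SparkSession` named `spark`):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print both spill counters as tasks finish. memoryBytesSpilled is measured
// before serialization and diskBytesSpilled after, so the two can differ widely.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"spill (memory) = ${m.memoryBytesSpilled} B, spill (disk) = ${m.diskBytesSpilled} B")
    }
  }
})
```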






[GitHub] [spark] cloud-fan commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


cloud-fan commented on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647286415


   repartition by key is another story and should support partition coalesce. 
If it's not the case now, please create a bug report in the JIRA, thanks!






[GitHub] [spark] AmplabJenkins removed a comment on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsException

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-647285293










[GitHub] [spark] AmplabJenkins commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsException

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-647285293










[GitHub] [spark] SparkQA commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


SparkQA commented on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647285180


   **[Test build #124351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124351/testReport)** for PR 28889 at commit [`62b3e79`](https://github.com/apache/spark/commit/62b3e79a49446ae8836567fbbdce4441a6e5c86a).






[GitHub] [spark] prakharjain09 commented on a change in pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors

2020-06-21 Thread GitBox


prakharjain09 commented on a change in pull request #28619:
URL: https://github.com/apache/spark/pull/28619#discussion_r443322288



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -1842,6 +1842,17 @@ package object config {
   .timeConf(TimeUnit.MILLISECONDS)
   .createOptional
 
+  private[spark] val EXECUTOR_DECOMMISSION_KILL_INTERVAL =
+    ConfigBuilder("spark.executor.decommission.killInterval")
+      .doc("Duration after which a decommissioned executor will be killed forcefully. " +
+        "This config is useful for cloud environments where we know in advance when " +
+        "an executor is going to go down after decommissioning signal, i.e. around 2 mins " +
+        "in AWS spot nodes, 1/2 hrs in spot block nodes etc. This config is currently " +

Review comment:
   @cloud-fan As per my understanding, worker decommissioning is currently triggered via the SIGPWR signal (and not via some message coming from the YARN/Kubernetes cluster manager), so getting this timeout from the cluster manager might not be possible. We might be able to do this once Spark's worker decommissioning logic starts being triggered via communication from YARN etc. in the future. cc @holdenk 
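
   For context, a minimal sketch (not from this PR) of how a user would set the knob added above; the config key comes from the diff, while the 120s value is an arbitrary example matching the ~2 min AWS spot-node warning mentioned in its doc string:

```scala
import org.apache.spark.sql.SparkSession

// Force-kill decommissioned executors after 120s, roughly the advance warning
// a cloud provider gives before reclaiming a spot node. Illustrative only:
// it assumes the decommissioning support this PR series is building.
val spark = SparkSession.builder()
  .appName("decommission-kill-interval-example")
  .config("spark.executor.decommission.killInterval", "120s")
  .getOrCreate()
```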








[GitHub] [spark] SparkQA removed a comment on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsException

2020-06-21 Thread GitBox


SparkQA removed a comment on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-647229068


   **[Test build #124343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124343/testReport)** for PR 26339 at commit [`717d9a5`](https://github.com/apache/spark/commit/717d9a56faf128e4864844cdf6341cb7b8731307).






[GitHub] [spark] SparkQA commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsException

2020-06-21 Thread GitBox


SparkQA commented on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-647284372


   **[Test build #124343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124343/testReport)** for PR 26339 at commit [`717d9a5`](https://github.com/apache/spark/commit/717d9a56faf128e4864844cdf6341cb7b8731307).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647282993










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647282993










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28890: [MINOR][SQL] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28890:
URL: https://github.com/apache/spark/pull/28890#issuecomment-647282918










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647268689


   Can one of the admins verify this patch?






[GitHub] [spark] AmplabJenkins commented on pull request #28890: [MINOR][SQL] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28890:
URL: https://github.com/apache/spark/pull/28890#issuecomment-647282918










[GitHub] [spark] HyukjinKwon commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647282568


   ok to test






[GitHub] [spark] SparkQA commented on pull request #28890: [MINOR][SQL] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate

2020-06-21 Thread GitBox


SparkQA commented on pull request #28890:
URL: https://github.com/apache/spark/pull/28890#issuecomment-647282523


   **[Test build #124350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124350/testReport)** for PR 28890 at commit [`1f915e4`](https://github.com/apache/spark/commit/1f915e41e8b02e5a714c6341c925da37a10ec8a1).






[GitHub] [spark] dilipbiswal commented on pull request #28890: [SQL][MINOR] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate

2020-06-21 Thread GitBox


dilipbiswal commented on pull request #28890:
URL: https://github.com/apache/spark/pull/28890#issuecomment-647281805


   cc @HyukjinKwon 






[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-21 Thread GitBox


viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647281932


   @cloud-fan Thanks for clarifying. The idea sounds worth exploring, as we can avoid the local shuffle under the current parallelism design of independent stages in AQE. I will explore the possibility.






[GitHub] [spark] dilipbiswal opened a new pull request #28890: [SQL][MINOR] Update expression description of ArrayFilter and ArrayExists to include examples of IS NULL and IS NOT null predicate

2020-06-21 Thread GitBox


dilipbiswal opened a new pull request #28890:
URL: https://github.com/apache/spark/pull/28890


   ### What changes were proposed in this pull request?
   A minor PR that adds a couple of usage examples for ArrayFilter and ArrayExists showing how to deal with NULL data.
   
   ### Why are the changes needed?
   Enhances the examples to show how to filter out null values from an array, and how to test whether a null value exists in an array.
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   Tested in command shell.
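
   For illustration, a sketch of the kind of examples being added, using Spark's built-in `filter` and `exists` higher-order functions in a hypothetical `spark-shell` session (result comments assume Spark 3.0 semantics):

```scala
// Drop nulls from an array: keep only elements where x IS NOT NULL.
spark.sql("SELECT filter(array(1, NULL, 3), x -> x IS NOT NULL)").show()
// => [1, 3]

// Test whether an array contains a null value.
spark.sql("SELECT exists(array(1, NULL, 3), x -> x IS NULL)").show()
// => true
```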






[GitHub] [spark] dbtsai commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource

2020-06-21 Thread GitBox


dbtsai commented on a change in pull request #27366:
URL: https://github.com/apache/spark/pull/27366#discussion_r443316503



##
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonFiltersSuite.scala
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import org.apache.spark.sql.catalyst.{StructFilters, StructFiltersSuite}
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.types.StructType
+
+class JsonFiltersSuite extends StructFiltersSuite {
+  def createFilters(filters: Seq[sources.Filter], schema: StructType): StructFilters = {
+    new JsonFilters(filters, schema)
+  }

Review comment:
   Do you think it's possible to implement nested situations? Although JSON is not a columnar format, if we are able to push down nested predicates, we will be able to avoid deserializing unnecessary data.
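
   To make the question concrete, a sketch of the kind of query involved; the schema, path, and column names are hypothetical. If the predicate on the nested field `meta.id` could be pushed into the JSON parser, rows failing it would not need `payload` fully deserialized:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical nested schema: only meta.id is filtered, so pushing the
// predicate into the parser could skip materializing payload for
// non-matching rows.
val df = spark.read
  .schema("meta STRUCT<id: INT, ts: LONG>, payload STRING")
  .json("/tmp/events.json")

df.where(col("meta.id") === 42).show()
```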








[GitHub] [spark] dbtsai commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource

2020-06-21 Thread GitBox


dbtsai commented on a change in pull request #27366:
URL: https://github.com/apache/spark/pull/27366#discussion_r443316503



##
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonFiltersSuite.scala
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import org.apache.spark.sql.catalyst.{StructFilters, StructFiltersSuite}
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.types.StructType
+
+class JsonFiltersSuite extends StructFiltersSuite {
+  def createFilters(filters: Seq[sources.Filter], schema: StructType): StructFilters = {
+    new JsonFilters(filters, schema)
+  }

Review comment:
   Do you think it's possible to extend to nested cases? Although JSON is not a columnar format, if we are able to push down nested predicates, we will be able to avoid deserializing unnecessary data.








[GitHub] [spark] koertkuipers edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


koertkuipers edited a comment on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612


   @cloud-fan so how can I repartition by a column while the number of partitions is set smartly (based on data size), instead of using some user-specified or hardcoded number of partitions?
   
   Repartitioning a dataframe by columns is fairly typical before writing to a partitioned file sink, to avoid too many files per directory. See for example:
   https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L472
   In these situations it's beneficial to write out the optimal number of files, not a fixed/hardcoded number...
   
   And personally, for repartition I would expect the optimal number of files to be written if AQE is enabled and I did not specify the number of partitions. That's why I was so confused by the current results. But that's just my opinion.
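
   For readers following the thread, a sketch of the two call shapes being contrasted; `df` and the output paths are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Explicit count: the user asked for exactly 100 shuffle partitions,
// so AQE must not coalesce them away.
df.repartition(100, col("date")).write.partitionBy("date").parquet("/out/fixed")

// Column-only: the partition count falls back to spark.sql.shuffle.partitions
// (or initialPartitionNum under AQE); whether AQE may then coalesce based on
// data size is exactly what this thread is debating.
df.repartition(col("date")).write.partitionBy("date").parquet("/out/auto")
```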






[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-21 Thread GitBox


cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647270331


   And as @maryannxue said, you may trigger the large side first and it doesn't make sense to hold off. Ideally we should trigger both sides and cancel the large side if the small side completes very quickly. It would be great if you could explore the cancellation approach.






[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-21 Thread GitBox


cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647269450


   > If first stage uses all resources, I think later stage still needs to held off?
   
   That's true, but that's an assumption. It's also possible that these 2 jobs indeed run together.
   
   > the speed-up of AQE is gained by triggering all stages (not holding off other stage as you said) together, or optimizing join from SMJ to BHJ (if we only consider join case)
   
   In the benchmark, the default parallelism takes all the CPU cores. I think the most perf gain should be from shuffle partition coalescing and SMJ -> BHJ. cc @JkSelf 
   
   That said, by design AQE triggers all independent stages at the same time, to maximize the parallelism. And it's helpful if the resource is sufficient (or auto-scaling). I don't think we should change this design.






[GitHub] [spark] AmplabJenkins commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647268689


   Can one of the admins verify this patch?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647268426


   Can one of the admins verify this patch?






[GitHub] [spark] AmplabJenkins commented on pull request #28889: [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28889:
URL: https://github.com/apache/spark/pull/28889#issuecomment-647268426


   Can one of the admins verify this patch?






[GitHub] [spark] koertkuipers edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


koertkuipers edited a comment on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612


   @cloud-fan so how can I repartition by a column while the number of partitions is set smartly (based on data size), instead of using some user-specified or hardcoded number of partitions?






[GitHub] [spark] koertkuipers edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


koertkuipers edited a comment on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612


   @cloud-fan so how can I repartition by a column while the number of partitions is set smartly instead of using some hardcoded value?






[GitHub] [spark] williamhyun opened a new pull request #28889: [SPARK-XXX][BUILD] Upgrade to Apache Commons Lang 3.10

2020-06-21 Thread GitBox


williamhyun opened a new pull request #28889:
URL: https://github.com/apache/spark/pull/28889


   ### What changes were proposed in this pull request?
   
   This PR aims to upgrade to Apache Commons Lang 3.10.
   
   ### Why are the changes needed?
   
   This will bring the latest bug fixes. 
   
https://commons.apache.org/proper/commons-lang/release-notes/RELEASE-NOTES-3.10.txt
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Pass the Jenkins. 






[GitHub] [spark] koertkuipers commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


koertkuipers commented on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647266612


   @cloud-fan so how can I repartition by a column without having to specify the number of partitions?
   






[GitHub] [spark] cloud-fan edited a comment on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


cloud-fan edited a comment on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647264492


   note: `repartition` is a physical operation where the user specifies how many partitions to have after the shuffle. That's why Spark can't coalesce the partitions, as that would break the user's expectation.
   
   Other optimizations can still happen, like SMJ to BHJ.






[GitHub] [spark] cloud-fan commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


cloud-fan commented on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647264492


   note: `repartition` is a physical operation where the user specifies how many partitions to have after the shuffle. That's why Spark can't coalesce the partitions.
   
   Other optimizations can still happen, like SMJ to BHJ.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647264059










[GitHub] [spark] SparkQA commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


SparkQA commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647263966


   **[Test build #124349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124349/testReport)** for PR 28880 at commit [`10b9f7e`](https://github.com/apache/spark/commit/10b9f7e3bbe3685ccd27e2d71ac50e40b17588f9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647264059










[GitHub] [spark] SparkQA removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


SparkQA removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647258478


   **[Test build #124349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124349/testReport)** for PR 28880 at commit [`10b9f7e`](https://github.com/apache/spark/commit/10b9f7e3bbe3685ccd27e2d71ac50e40b17588f9).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647258791










[GitHub] [spark] AmplabJenkins commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647258791










[GitHub] [spark] SparkQA commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


SparkQA commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647258478


   **[Test build #124349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124349/testReport)** for PR 28880 at commit [`10b9f7e`](https://github.com/apache/spark/commit/10b9f7e3bbe3685ccd27e2d71ac50e40b17588f9).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647257913










[GitHub] [spark] SparkQA commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


SparkQA commented on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647257837


   **[Test build #124346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124346/testReport)** for PR 28888 at commit [`23f3dda`](https://github.com/apache/spark/commit/23f3dda8b67c18716042c670ea5caa88eac68af1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


SparkQA removed a comment on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647250427


   **[Test build #124346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124346/testReport)** for PR 28888 at commit [`23f3dda`](https://github.com/apache/spark/commit/23f3dda8b67c18716042c670ea5caa88eac68af1).






[GitHub] [spark] AmplabJenkins commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647257913










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28869:
URL: https://github.com/apache/spark/pull/28869#issuecomment-647254915










[GitHub] [spark] AmplabJenkins commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28869:
URL: https://github.com/apache/spark/pull/28869#issuecomment-647254915










[GitHub] [spark] SparkQA commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


SparkQA commented on pull request #28869:
URL: https://github.com/apache/spark/pull/28869#issuecomment-647254624


   **[Test build #124348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124348/testReport)** for PR 28869 at commit [`04a2d44`](https://github.com/apache/spark/commit/04a2d4485021b8be1aa66936e2ca071bd52990a6).






[GitHub] [spark] Ngone51 commented on a change in pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


Ngone51 commented on a change in pull request #28869:
URL: https://github.com/apache/spark/pull/28869#discussion_r443302172



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/BaseAggregateExec.scala
##
@@ -53,15 +53,31 @@ trait BaseAggregateExec extends UnaryExecNode {
       // can't bind the `mergeExpressions` with the output of the partial aggregate, as they use
       // the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Instead,
       // we shall use `inputAggBufferAttributes` after copy to match the new `mergeExpressions`.
-      val aggAttrs = aggregateExpressions
-        // there're exactly four cases needs `inputAggBufferAttributes` from child according to the
-        // agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final,
-        // Partial -> PartialMerge, PartialMerge -> PartialMerge.
-        .filter(a => a.mode == Final || a.mode == PartialMerge).map(_.aggregateFunction)
-        .flatMap(_.inputAggBufferAttributes)
+      val aggAttrs = inputAggBufferAttributes
       child.output.dropRight(aggAttrs.length) ++ aggAttrs
     } else {
       child.output
     }
   }
+
+  private val inputAggBufferAttributes: Seq[Attribute] = {
+    aggregateExpressions
+      // there're exactly four cases needs `inputAggBufferAttributes` from child according to the
+      // agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final,
+      // Partial -> PartialMerge, PartialMerge -> PartialMerge.
+      .filter(a => a.mode == Final || a.mode == PartialMerge)
+      .flatMap(_.aggregateFunction.inputAggBufferAttributes)
+  }
+
+  protected val aggregateBufferAttributes: Seq[AttributeReference] = {
+    aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes)
+  }
+
+  override def producedAttributes: AttributeSet =
+    AttributeSet(aggregateAttributes) ++
+      AttributeSet(resultExpressions.diff(groupingExpressions).map(_.toAttribute)) ++
+      AttributeSet(aggregateBufferAttributes) ++
+      // it's not empty when the inputAggBufferAttributes is from the child Aggregate, which
+      // contains subquery in AggregateFunction. See SPARK-31620 for more details.
+      AttributeSet(inputAggBufferAttributes.filterNot(child.output.contains))

Review comment:
   Oh.. the comment should actually be:
   
   `it's not empty when the inputAggBufferAttributes is not equal to the aggregate buffer attributes of the child Aggregate, i.e. when the child Aggregate contains the subquery in AggregateFunction.`
   
   After SPARK-31620, the inputAggBufferAttributes is not from children but the node itself when there're Final/PartialMerge aggregate expressions, see `inputAttributes()` above. (But please note that agg attributes are still the same between the parent agg node and child agg node when there's no subquery in the agg expression.)
   
   Therefore, in the case of SPARK-31620, we actually use the attributes produced by the node itself. But for other cases, we still use the agg buffer attributes from the children, since `inputAggBufferAttributes` is equal to the agg buffer attributes from the children and so `inputAggBufferAttributes.filterNot(child.output.contains)` is empty.
   
   
   
   
   
   
   








[GitHub] [spark] Ngone51 commented on a change in pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


Ngone51 commented on a change in pull request #28869:
URL: https://github.com/apache/spark/pull/28869#discussion_r443302172



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/BaseAggregateExec.scala
##
@@ -53,15 +53,31 @@ trait BaseAggregateExec extends UnaryExecNode {
       // can't bind the `mergeExpressions` with the output of the partial aggregate, as they use
       // the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Instead,
       // we shall use `inputAggBufferAttributes` after copy to match the new `mergeExpressions`.
-      val aggAttrs = aggregateExpressions
-        // there're exactly four cases needs `inputAggBufferAttributes` from child according to the
-        // agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final,
-        // Partial -> PartialMerge, PartialMerge -> PartialMerge.
-        .filter(a => a.mode == Final || a.mode == PartialMerge).map(_.aggregateFunction)
-        .flatMap(_.inputAggBufferAttributes)
+      val aggAttrs = inputAggBufferAttributes
       child.output.dropRight(aggAttrs.length) ++ aggAttrs
     } else {
       child.output
     }
   }
+
+  private val inputAggBufferAttributes: Seq[Attribute] = {
+    aggregateExpressions
+      // there're exactly four cases needs `inputAggBufferAttributes` from child according to the
+      // agg planning in `AggUtils`: Partial -> Final, PartialMerge -> Final,
+      // Partial -> PartialMerge, PartialMerge -> PartialMerge.
+      .filter(a => a.mode == Final || a.mode == PartialMerge)
+      .flatMap(_.aggregateFunction.inputAggBufferAttributes)
+  }
+
+  protected val aggregateBufferAttributes: Seq[AttributeReference] = {
+    aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes)
+  }
+
+  override def producedAttributes: AttributeSet =
+    AttributeSet(aggregateAttributes) ++
+      AttributeSet(resultExpressions.diff(groupingExpressions).map(_.toAttribute)) ++
+      AttributeSet(aggregateBufferAttributes) ++
+      // it's not empty when the inputAggBufferAttributes is from the child Aggregate, which
+      // contains subquery in AggregateFunction. See SPARK-31620 for more details.
+      AttributeSet(inputAggBufferAttributes.filterNot(child.output.contains))

Review comment:
   Oh..the comment should actually be:
   
   `it's not empty when the child Aggregate contains the subquery in 
AggregateFunction.`
   
   
   
   After SPARK-31620, the inputAggBufferAttributes is not from the children but from the node itself when there are Final/PartialMerge aggregate expressions; see `inputAttributes()` above. (But please note that agg attributes are still the same between the parent agg node and the child agg node when there is no subquery in the agg expression.)
   
   
   Therefore, in the case of SPARK-31620, we actually use the attributes produced by the node itself. But for other cases, we still use the agg buffer attributes from the children, since `inputAggBufferAttributes` is equal to the agg buffer attributes from the children, and so `inputAggBufferAttributes.filterNot(child.output.contains)` is empty.
   
   
   
   
   
   
   








[GitHub] [spark] HyukjinKwon commented on pull request #28883: [SPARK-32042][SQL] Support UTF8String literals

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #28883:
URL: https://github.com/apache/spark/pull/28883#issuecomment-647251277


   Seems like that's code you wrote, not code in the OSS Spark codebase. You should convert it to regular Scala types, e.g. via `CatalystTypeConverters.createToScalaConverter`.
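   For reference, a minimal sketch of that kind of conversion (illustrative only; it assumes Spark's internal `UTF8String` as the input value):
   ```scala
   import org.apache.spark.sql.catalyst.CatalystTypeConverters
   import org.apache.spark.sql.types.StringType
   import org.apache.spark.unsafe.types.UTF8String

   // Build a converter from the Catalyst-internal representation to the
   // external Scala type; for StringType this maps UTF8String -> String.
   val toScala = CatalystTypeConverters.createToScalaConverter(StringType)
   val s = toScala(UTF8String.fromString("abc")).asInstanceOf[String]  // "abc"
   ```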






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647250806










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-647250808










[GitHub] [spark] AmplabJenkins commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647250806










[GitHub] [spark] AmplabJenkins commented on pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-647250808










[GitHub] [spark] SparkQA commented on pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure

2020-06-21 Thread GitBox


SparkQA commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-647250445


   **[Test build #124347 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124347/testReport)**
 for PR 28866 at commit 
[`d1742fe`](https://github.com/apache/spark/commit/d1742fe43be3f1f66a557f56ff15299215a59684).






[GitHub] [spark] SparkQA commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


SparkQA commented on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647250427


   **[Test build #124346 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124346/testReport)**
 for PR 28888 at commit 
[`23f3dda`](https://github.com/apache/spark/commit/23f3dda8b67c18716042c670ea5caa88eac68af1).






[GitHub] [spark] yaooqinn commented on pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


yaooqinn commented on pull request #28888:
URL: https://github.com/apache/spark/pull/28888#issuecomment-647249888


   cc @dongjoon-hyun, thanks in advance for reviewing this PR






[GitHub] [spark] yaooqinn opened a new pull request #28888: [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


yaooqinn opened a new pull request #28888:
URL: https://github.com/apache/spark/pull/28888


   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   This PR backports https://github.com/apache/spark/pull/28870, which ports https://issues.apache.org/jira/browse/HIVE-14817 for the Spark Thrift Server.
   
   ### Why are the changes needed?
   
   Port HIVE-14817 to fix related issues
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   no
   ### How was this patch tested?
   
   passing Jenkins






[GitHub] [spark] beliefer commented on a change in pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure

2020-06-21 Thread GitBox


beliefer commented on a change in pull request #28866:
URL: https://github.com/apache/spark/pull/28866#discussion_r443300023



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -405,6 +405,15 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 runEvent(JobCancelled(jobId, None))
   }
 
+  /** Make task set success and check result. */
+  private def checkAnswer(
+taskSet: TaskSet,
+results: Seq[(TaskEndReason, Any)],

Review comment:
   Let's update `results` to `taskEndInfos` for `complete` too.








[GitHub] [spark] beliefer commented on a change in pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure

2020-06-21 Thread GitBox


beliefer commented on a change in pull request #28866:
URL: https://github.com/apache/spark/pull/28866#discussion_r443299866



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -405,6 +405,15 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 runEvent(JobCancelled(jobId, None))
   }
 
+  /** Make task set success and check result. */
+  private def checkAnswer(

Review comment:
   OK
   








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647248573










[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647248573










[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647248244


   **[Test build #124345 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124345/testReport)**
 for PR 28876 at commit 
[`bc159ab`](https://github.com/apache/spark/commit/bc159ab6d5191a9b7e01bafd7a65d39c03eb419c).






[GitHub] [spark] wangyum commented on pull request #28883: [SPARK-32042][SQL] Support UTF8String literals

2020-06-21 Thread GitBox


wangyum commented on pull request #28883:
URL: https://github.com/apache/spark/pull/28883#issuecomment-647247953


   Actually, I use it internally:
   
https://github.com/wangyum/spark/blob/SPARK-27227/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala#L227-L228






[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298986



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.aggregate
 
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers

Review comment:
   yea. :)








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298784



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.aggregate
 
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers

Review comment:
   nit: I guess it can be removed.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647246339










[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647246334










[GitHub] [spark] HyukjinKwon commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647246379


   @kiszk too FYI






[GitHub] [spark] HyukjinKwon commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647246269


   You can create a new one. Can I have some feedback from @rednaxelafx before we go ahead?
   
   @rednaxelafx, does it look good enough to you, or do you have any guidance on the testing for him?






[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298402



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.
+  NormalizeFloatingNumbers.normalize(e) match {
+case ne: NamedExpression => ne
+case other => Alias(other, other.toString)()

Review comment:
   Moved to `SparkStrategies`.








[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298442



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.

Review comment:
   Fixed.








[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-21 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647245944


   Seems a merged PR can't be reopened. Is there any way? If not, I will send another PR for this.






[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647245921


   **[Test build #124344 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124344/testReport)**
 for PR 28876 at commit 
[`e5211c3`](https://github.com/apache/spark/commit/e5211c3538584c2bcb6ebe0569278a01dc313857).






[GitHub] [spark] beliefer commented on a change in pull request #28866: [SPARK-31845][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure

2020-06-21 Thread GitBox


beliefer commented on a change in pull request #28866:
URL: https://github.com/apache/spark/pull/28866#discussion_r443297998



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -405,6 +405,15 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 runEvent(JobCancelled(jobId, None))
   }
 
+  /** Make task set success and check result. */

Review comment:
   You're right; I will change the comments.








[GitHub] [spark] cloud-fan commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


cloud-fan commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443297885



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.
+  NormalizeFloatingNumbers.normalize(e) match {
+case ne: NamedExpression => ne
+case other => Alias(other, other.toString)()

Review comment:
   It looks weird if we do normalization in both `SparkStrategies` and `AggUtils`; better to put it in one place.








[GitHub] [spark] cloud-fan commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


cloud-fan commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443297250



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.

Review comment:
   +1








[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-21 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647244110


    test3 
   ```
   val df1 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c1")
   val df2 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c2")
   val start = System.currentTimeMillis
   df1.join(df2).where("c1 like c2").count()
   // 3 times test
   // before  159226, 159147, 159587
   // after   159641, 160960, 160091
   println(System.currentTimeMillis - start)
   ```
   The worst case is doing the comparison and the compilation for each row, and it seems to cause only a little regression.






[GitHub] [spark] HyukjinKwon commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647242250


   @koertkuipers, are you able to show the exact reproducible steps? Sounds 
like ideally we should file a separate JIRA.






[GitHub] [spark] HyukjinKwon commented on pull request #28883: [SPARK-32042][SQL] Support UTF8String literals

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #28883:
URL: https://github.com/apache/spark/pull/28883#issuecomment-647241272


   Yeah, it's internal. Let's not do this.






[GitHub] [spark] warrenzhu25 edited a comment on pull request #28887: [SPARK-32044][SS] Kakfa continuous processing print mislead initial o…

2020-06-21 Thread GitBox


warrenzhu25 edited a comment on pull request #28887:
URL: https://github.com/apache/spark/pull/28887#issuecomment-647238451


   > Probably, it would be nice to explain the difference between 
Optional.orElse vs Optional.orElseGet for Java a bit in PR description. It's 
very confusing in Scala developers as we expect lazy evaluation natively.
   > 
   > And one more, please rebase the target branch to master. We normally 
receive patches against master branch. If the problem no longer exists in 
master branch, we may want to find and port back the commit which fixed the 
issue previously, instead of picking up newer commit.
   
   1. Updated the description to explain more about Optional.orElse vs Optional.orElseGet (see the sketch below).
   2. In the master branch, KafkaContinuousReader.scala has been refactored out in e75488718; this file exists only in branch-2.4. Any suggestions on how I should proceed?
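   For readers following the thread, a minimal sketch of the `orElse` vs `orElseGet` difference (Scala calling `java.util.Optional`; the names below are illustrative, not from this PR):
   ```scala
   import java.util.Optional

   def expensiveDefault(): String = {
     println("computing default")  // visible side effect to show when it runs
     "fallback"
   }

   val present: Optional[String] = Optional.of("value")
   // orElse evaluates its argument eagerly, even though a value is present:
   present.orElse(expensiveDefault())           // prints "computing default", returns "value"
   // orElseGet takes a Supplier and invokes it only when the Optional is empty:
   present.orElseGet(() => expensiveDefault())  // prints nothing, returns "value"
   ```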






[GitHub] [spark] Ngone51 commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


Ngone51 commented on pull request #28869:
URL: https://github.com/apache/spark/pull/28869#issuecomment-647240697


   > Could you put the explanation into the PR description?
   
   Sure






[GitHub] [spark] koertkuipers commented on pull request #27986: [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled

2020-06-21 Thread GitBox


koertkuipers commented on pull request #27986:
URL: https://github.com/apache/spark/pull/27986#issuecomment-647240428


   @wangyum I have `spark.sql.adaptive.coalescePartitions.enabled=true` and the data size is small.
   How can I see that the step does coalesce? In the number of tasks (I always see 2048)? In the number of output files (I always see 2048)?
   
   I have `spark.sql.shuffle.partitions=2048` and `spark.sql.adaptive.coalescePartitions.initialPartitionNum=2048`.
   
   When I do a groupBy instead of a repartition, the number of tasks varies with data size (and is less than 2048), and the number of output files varies too (and is less than 2048). With repartition, the tasks and output files are always fixed at 2048; see the sketch below.
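   (Not from the thread: a hypothetical repro sketch of the setup described above, assuming a local `SparkSession`; the paths and data size are made up.)
   ```scala
   import org.apache.spark.sql.functions.col

   spark.conf.set("spark.sql.adaptive.enabled", "true")
   spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
   spark.conf.set("spark.sql.shuffle.partitions", "2048")
   spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2048")

   val df = spark.range(0, 1000).toDF("id")
   // groupBy goes through a shuffle that AQE can coalesce, so the final
   // stage should run far fewer than 2048 tasks for this tiny input.
   df.groupBy("id").count().write.mode("overwrite").parquet("/tmp/grouped")
   // Per the report above, repartition(col) keeps the full 2048 tasks/files.
   df.repartition(col("id")).write.mode("overwrite").parquet("/tmp/repartitioned")
   ```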






[GitHub] [spark] dongjoon-hyun commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


dongjoon-hyun commented on pull request #28869:
URL: https://github.com/apache/spark/pull/28869#issuecomment-647240319


   Got it. Thanks for the explanation, @Ngone51 . Could you put the explanation 
into the PR description?






[GitHub] [spark] HyukjinKwon commented on pull request #28885: [SPARK-29375][SPARK-28940][SQL] Whole plan exchange and subquery reuse

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #28885:
URL: https://github.com/apache/spark/pull/28885#issuecomment-647239990


   cc @maryannxue too FYI






[GitHub] [spark] HyukjinKwon closed pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

2020-06-21 Thread GitBox


HyukjinKwon closed pull request #28881:
URL: https://github.com/apache/spark/pull/28881


   






[GitHub] [spark] HyukjinKwon commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647239682


   Closing as a dup






[GitHub] [spark] HyukjinKwon commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-21 Thread GitBox


HyukjinKwon commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-647238811


   cc @tgravescs FYI






[GitHub] [spark] Ngone51 commented on pull request #28869: [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression

2020-06-21 Thread GitBox


Ngone51 commented on pull request #28869:
URL: https://github.com/apache/spark/pull/28869#issuecomment-647238903


   > Thanks. The fix looks good. Do you think we can have a UT failing on 
master/branch-3.0 without this patch, @Ngone51 ?
   
   Before this patch, for an Aggregate operator, its input attributes would always be a superset of (or equal to) its referenced attributes, because the operator referred to its own attributes where it should have referred to the attributes from the child. Therefore, its missing inputs were always empty, and I don't think we could write a UT to cover it.
   
   After we correct the references in this PR, the problem is exposed by the UT of SPARK-31620, since missing inputs are no longer always empty. (For context, `missingInput` is sketched below.)
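   For context, the check at play here is `QueryPlan.missingInput`, which is roughly (a simplified sketch of Spark's definition):
   ```scala
   // Attributes this operator references but neither receives from its
   // children (inputSet) nor produces itself (producedAttributes).
   def missingInput: AttributeSet = references -- inputSet -- producedAttributes
   ```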






[GitHub] [spark] dongjoon-hyun commented on pull request #28870: [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


dongjoon-hyun commented on pull request #28870:
URL: https://github.com/apache/spark/pull/28870#issuecomment-647238927


   Thank you!






[GitHub] [spark] yaooqinn commented on pull request #28870: [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


yaooqinn commented on pull request #28870:
URL: https://github.com/apache/spark/pull/28870#issuecomment-647238559


   OK






[GitHub] [spark] warrenzhu25 commented on pull request #28887: [SPARK-32044][SS] Kakfa continuous processing print mislead initial o…

2020-06-21 Thread GitBox


warrenzhu25 commented on pull request #28887:
URL: https://github.com/apache/spark/pull/28887#issuecomment-647238451


   > Probably, it would be nice to explain the difference between 
Optional.orElse vs Optional.orElseGet for Java a bit in PR description. It's 
very confusing in Scala developers as we expect lazy evaluation natively.
   > 
   > And one more, please rebase the target branch to master. We normally 
receive patches against master branch. If the problem no longer exists in 
master branch, we may want to find and port back the commit which fixed the 
issue previously, instead of picking up newer commit.
   
   1. Updated the description to explain more about Optional.orElse vs Optional.orElseGet.
   2. In the master branch, KafkaContinuousReader.scala no longer exists; this class is only in branch-2.4. Any suggestions on how I should proceed?






[GitHub] [spark] dongjoon-hyun closed pull request #28616: [SPARK-31798][SHUFFLE][API] Shuffle Writer API changes to return custom map output metadata

2020-06-21 Thread GitBox


dongjoon-hyun closed pull request #28616:
URL: https://github.com/apache/spark/pull/28616


   






[GitHub] [spark] dongjoon-hyun commented on pull request #28870: [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown

2020-06-21 Thread GitBox


dongjoon-hyun commented on pull request #28870:
URL: https://github.com/apache/spark/pull/28870#issuecomment-647237258


   If you don't mind, could you make a backporting PR then?






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.
+  NormalizeFloatingNumbers.normalize(e) match {
+case ne: NamedExpression => ne
+case other => Alias(other, other.toString)()

Review comment:
   If we broaden the scope, `SparkStrategies` is already looking at the details of `functionsWithDistinct`, like the following:
   ```scala
   val (functionsWithDistinct, functionsWithoutDistinct) =
     aggregateExpressions.partition(_.isDistinct)
   if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
     // This is a sanity check. We should not reach here when we have multiple distinct
     // column sets. Our `RewriteDistinctAggregates` should take care this case.
     sys.error("You hit a query analyzer bug. Please report your query to " +
       "Spark user mailing list.")
   }
   ```
   
   And the very next lines are the same logic block for `groupingExpressions`:
   ```scala
   // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
   // `groupingExpressions` is not extracted during logical phase.
   val normalizedGroupingExpressions = groupingExpressions.map { e =>
     NormalizeFloatingNumbers.normalize(e) match {
       case n: NamedExpression => n
       case other => Alias(other, e.name)(exprId = e.exprId)
     }
   }
   ```
   
   Given the above, I guess your concern is about only one line of source code: `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
   
   And the following comment is about the definition of `functionsWithDistinct`, which is generated by `SparkStrategies`. So it's not a leaked detail or something hidden from `SparkStrategies`; to me, it seems to be an assumption given by `SparkStrategies`:
   ```
   // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
   // DISTINCT aggregate function, all of those functions will have the same column expressions.
   // For example, it would be valid for functionsWithDistinct to be
   // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
   // disallowed because those two distinct aggregates have different column expressions.
   ```








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.
+  NormalizeFloatingNumbers.normalize(e) match {
+case ne: NamedExpression => ne
+case other => Alias(other, other.toString)()

Review comment:
   If we broaden the scope, `SparkStrategies` is already looking at the details of `functionsWithDistinct`, like the following:
   ```scala
   val (functionsWithDistinct, functionsWithoutDistinct) =
     aggregateExpressions.partition(_.isDistinct)
   if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
     // This is a sanity check. We should not reach here when we have multiple distinct
     // column sets. Our `RewriteDistinctAggregates` should take care this case.
     sys.error("You hit a query analyzer bug. Please report your query to " +
       "Spark user mailing list.")
   }
   ```
   
   And the very next lines are the same logic block for `groupingExpressions`:
   ```scala
   // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
   // `groupingExpressions` is not extracted during logical phase.
   val normalizedGroupingExpressions = groupingExpressions.map { e =>
     NormalizeFloatingNumbers.normalize(e) match {
       case n: NamedExpression => n
       case other => Alias(other, e.name)(exprId = e.exprId)
     }
   }
   ```
   
   Given the above, I guess your concern is about only one line of source code: `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
   
   And the comment is about the definition of `functionsWithDistinct`, which is generated from the above. The comment is not a detail hidden from `SparkStrategies`; to me, it seems to be an assumption given by `SparkStrategies`.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

2020-06-21 Thread GitBox


dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##
@@ -144,11 +145,16 @@ object AggUtils {
 // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
 // disallowed because those two distinct aggregates have different column 
expressions.
 val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-val namedDistinctExpressions = distinctExpressions.map {
-  case ne: NamedExpression => ne
-  case other => Alias(other, other.toString)()
+val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+  // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+  // `groupingExpressions` is not extracted during logical phase.
+  NormalizeFloatingNumbers.normalize(e) match {
+case ne: NamedExpression => ne
+case other => Alias(other, other.toString)()

Review comment:
   If we broaden the scope, `SparkStrategies` is already looking at the details of `functionsWithDistinct`, like the following:
   ```scala
   val (functionsWithDistinct, functionsWithoutDistinct) =
     aggregateExpressions.partition(_.isDistinct)
   if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
     // This is a sanity check. We should not reach here when we have multiple distinct
     // column sets. Our `RewriteDistinctAggregates` should take care this case.
     sys.error("You hit a query analyzer bug. Please report your query to " +
       "Spark user mailing list.")
   }
   ```
   
   And the very next lines are the same logic block for `groupingExpressions`:
   ```scala
   // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
   // `groupingExpressions` is not extracted during logical phase.
   val normalizedGroupingExpressions = groupingExpressions.map { e =>
     NormalizeFloatingNumbers.normalize(e) match {
       case n: NamedExpression => n
       case other => Alias(other, e.name)(exprId = e.exprId)
     }
   }
   ```
   
   Given the above, I guess your concern is about only one line of source code: `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
   
   And the following comment is about the definition of `functionsWithDistinct`, which is generated from the above. The comment is not a detail hidden from `SparkStrategies`; to me, it seems to be an assumption given by `SparkStrategies`:
   ```
   // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
   // DISTINCT aggregate function, all of those functions will have the same column expressions.
   // For example, it would be valid for functionsWithDistinct to be
   // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
   // disallowed because those two distinct aggregates have different column expressions.
   ```








  1   2   3   4   >