[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-659874439







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29128: [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29128:
URL: https://github.com/apache/spark/pull/29128#issuecomment-659874570










[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-659875108


   Addressed all comments except one: I am still keeping the two ratio 
configs separate (SMJ and SHJ). Let me know if I need to change this. 
cc @maropu and @viirya, thanks.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-659874311










[GitHub] [spark] cloud-fan commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


cloud-fan commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r456233068



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
##
@@ -135,7 +136,16 @@ class SessionCatalog(
 
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {

Review comment:
   Ah, that's a good point. We should probably investigate how to design the 
data source API so that sources that don't need to infer a schema can skip this 
cache. As it stands, the JDBC data source is hard to use, because we need to 
run REFRESH TABLE (or wait for the TTL after this PR) once the table is changed 
outside of Spark (which is common for JDBC sources).
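
   To make the TTL idea in this comment concrete, here is a minimal sketch of an expire-after-write cache in plain Scala. It is not the `SessionCatalog` code: `TtlCache`, `ttlMillis`, and the injectable `now` clock are hypothetical names chosen only for illustration, and the real cache in the PR may behave differently.

```scala
import scala.collection.mutable

// Hypothetical expire-after-write cache; NOT Spark's actual tableRelationCache.
// `now` is injectable so expiry can be exercised without sleeping.
final class TtlCache[K, V](ttlMillis: Long, now: () => Long = () => System.currentTimeMillis()) {
  // Each entry stores the value together with the time it was written.
  private val entries = mutable.Map.empty[K, (V, Long)]

  def put(key: K, value: V): Unit = synchronized {
    entries(key) = (value, now())
  }

  // Returns the cached value only while it is younger than the TTL.
  // An expired entry is dropped so that the caller reloads it.
  def get(key: K): Option[V] = synchronized {
    entries.get(key) match {
      case Some((value, writtenAt)) if now() - writtenAt < ttlMillis => Some(value)
      case Some(_) =>
        entries.remove(key)
        None
      case None => None
    }
  }

  // Manual invalidation, similar in spirit to running REFRESH TABLE.
  def invalidate(key: K): Unit = synchronized {
    entries.remove(key)
    ()
  }
}
```

   With a short TTL, a table changed outside of Spark is picked up again after at most `ttlMillis`, at the price of re-resolving the relation more often.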








[GitHub] [spark] AmplabJenkins commented on pull request #29128: [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29128:
URL: https://github.com/apache/spark/pull/29128#issuecomment-659874570










[GitHub] [spark] SparkQA commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


SparkQA commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-659874701


   **[Test build #126032 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126032/testReport)**
 for PR 29079 at commit 
[`d620940`](https://github.com/apache/spark/commit/d6209407731bbed2602c1d6a05c7c50982561faf).






[GitHub] [spark] SparkQA removed a comment on pull request #29128: [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29128:
URL: https://github.com/apache/spark/pull/29128#issuecomment-659810256


   **[Test build #126017 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126017/testReport)**
 for PR 29128 at commit 
[`5f7fe1b`](https://github.com/apache/spark/commit/5f7fe1bd4d9673d52151320f3a4193c313683736).






[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-659874439










[GitHub] [spark] c21 commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456233046



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoinSuite.scala
##
@@ -103,46 +119,69 @@ class CoalesceBucketsInSortMergeJoinSuite extends SQLTestUtils with SharedSparkS
   }
 
   test("bucket coalescing - basic") {
-withSQLConf(SQLConf.COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_ENABLED.key -> "true") {
+withSQLConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED.key -> "true") {
+  run(JoinSetting(
+RelationSetting(4, None), RelationSetting(8, Some(4)), joinOperator = sortMergeJoin))
+  run(JoinSetting(
+RelationSetting(4, None), RelationSetting(8, Some(4)), joinOperator = shuffledHashJoin,
+shjBuildSide = Some(BuildLeft)))
+  // Coalescing bucket should not happen when the target is on shuffled hash join

Review comment:
   @imback82 - yes, extracting this to a new test - `bucket coalescing 
shouldn't be applied to shuffled hash join build side`.
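
   For readers following the thread, the build-side restriction that this new test covers can be sketched as below. This is a simplified illustration under stated assumptions: `Side`, `BuildLeft`, and `BuildRight` are local stand-ins for Spark's build-side values, and `coalescesStreamSide` is a hypothetical helper, not the method in the PR.

```scala
// Local, hypothetical stand-ins for Spark's build-side values.
sealed trait Side
case object BuildLeft extends Side
case object BuildRight extends Side

object StreamSideCheck {
  // The side with MORE buckets is the one that would be coalesced down to the
  // smaller bucket count. Coalescing is allowed only when that side is the
  // stream side, i.e. it is NOT the side the hash table is built from.
  def coalescesStreamSide(numLeftBuckets: Int, numRightBuckets: Int, buildSide: Side): Boolean = {
    val numCoalescedBuckets = math.min(numLeftBuckets, numRightBuckets)
    if (numCoalescedBuckets == numLeftBuckets) {
      buildSide != BuildRight // the right side would shrink, so it must not be the build side
    } else {
      buildSide != BuildLeft // the left side would shrink, so it must not be the build side
    }
  }
}
```

   With 4 buckets on the left and 8 on the right, the sketch allows coalescing for `BuildLeft` and rejects it for `BuildRight`, which matches the two shuffled hash join cases in the test above.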








[GitHub] [spark] ulysses-you opened a new pull request #29142: [SPARK-29343][SQL][FOLLOW-UP] Add more aggregate function to support eliminate sorts.

2020-07-16 Thread GitBox


ulysses-you opened a new pull request #29142:
URL: https://github.com/apache/spark/pull/29142


   
   
   ### What changes were proposed in this pull request?
   
   Add more aggregate functions and make these cases support sort elimination.
   
   ### Why are the changes needed?
   
   Make `EliminateSorts` match more cases.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. When a case matches, the user will see a different execution plan.
   
   ### How was this patch tested?
   
   Not needed.






[GitHub] [spark] AmplabJenkins commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-659874311










[GitHub] [spark] SparkQA commented on pull request #29128: [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-16 Thread GitBox


SparkQA commented on pull request #29128:
URL: https://github.com/apache/spark/pull/29128#issuecomment-659873878


   **[Test build #126017 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126017/testReport)**
 for PR 29128 at commit 
[`5f7fe1b`](https://github.com/apache/spark/commit/5f7fe1bd4d9673d52151320f3a4193c313683736).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] c21 commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456232826



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoinSuite.scala
##
@@ -178,7 +235,16 @@ class CoalesceBucketsInSortMergeJoinSuite extends SQLTestUtils with SharedSparkS
 rightKeys = rCols.reverse,
 leftRelation = lRel,
 rightRelation = RelationSetting(rCols, 8, Some(4)),
-isSortMergeJoin = true))
+joinOperator = sortMergeJoin,
+shjBuildSide = None))
+
+  run(JoinSetting(
+leftKeys = lCols.reverse,
+rightKeys = rCols.reverse,
+leftRelation = lRel,
+rightRelation = RelationSetting(rCols, 8, Some(4)),
+joinOperator = shuffledHashJoin,
+shjBuildSide = Some(BuildLeft)))

Review comment:
   @imback82 - updated.








[GitHub] [spark] c21 commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456232773



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala
##
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.bucketing
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.catalog.BucketSpec
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight}
+import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.{FileSourceScanExec, FilterExec, ProjectExec, SparkPlan}
+import org.apache.spark.sql.execution.joins.{BaseJoinExec, ShuffledHashJoinExec, SortMergeJoinExec}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule coalesces one side of the `SortMergeJoin` and `ShuffledHashJoin`
+ * if the following conditions are met:
+ *   - Two bucketed tables are joined.
+ *   - Join keys match with output partition expressions on their respective sides.
+ *   - The larger bucket number is divisible by the smaller bucket number.
+ *   - COALESCE_BUCKETS_IN_JOIN_ENABLED is set to true.
+ *   - The ratio of the number of buckets is less than the value set in
+ * COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_MAX_BUCKET_RATIO (`SortMergeJoin`) or,
+ * COALESCE_BUCKETS_IN_SHUFFLED_HASH_JOIN_MAX_BUCKET_RATIO (`ShuffledHashJoin`).
+ */
+case class CoalesceBucketsInJoin(conf: SQLConf) extends Rule[SparkPlan] {
+  private def updateNumCoalescedBuckets(
+  join: BaseJoinExec,
+  numLeftBuckets: Int,
+  numRightBucket: Int,
+  numCoalescedBuckets: Int): BaseJoinExec = {
+if (numCoalescedBuckets != numLeftBuckets) {
+  val leftCoalescedChild = join.left transformUp {
+case f: FileSourceScanExec =>
+  f.copy(optionalNumCoalescedBuckets = Some(numCoalescedBuckets))
+  }
+  join match {
+case j: SortMergeJoinExec => j.copy(left = leftCoalescedChild)
+case j: ShuffledHashJoinExec => j.copy(left = leftCoalescedChild)
+  }
+} else {
+  val rightCoalescedChild = join.right transformUp {
+case f: FileSourceScanExec =>
+  f.copy(optionalNumCoalescedBuckets = Some(numCoalescedBuckets))
+  }
+  join match {
+case j: SortMergeJoinExec => j.copy(right = rightCoalescedChild)
+case j: ShuffledHashJoinExec => j.copy(right = rightCoalescedChild)
+  }
+}
+  }
+
+  private def isCoalesceSHJStreamSide(
+  join: ShuffledHashJoinExec,
+  numLeftBuckets: Int,
+  numRightBucket: Int,
+  numCoalescedBuckets: Int): Boolean = {
+if (numCoalescedBuckets == numLeftBuckets) {
+  join.buildSide != BuildRight
+} else {
+  join.buildSide != BuildLeft
+}
+  }
+
+  def apply(plan: SparkPlan): SparkPlan = {
+if (!conf.coalesceBucketsInJoinEnabled) {
+  return plan
+}
+
+plan transform {
+  case ExtractJoinWithBuckets(join, numLeftBuckets, numRightBuckets) =>
+val bucketRatio = math.max(numLeftBuckets, numRightBuckets) /
+  math.min(numLeftBuckets, numRightBuckets)
+val numCoalescedBuckets = math.min(numLeftBuckets, numRightBuckets)
+join match {
+  case j: SortMergeJoinExec
+if bucketRatio <= conf.coalesceBucketsInSortMergeJoinMaxBucketRatio =>
+updateNumCoalescedBuckets(j, numLeftBuckets, numRightBuckets, numCoalescedBuckets)
+  case j: ShuffledHashJoinExec
+// Only coalesce the buckets for shuffled hash join stream side,
+// to avoid OOM for build side.
+if bucketRatio <= conf.coalesceBucketsInShuffledHashJoinMaxBucketRatio &&
+  isCoalesceSHJStreamSide(j, numLeftBuckets, numRightBuckets, numCoalescedBuckets) =>
+updateNumCoalescedBuckets(j, numLeftBuckets, numRightBuckets, numCoalescedBuckets)
+  case other => other
+}
+  case other => other
+}
+  }
+}
+
+/**
+ * An extractor that extracts 

[GitHub] [spark] c21 commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456232703



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoinSuite.scala
##
@@ -103,46 +119,69 @@ class CoalesceBucketsInSortMergeJoinSuite extends SQLTestUtils with SharedSparkS
   }
 
   test("bucket coalescing - basic") {
-withSQLConf(SQLConf.COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_ENABLED.key -> "true") {
+withSQLConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED.key -> "true") {
+  run(JoinSetting(
+RelationSetting(4, None), RelationSetting(8, Some(4)), joinOperator = sortMergeJoin))
+  run(JoinSetting(
+RelationSetting(4, None), RelationSetting(8, Some(4)), joinOperator = shuffledHashJoin,
+shjBuildSide = Some(BuildLeft)))
+  // Coalescing bucket should not happen when the target is on shuffled hash join
+  // build side.
   run(JoinSetting(
-RelationSetting(4, None), RelationSetting(8, Some(4)), isSortMergeJoin = true))
+RelationSetting(4, None), RelationSetting(8, None), joinOperator = shuffledHashJoin,
+shjBuildSide = Some(BuildRight)))
 }
-withSQLConf(SQLConf.COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_ENABLED.key -> "false") {
-  run(JoinSetting(RelationSetting(4, None), RelationSetting(8, None), isSortMergeJoin = true))
+withSQLConf(SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED.key -> "false") {
+  run(JoinSetting(
+RelationSetting(4, None), RelationSetting(8, None), joinOperator = broadcastHashJoin))

Review comment:
   @cloud-fan - updated with an extra test for SMJ.








[GitHub] [spark] zsxwing commented on pull request #29131: [SPARK-32321][SS] Remove KAFKA-7703 workaround

2020-07-16 Thread GitBox


zsxwing commented on pull request #29131:
URL: https://github.com/apache/spark/pull/29131#issuecomment-659873692


   Thanks for raising the PR. Could you clarify what the cost of keeping this is?
   
   I believe KAFKA-7703 has been fixed, since you have verified it using my 
reproduction code. However, I'd be more conservative. Although I did report 
KAFKA-7703, I didn't have any evidence that this was exactly the issue we hit 
in production, or that it was the only possible issue. Unfortunately, there 
were not enough logs to prove it. What I know is that the workaround we patched 
into Spark did prevent the Kafka consumer from reporting incorrect offsets, but 
it could hide other potential unknown issues.
   
   Currently there is no Spark release using Kafka 2.5.0, so I don't feel 
confident that there are no other unknown issues causing the same incorrect 
offset issue. If the cost of keeping this workaround is minor, can we wait 
until a Spark release using Kafka 2.5.0 has been out for a while? Once such a 
release is available and people start to use it, I can look at our internal 
logs to see whether the warning log in `fetchLatestOffsets` is really gone, 
which would be evidence that KAFKA-7703 is likely the only issue.
   






[GitHub] [spark] c21 commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456232535



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala
##
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.bucketing
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.catalog.BucketSpec
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight}
+import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.{FileSourceScanExec, FilterExec, ProjectExec, SparkPlan}
+import org.apache.spark.sql.execution.joins.{BaseJoinExec, ShuffledHashJoinExec, SortMergeJoinExec}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule coalesces one side of the `SortMergeJoin` and `ShuffledHashJoin`
+ * if the following conditions are met:
+ *   - Two bucketed tables are joined.
+ *   - Join keys match with output partition expressions on their respective sides.
+ *   - The larger bucket number is divisible by the smaller bucket number.
+ *   - COALESCE_BUCKETS_IN_JOIN_ENABLED is set to true.
+ *   - The ratio of the number of buckets is less than the value set in
+ * COALESCE_BUCKETS_IN_SORT_MERGE_JOIN_MAX_BUCKET_RATIO (`SortMergeJoin`) or,
+ * COALESCE_BUCKETS_IN_SHUFFLED_HASH_JOIN_MAX_BUCKET_RATIO (`ShuffledHashJoin`).
+ */
+case class CoalesceBucketsInJoin(conf: SQLConf) extends Rule[SparkPlan] {
+  private def updateNumCoalescedBuckets(
+  join: BaseJoinExec,
+  numLeftBuckets: Int,
+  numRightBucket: Int,
+  numCoalescedBuckets: Int): BaseJoinExec = {
+if (numCoalescedBuckets != numLeftBuckets) {
+  val leftCoalescedChild = join.left transformUp {
+case f: FileSourceScanExec =>
+  f.copy(optionalNumCoalescedBuckets = Some(numCoalescedBuckets))
+  }

Review comment:
   @maropu - sure. updated.
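
   As a reading aid, the numeric conditions listed in the rule's doc comment can be sketched as a small standalone function. This is an illustration only: `coalescedBuckets` and `maxBucketRatio` are hypothetical names, the `<=` comparison follows the rule body quoted elsewhere in this thread rather than the "less than" wording of the doc comment, and treating equal bucket counts as "nothing to coalesce" is an assumption.

```scala
object CoalesceEligibility {
  // Returns the coalesced bucket count when the two bucket counts qualify, else None.
  // `maxBucketRatio` is a hypothetical stand-in for the per-join-type ratio config.
  def coalescedBuckets(numLeft: Int, numRight: Int, maxBucketRatio: Int): Option[Int] = {
    val large = math.max(numLeft, numRight)
    val small = math.min(numLeft, numRight)
    val eligible =
      small > 0 &&
        large != small && // assumption: equal counts need no coalescing
        large % small == 0 && // the larger count must be divisible by the smaller
        large / small <= maxBucketRatio // the ratio must stay within the configured limit
    if (eligible) Some(small) else None
  }
}
```

   For example, 4 and 8 buckets with a limit of 4 yield a coalesced count of 4, while 4 and 6 (not divisible) or 4 and 32 (ratio 8) yield no coalescing.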








[GitHub] [spark] c21 commented on a change in pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on a change in pull request #29079:
URL: https://github.com/apache/spark/pull/29079#discussion_r456232607



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoinSuite.scala
##
@@ -19,17 +19,21 @@ package org.apache.spark.sql.execution.bucketing
 
 import org.apache.spark.sql.catalyst.catalog.BucketSpec
 import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
-import org.apache.spark.sql.catalyst.optimizer.BuildLeft
+import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide}
 import org.apache.spark.sql.catalyst.plans.Inner
 import org.apache.spark.sql.execution.{BinaryExecNode, FileSourceScanExec, SparkPlan}
 import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, InMemoryFileIndex, PartitionSpec}
 import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
-import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, SortMergeJoinExec}
+import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, ShuffledHashJoinExec, SortMergeJoinExec}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
 import org.apache.spark.sql.types.{IntegerType, StructType}
 
-class CoalesceBucketsInSortMergeJoinSuite extends SQLTestUtils with SharedSparkSession {
+class CoalesceBucketsInJoinSuite extends SQLTestUtils with SharedSparkSession {
+  private val sortMergeJoin = "sortMergeJoin"

Review comment:
   @cloud-fan - sure. updated.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-16 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456232227



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath.
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses session_path unless it's emtpy, otherwise uses scratchDir.
+val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+val path = new Path(mrScratchDir, "-mr-1")
+val scheme = Option(path.toUri.getScheme).getOrElse("")
+if (scheme.equals("file")) {
+  logWarning(s"Temporary data will be written into a local file system " +
+s"(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
+s"you might need to configure 'hive.exec.scratchdir' " +
+s"to use accessible file system (e.g. HDFS path) from any executors in the cluster.")

Review comment:
   Removed the leading `s`. BTW, a lot of existing code includes it, but I 
left that as it is.
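
   For context on the leading `s`: in Scala, the `s` interpolator is only needed on string fragments that contain `$` placeholders. The sketch below mirrors the shape of the warning message with shortened text; `InterpolatorDemo` and its values are hypothetical and exist only for illustration.

```scala
object InterpolatorDemo {
  val scheme = "file"
  val path = "/tmp/hive"

  // Only the middle fragment contains `$` placeholders, so only it needs `s`.
  val message: String =
    "Temporary data will be written into a local file system " +
      s"(scheme: '$scheme', path: '$path'). " +
      "You might need to configure 'hive.exec.scratchdir'."
}
```

   Dropping the prefix on the plain fragments does not change the resulting string.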








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-16 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456231877



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath.
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses session_path unless it's emtpy, otherwise uses scratchDir.
+val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+val path = new Path(mrScratchDir, "-mr-1")
+val scheme = Option(path.toUri.getScheme).getOrElse("")
+if (scheme.equals("file")) {
+  logWarning(s"Temporary data will be written into a local file system " +
+s"(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
+s"you might need to configure 'hive.exec.scratchdir' " +
+s"to use accessible file system (e.g. HDFS path) from any executors in the cluster.")
+}
+path
+  }
+
+  private def supportSchemeToUseNonBlobStore(path: Path): Boolean = {
+path != null && {
+  val supportedBlobSchemes = SQLConf.get.supportedSchemesToUseNonBlobstore
+  val scheme = Option(path.toUri.getScheme).getOrElse("")
+  Utils.stringToSeq(supportedBlobSchemes).contains(scheme.toLowerCase(Locale.ROOT))
+}
+  }
+
+  def getExternalTmpPath(
   sparkSession: SparkSession,
   hadoopConf: Configuration,
   path: Path): Path = {
 import org.apache.spark.sql.hive.client.hive._
-

Review comment:
   Thanks for pointing it out. Reverted.
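
For readers following the diff above, the scheme check in `supportSchemeToUseNonBlobStore` can be sketched outside Spark roughly as follows. This is only an illustration: `SchemeCheckSketch`, `parseSchemes`, and `usesListedScheme` are hypothetical names, the comma-separated string stands in for the value the PR reads from `SQLConf`, and `java.net.URI` stands in for Hadoop's `Path`.

```scala
import java.net.URI
import java.util.Locale

object SchemeCheckSketch {
  // Hypothetical stand-in for the comma-separated config value read via SQLConf in the diff.
  def parseSchemes(csv: String): Seq[String] =
    csv.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSeq

  // True when the URI's scheme, compared case-insensitively, is in the configured list.
  // A null URI or a URI without a scheme never matches.
  def usesListedScheme(uri: URI, schemesCsv: String): Boolean =
    uri != null && {
      val scheme = Option(uri.getScheme).getOrElse("").toLowerCase(Locale.ROOT)
      parseSchemes(schemesCsv).contains(scheme)
    }
}
```

Lower-casing with `Locale.ROOT` keeps the comparison independent of the JVM's default locale, which appears to be the intent of the same call in the diff.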





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #29101: [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions

2020-07-16 Thread GitBox


cloud-fan commented on pull request #29101:
URL: https://github.com/apache/spark/pull/29101#issuecomment-659872039


   LGTM. It's a much simpler and more robust solution!






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659870258










[GitHub] [spark] SparkQA commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659870428


   **[Test build #126031 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126031/testReport)**
 for PR 29117 at commit 
[`d7974a4`](https://github.com/apache/spark/commit/d7974a4d58bd51f99d6c010ac536e63a5094fbf3).






[GitHub] [spark] SparkQA removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659848866


   **[Test build #126028 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126028/testReport)**
 for PR 29117 at commit 
[`d7974a4`](https://github.com/apache/spark/commit/d7974a4d58bd51f99d6c010ac536e63a5094fbf3).






[GitHub] [spark] AmplabJenkins commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659870258










[GitHub] [spark] SparkQA commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659870049


   **[Test build #126028 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126028/testReport)**
 for PR 29117 at commit 
[`d7974a4`](https://github.com/apache/spark/commit/d7974a4d58bd51f99d6c010ac536e63a5094fbf3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] venkata91 edited a comment on pull request #28994: [SPARK-32170][CORE] Improve the speculation for the inefficient tasks by the task metrics.

2020-07-16 Thread GitBox


venkata91 edited a comment on pull request #28994:
URL: https://github.com/apache/spark/pull/28994#issuecomment-659869847


   This is an interesting idea and a good start. Just considering the runTime 
of a task alone might not be useful in many cases. Thanks!






[GitHub] [spark] venkata91 commented on pull request #28994: [SPARK-32170][CORE] Improve the speculation for the inefficient tasks by the task metrics.

2020-07-16 Thread GitBox


venkata91 commented on pull request #28994:
URL: https://github.com/apache/spark/pull/28994#issuecomment-659869847


   This is an interesting idea and a good start. Just considering the runTime 
of a task alone might not be useful in many cases.






[GitHub] [spark] venkata91 commented on a change in pull request #28994: [SPARK-32170][CORE] Improve the speculation for the inefficient tasks by the task metrics.

2020-07-16 Thread GitBox


venkata91 commented on a change in pull request #28994:
URL: https://github.com/apache/spark/pull/28994#discussion_r456228668



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -1125,6 +1142,78 @@ private[spark] class TaskSetManager(
   def executorAdded(): Unit = {
 recomputeLocality()
   }
+
+  /**
+   * A class for checking inefficient tasks to be speculated, the inefficient tasks come from
+   * the tasks which may be speculated by the previous strategy.
+   */
+  private class InefficientTask {
+private var taskData: Map[Long, TaskData] = null
+private var successTaskProgress = 0.0
+private val checkInefficientTask = speculationTaskMinDuration > 0
+
+if (checkInefficientTask) {
+  val appStatusStore = sched.sc.statusTracker.getAppStatusStore
+  if (appStatusStore != null) {
+successTaskProgress =
+  computeSuccessTaskProgress(taskSet.stageId, taskSet.stageAttemptId, appStatusStore)
+val stageData = appStatusStore.stageAttempt(taskSet.stageId, taskSet.stageAttemptId, true)
+if (stageData != null) {
+  taskData = stageData._1.tasks.orNull
+}
+  }
+}
+
+private def computeSuccessTaskProgress(stageId: Int, stageAttemptId: Int,
+  appStatusStore: AppStatusStore): Double = {
+  var sumInputRecords, sumShuffleReadRecords, sumExecutorRunTime = 0.0
+  appStatusStore.taskList(stageId, stageAttemptId, Int.MaxValue).filter {
+_.status == "SUCCESS"
+  }.map(_.taskMetrics).filter(_.isDefined).map(_.get).foreach { task =>
+if (task.inputMetrics != null) {
+  sumInputRecords += task.inputMetrics.recordsRead
+}

Review comment:
   How about `recordsWritten`? Should that also be counted towards progress, 
and likewise `shuffleRecordsWritten`?
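
To make the question concrete, here is a minimal sketch of the progress-rate calculation from the diff, with an optional switch for counting written records. Everything in it is an assumption for illustration: `TaskSample` is a hypothetical flattened view of the task metrics, and `includeWrites` is not part of the PR.

```scala
object ProgressRateSketch {
  // Hypothetical, simplified view of one successful task's metrics.
  final case class TaskSample(
      inputRecordsRead: Long,
      shuffleRecordsRead: Long,
      outputRecordsWritten: Long,
      shuffleRecordsWritten: Long,
      executorRunTimeMs: Long)

  // Records processed per second across all samples; 0.0 when no run time was recorded.
  // `includeWrites` is an assumption that illustrates the reviewer's question.
  def progressRate(samples: Seq[TaskSample], includeWrites: Boolean): Double = {
    val reads = samples.map(s => s.inputRecordsRead + s.shuffleRecordsRead).sum.toDouble
    val writes =
      if (includeWrites) {
        samples.map(s => s.outputRecordsWritten + s.shuffleRecordsWritten).sum.toDouble
      } else {
        0.0
      }
    val runTimeSec = samples.map(_.executorRunTimeMs).sum / 1000.0
    if (runTimeSec > 0) (reads + writes) / runTimeSec else 0.0
  }
}
```

With reads only, a stage whose tasks mostly write (for example, a generator feeding a large output) would report a rate near zero, which is the case the question is probing.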

##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -1125,6 +1142,78 @@ private[spark] class TaskSetManager(
   def executorAdded(): Unit = {
 recomputeLocality()
   }
+
+  /**
+   * A class for checking inefficient tasks to be speculated, the inefficient tasks come from
+   * the tasks which may be speculated by the previous strategy.
+   */
+  private class InefficientTask {
+private var taskData: Map[Long, TaskData] = null
+private var successTaskProgress = 0.0
+private val checkInefficientTask = speculationTaskMinDuration > 0
+
+if (checkInefficientTask) {
+  val appStatusStore = sched.sc.statusTracker.getAppStatusStore
+  if (appStatusStore != null) {
+successTaskProgress =
+  computeSuccessTaskProgress(taskSet.stageId, taskSet.stageAttemptId, appStatusStore)
+val stageData = appStatusStore.stageAttempt(taskSet.stageId, taskSet.stageAttemptId, true)
+if (stageData != null) {
+  taskData = stageData._1.tasks.orNull
+}
+  }
+}
+
+private def computeSuccessTaskProgress(stageId: Int, stageAttemptId: Int,
+  appStatusStore: AppStatusStore): Double = {
+  var sumInputRecords, sumShuffleReadRecords, sumExecutorRunTime = 0.0
+  appStatusStore.taskList(stageId, stageAttemptId, Int.MaxValue).filter {
+_.status == "SUCCESS"
+  }.map(_.taskMetrics).filter(_.isDefined).map(_.get).foreach { task =>
+if (task.inputMetrics != null) {
+  sumInputRecords += task.inputMetrics.recordsRead
+}
+if (task.shuffleReadMetrics != null) {
+  sumShuffleReadRecords += task.shuffleReadMetrics.recordsRead
+}
+sumExecutorRunTime += task.executorRunTime
+  }
+  if (sumExecutorRunTime > 0) {
+(sumInputRecords + sumShuffleReadRecords) / (sumExecutorRunTime / 1000.0)
+  } else 0
+}
+
+def maySpeculateTask(tid: Long, runtimeMs: Long, taskInfo: TaskInfo): Boolean = {
+  // note: 1) only check inefficient tasks when 'SPECULATION_TASK_DURATION_THRESHOLD' > 0.
+  // 2) some tasks may have neither input records nor shuffleRead records, so
+  // the 'successTaskProgress' may be zero all the time, this case we should not consider,
+  // eg: some spark-sql like that 'msck repair table' or 'drop table' and so on.
+  if (!checkInefficientTask || successTaskProgress <= 0) {
+true
+  } else if (runtimeMs < speculationTaskMinDuration) {
+false
+  } else if (taskData != null && taskData.contains(tid) && taskData(tid) != null &&
+taskData(tid).taskMetrics.isDefined) {
+val taskMetrics = taskData(tid).taskMetrics.get
+val currentTaskProgressRate = (taskMetrics.inputMetrics.recordsRead +

Review comment:
   Would it make sense to add taskProgress as part of taskMetrics, so that 
it can also be shown in the Spark UI? Although taskProgress for tasks that 
don't involve input/output/shuffle records would be hard to measure.
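
The branch structure of `maySpeculateTask` quoted above can be sketched as a pure function, which also shows where a per-task progress value would plug in. The first two branches follow the quoted diff. The final comparison is cut off in the quoted hunk, so the `slowdownFactor` threshold and the handling of missing metrics below are purely hypothetical.

```scala
object SpeculationDecisionSketch {
  // `taskRecordsRead` is None when no metrics are available for the task yet.
  // `slowdownFactor` is a hypothetical threshold, not a value taken from the PR.
  def maySpeculate(
      checkEnabled: Boolean,
      successRate: Double,
      runtimeMs: Long,
      minDurationMs: Long,
      taskRecordsRead: Option[Long],
      slowdownFactor: Double = 0.5): Boolean = {
    if (!checkEnabled || successRate <= 0) {
      true // no usable baseline: defer to the existing speculation strategy
    } else if (runtimeMs < minDurationMs) {
      false // the task has not run long enough to be judged
    } else {
      taskRecordsRead match {
        case Some(records) =>
          val rate = records / (runtimeMs / 1000.0) // records per second
          rate < successRate * slowdownFactor
        case None =>
          true // assumption: missing metrics should not block speculation
      }
    }
  }
}
```

If the rate were stored in the task metrics, as suggested, the `Some(records)` branch would read that stored value instead of recomputing it.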

##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29120:
URL: https://github.com/apache/spark/pull/29120#issuecomment-659862630


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126021/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29141: [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29141:
URL: https://github.com/apache/spark/pull/29141#issuecomment-659852368










[GitHub] [spark] SparkQA removed a comment on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29120:
URL: https://github.com/apache/spark/pull/29120#issuecomment-659819077


   **[Test build #126021 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126021/testReport)**
 for PR 29120 at commit 
[`e56f5d4`](https://github.com/apache/spark/commit/e56f5d4936fc8105d672fea5fe8ae441b7de0f2b).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29120:
URL: https://github.com/apache/spark/pull/29120#issuecomment-659862619


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29120:
URL: https://github.com/apache/spark/pull/29120#issuecomment-659862619










[GitHub] [spark] SparkQA commented on pull request #29141: [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value

2020-07-16 Thread GitBox


SparkQA commented on pull request #29141:
URL: https://github.com/apache/spark/pull/29141#issuecomment-659862760


   **[Test build #126030 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126030/testReport)**
 for PR 29141 at commit 
[`3210002`](https://github.com/apache/spark/commit/321000236e5571545912af0db1c02a3fa06f1a9a).






[GitHub] [spark] SparkQA commented on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join

2020-07-16 Thread GitBox


SparkQA commented on pull request #29120:
URL: https://github.com/apache/spark/pull/29120#issuecomment-659862399


   **[Test build #126021 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126021/testReport)**
 for PR 29120 at commit 
[`e56f5d4`](https://github.com/apache/spark/commit/e56f5d4936fc8105d672fea5fe8ae441b7de0f2b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] gengliangwang edited a comment on pull request #29133: [SPARK-32253][INFRA] Show errors only for the sbt tests of github actions

2020-07-16 Thread GitBox


gengliangwang edited a comment on pull request #29133:
URL: https://github.com/apache/spark/pull/29133#issuecomment-659821378


   @HyukjinKwon I have updated the PR description.
   Meanwhile, I created a PR on my repo to see what the test failure log will 
look like: https://github.com/gengliangwang/spark/pull/6
   
   Here is an example of failed log output: 
https://github.com/gengliangwang/spark/pull/6/checks?check_run_id=880362871
   






[GitHub] [spark] dongjoon-hyun commented on pull request #29141: [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value

2020-07-16 Thread GitBox


dongjoon-hyun commented on pull request #29141:
URL: https://github.com/apache/spark/pull/29141#issuecomment-659859923


   Thank you, @cloud-fan !






[GitHub] [spark] xuanyuanking edited a comment on pull request #28977: [WIP] Add all hive.execution suite in the parallel test group

2020-07-16 Thread GitBox


xuanyuanking edited a comment on pull request #28977:
URL: https://github.com/apache/spark/pull/28977#issuecomment-659327525


   Summary for separating all `hive.execution` suites
   
   Test | Worker | Scala test time
   --- | --- | ---
   https://github.com/apache/spark/pull/28977#issuecomment-659309943 | worker-03 | s
   https://github.com/apache/spark/pull/28977#issuecomment-659486466 | worker-04 | 8403s






[GitHub] [spark] SparkQA commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


SparkQA commented on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659856381


   **[Test build #126029 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126029/testReport)**
 for PR 29032 at commit 
[`9356fac`](https://github.com/apache/spark/commit/9356facb887328a2e781f46dc533f41eb6751392).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659856207










[GitHub] [spark] AmplabJenkins commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659856015










[GitHub] [spark] AmplabJenkins commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659856207










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659856015










[GitHub] [spark] SparkQA removed a comment on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659746319


   **[Test build #126007 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126007/testReport)**
 for PR 28840 at commit 
[`94fa132`](https://github.com/apache/spark/commit/94fa132ca4d58f631cc7666e25b126bc28c7f34e).






[GitHub] [spark] SparkQA commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


SparkQA commented on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659855587


   **[Test build #126007 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126007/testReport)**
 for PR 28840 at commit 
[`94fa132`](https://github.com/apache/spark/commit/94fa132ca4d58f631cc7666e25b126bc28c7f34e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] HyukjinKwon commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


HyukjinKwon commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659855366


   retest this please






[GitHub] [spark] viirya commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


viirya commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r456219067



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
##
@@ -135,7 +136,16 @@ class SessionCatalog(
 
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {

Review comment:
   Hmm, I think this cache is still useful for avoiding inferring schema 
again. This is also an expensive operation.
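
As a rough illustration of the trade-off under discussion, the following sketch shows a cache with a time-to-live: a cached value (for example, a plan with an inferred schema) is reused until it is older than the TTL, and only then recomputed. It is not the cache implementation used in the PR; `TtlCache`, `getOrLoad`, and the injectable `nowMs` clock are hypothetical names chosen for this example.

```scala
import scala.collection.mutable

// Minimal TTL cache sketch. `nowMs` is injectable so that expiry can be exercised
// deterministically; by default it reads the wall clock.
final class TtlCache[K, V](ttlMs: Long, nowMs: () => Long = () => System.currentTimeMillis()) {
  // Each entry stores the value together with the time it was written.
  private val entries = mutable.Map.empty[K, (V, Long)]

  // Returns the cached value if it is younger than the TTL; otherwise runs `load`,
  // stores the result with the current timestamp, and returns it.
  def getOrLoad(key: K)(load: => V): V = synchronized {
    val now = nowMs()
    entries.get(key) match {
      case Some((value, writtenAt)) if now - writtenAt < ttlMs => value
      case _ =>
        val value = load
        entries.update(key, (value, now))
        value
    }
  }
}
```

Within the TTL window the expensive `load` (schema inference, in the scenario described above) runs at most once per key, which is the benefit the comment argues for keeping.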








[GitHub] [spark] AmplabJenkins commented on pull request #29141: [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29141:
URL: https://github.com/apache/spark/pull/29141#issuecomment-659852368










[GitHub] [spark] cloud-fan commented on pull request #29141: [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value

2020-07-16 Thread GitBox


cloud-fan commented on pull request #29141:
URL: https://github.com/apache/spark/pull/29141#issuecomment-659852057


   cc @dongjoon-hyun 






[GitHub] [spark] cloud-fan opened a new pull request #29141: [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value

2020-07-16 Thread GitBox


cloud-fan opened a new pull request #29141:
URL: https://github.com/apache/spark/pull/29141


   backport https://github.com/apache/spark/pull/29125






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659850454










[GitHub] [spark] AmplabJenkins commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659850454










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659849344










[GitHub] [spark] AmplabJenkins commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659849344










[GitHub] [spark] SparkQA commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659848866


   **[Test build #126028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126028/testReport)** for PR 29117 at commit [`d7974a4`](https://github.com/apache/spark/commit/d7974a4d58bd51f99d6c010ac536e63a5094fbf3).






[GitHub] [spark] cloud-fan closed pull request #29140: [SPARK-32145][SQL][FOLLOWUP] Fix type in the error log of SparkOperation

2020-07-16 Thread GitBox


cloud-fan closed pull request #29140:
URL: https://github.com/apache/spark/pull/29140


   






[GitHub] [spark] cloud-fan commented on pull request #29140: [SPARK-32145][SQL][FOLLOWUP] Fix type in the error log of SparkOperation

2020-07-16 Thread GitBox


cloud-fan commented on pull request #29140:
URL: https://github.com/apache/spark/pull/29140#issuecomment-659848160


   thanks, merging to master!






[GitHub] [spark] cloud-fan commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


cloud-fan commented on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659846826


   retest this please






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29015:
URL: https://github.com/apache/spark/pull/29015#issuecomment-659845800










[GitHub] [spark] AmplabJenkins commented on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29015:
URL: https://github.com/apache/spark/pull/29015#issuecomment-659845800










[GitHub] [spark] cloud-fan commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


cloud-fan commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r456214684



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
##
@@ -135,7 +136,16 @@ class SessionCatalog(
 
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {

Review comment:
   For external data sources, it's common that data are changed outside of 
Spark. I think it's more important to make sure we get the latest data in a new 
query. Maybe we should disable this relation cache by default.
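
To illustrate the trade-off raised in this comment, here is a minimal sketch of a time-to-live cache that uses only the Scala standard library. It is not the `tableRelationCache` implementation from the PR: `TtlCache`, `ttlMillis`, and the injectable `clock` are hypothetical names chosen for illustration. In this sketch a TTL of zero forces a reload on every lookup, which approximates the "disabled by default" behaviour suggested above.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: a tiny time-to-live cache.
// `ttlMillis` and `clock` are assumptions made for this illustration.
final class TtlCache[K, V](
    ttlMillis: Long,
    clock: () => Long = () => System.currentTimeMillis()) {

  // Each entry remembers the time at which it was written.
  private val entries = mutable.Map.empty[K, (V, Long)]

  // Returns the cached value while it is fresh; otherwise reloads and stores it.
  def getOrLoad(key: K)(load: => V): V = synchronized {
    val now = clock()
    entries.get(key) match {
      case Some((value, writtenAt)) if now - writtenAt < ttlMillis => value
      case _ =>
        val fresh = load
        entries.update(key, (fresh, now))
        fresh
    }
  }
}
```

A longer TTL saves repeated metadata lookups at the cost of possibly serving stale data when the source changes outside the process; a TTL of zero always sees the latest data but never reuses an entry.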








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659844495


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126025/
   Test FAILed.






[GitHub] [spark] SparkQA commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


SparkQA commented on pull request #28852:
URL: https://github.com/apache/spark/pull/28852#issuecomment-659844682


   **[Test build #126027 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126027/testReport)** for PR 28852 at commit [`3e761dc`](https://github.com/apache/spark/commit/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5).






[GitHub] [spark] SparkQA removed a comment on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29015:
URL: https://github.com/apache/spark/pull/29015#issuecomment-659773878


   **[Test build #126012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126012/testReport)** for PR 29015 at commit [`31b231e`](https://github.com/apache/spark/commit/31b231e1b0a984ebdfc408beedaadeec6881ddff).






[GitHub] [spark] SparkQA commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-65984


   **[Test build #126025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126025/testReport)** for PR 29117 at commit [`f6207b0`](https://github.com/apache/spark/commit/f6207b038a67b575b65ed4adba1c407b6a0d0ecd).
* This patch **fails PySpark pip packaging tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI

2020-07-16 Thread GitBox


SparkQA commented on pull request #29015:
URL: https://github.com/apache/spark/pull/29015#issuecomment-659844348


   **[Test build #126012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126012/testReport)** for PR 29015 at commit [`31b231e`](https://github.com/apache/spark/commit/31b231e1b0a984ebdfc408beedaadeec6881ddff).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `  case class DecommissionWorkersOnHosts(hostnames: Seq[String])`






[GitHub] [spark] SparkQA removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659834294


   **[Test build #126025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126025/testReport)** for PR 29117 at commit [`f6207b0`](https://github.com/apache/spark/commit/f6207b038a67b575b65ed4adba1c407b6a0d0ecd).






[GitHub] [spark] AmplabJenkins commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659844490










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659844490


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29014: [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29014:
URL: https://github.com/apache/spark/pull/29014#issuecomment-659840501










[GitHub] [spark] SparkQA commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


SparkQA commented on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659840659


   **[Test build #126026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126026/testReport)** for PR 28840 at commit [`fc4789f`](https://github.com/apache/spark/commit/fc4789fcb5357bd1a7cfc88b76c7d76822457db7).






[GitHub] [spark] AmplabJenkins commented on pull request #29014: [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29014:
URL: https://github.com/apache/spark/pull/29014#issuecomment-659840501










[GitHub] [spark] SparkQA removed a comment on pull request #29014: [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29014:
URL: https://github.com/apache/spark/pull/29014#issuecomment-659773941


   **[Test build #126013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126013/testReport)** for PR 29014 at commit [`1903e59`](https://github.com/apache/spark/commit/1903e59ca08600d457b1cb0a2e223ba4629a4a46).






[GitHub] [spark] SparkQA commented on pull request #29014: [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

2020-07-16 Thread GitBox


SparkQA commented on pull request #29014:
URL: https://github.com/apache/spark/pull/29014#issuecomment-659839803


   **[Test build #126013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126013/testReport)** for PR 29014 at commit [`1903e59`](https://github.com/apache/spark/commit/1903e59ca08600d457b1cb0a2e223ba4629a4a46).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean) `
 * `case class ExecutorProcessLost(`






[GitHub] [spark] viirya commented on a change in pull request #29067: [SPARK-32274][SQL] Make SQL cache serialization pluggable

2020-07-16 Thread GitBox


viirya commented on a change in pull request #29067:
URL: https://github.com/apache/spark/pull/29067#discussion_r456210407



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
##
@@ -19,84 +19,301 @@ package org.apache.spark.sql.execution.columnar
 
 import org.apache.commons.lang3.StringUtils
 
+import org.apache.spark.TaskContext
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.internal.Logging
 import org.apache.spark.network.util.JavaUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
+import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.QueryPlan
-import org.apache.spark.sql.catalyst.plans.logical
+import org.apache.spark.sql.catalyst.plans.{logical, QueryPlan}
 import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector, WritableColumnVector}
+import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
+import org.apache.spark.sql.types.{AtomicType, BinaryType, StructType, UserDefinedType}
+import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}
 import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.LongAccumulator
+import org.apache.spark.util.{LongAccumulator, Utils}
 
+/**
+ * Basic interface that all cached batches of data must support. This is primarily to allow
+ * for metrics to be handled outside of the encoding and decoding steps in a standard way.
+ */
+@DeveloperApi
+@Since("3.1.0")
+trait CachedBatch {
+  def numRows: Int
+  def sizeInBytes: Long
+}
 
 /**
- * CachedBatch is a cached batch of rows.
- *
- * @param numRows The total number of rows in this batch
- * @param buffers The buffers for serialized columns
- * @param stats The stat of columns
+ * Provides APIs for compressing, filtering, and decompressing SQL data that will be
+ * persisted/cached.
  */
-private[columnar]
-case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: InternalRow)
+@DeveloperApi
+@Since("3.1.0")
+trait CachedBatchSerializer extends Serializable {
+  /**
+   * Run the given plan and convert its output to a implementation of [[CachedBatch]].
+   * @param cachedPlan the plan to run.
+   * @return the RDD containing the batches of data to cache.
+   */
+  def convertForCache(cachedPlan: SparkPlan): RDD[CachedBatch]
+
+  /**
+   * Builds a function that can be used to filter which batches are loaded.
+   * In most cases extending [[SimpleMetricsCachedBatchSerializer]] will provide what
+   * you need with the added expense of calculating the min and max value for some

Review comment:
   Looks like the stats calculation is in `DefaultCachedBatchSerializer`, 
instead of `SimpleMetricsCachedBatchSerializer`?
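
As a rough illustration of the `CachedBatch` trait quoted in the diff above, the sketch below shows how row and size metrics could be gathered through the trait alone, independently of how a batch is encoded. The trait is redeclared locally so that the example compiles without Spark; `ByteArrayCachedBatch` and `CacheMetrics` are hypothetical names and do not appear in the PR.

```scala
// Local copy of the two-method trait from the diff, so the sketch needs no Spark dependency.
trait CachedBatch {
  def numRows: Int
  def sizeInBytes: Long
}

// Hypothetical implementation: each column is held as an already-serialized byte array.
final case class ByteArrayCachedBatch(numRows: Int, buffers: Array[Array[Byte]])
    extends CachedBatch {
  // The size is simply the sum of the column buffer lengths.
  override def sizeInBytes: Long = buffers.iterator.map(_.length.toLong).sum
}

// Hypothetical metrics helper: it only sees the trait, never the encoding.
object CacheMetrics {
  def totalRows(batches: Seq[CachedBatch]): Long =
    batches.iterator.map(_.numRows.toLong).sum

  def totalBytes(batches: Seq[CachedBatch]): Long =
    batches.iterator.map(_.sizeInBytes).sum
}
```

Because `CacheMetrics` depends only on `numRows` and `sizeInBytes`, a serializer could swap in a different batch layout without touching the metrics code. Where per-column statistics such as min and max are computed is a separate question, which is the one raised in the comment above.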








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #28852:
URL: https://github.com/apache/spark/pull/28852#issuecomment-659838382










[GitHub] [spark] AmplabJenkins commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #28852:
URL: https://github.com/apache/spark/pull/28852#issuecomment-659838382










[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r456209080



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,30 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+val previousValue = SQLConf.get.getConf(StaticSQLConf.METADATA_CACHE_TTL)
+try {
+  SQLConf.get.setConf(StaticSQLConf.METADATA_CACHE_TTL, 1L)

Review comment:
   added a comment in 
https://github.com/apache/spark/pull/28852/commits/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5

##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##
@@ -226,4 +227,16 @@ object StaticSQLConf {
   .version("3.0.0")
   .intConf
   .createWithDefault(100)
+
+  val METADATA_CACHE_TTL = buildStaticConf("spark.sql.metadataCacheTTL")

Review comment:
   Updated in 
https://github.com/apache/spark/pull/28852/commits/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5
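
The test in the first hunk above saves the previous value of the TTL setting, overrides it, and is expected to restore it afterwards. A generic version of that save, override, and restore pattern might look like the sketch below. It uses a plain in-memory store rather than Spark's `SQLConf`; `Settings` and `withSetting` are hypothetical names introduced only for illustration.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mutable configuration store; this is not Spark's SQLConf.
final class Settings {
  private val values = mutable.Map.empty[String, Long]
  def get(key: String): Option[Long] = values.get(key)
  def set(key: String, value: Long): Unit = values.update(key, value)
  def unset(key: String): Unit = { values.remove(key); () }
}

object Settings {
  // Sets `key` to `value` while `body` runs, then restores the prior state,
  // even if `body` throws.
  def withSetting[T](settings: Settings, key: String, value: Long)(body: => T): T = {
    val previous = settings.get(key)
    settings.set(key, value)
    try body
    finally {
      previous match {
        case Some(old) => settings.set(key, old)
        case None      => settings.unset(key)
      }
    }
  }
}
```

Putting the restore step in `finally` keeps one test's override from leaking into later tests when an assertion fails partway through.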








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659834167


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126011/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659834162


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659834162










[GitHub] [spark] SparkQA commented on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


SparkQA commented on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659834294


   **[Test build #126025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126025/testReport)** for PR 29117 at commit [`f6207b0`](https://github.com/apache/spark/commit/f6207b038a67b575b65ed4adba1c407b6a0d0ecd).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29135: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29135:
URL: https://github.com/apache/spark/pull/29135#issuecomment-659833089


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126016/
   Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659773860


   **[Test build #126011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126011/testReport)** for PR 29032 at commit [`9356fac`](https://github.com/apache/spark/commit/9356facb887328a2e781f46dc533f41eb6751392).






[GitHub] [spark] viirya commented on a change in pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0

2020-07-16 Thread GitBox


viirya commented on a change in pull request #29114:
URL: https://github.com/apache/spark/pull/29114#discussion_r456207305



##
File path: LICENSE
##
@@ -229,7 +229,7 @@ BSD 3-Clause
 
 
 python/lib/py4j-*-src.zip
-python/pyspark/cloudpickle.py
+python/pyspark/cloudpickle/*.py

Review comment:
   I think it is not a big deal to have BSD 3 for the other two files. :)








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0

2020-07-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29114:
URL: https://github.com/apache/spark/pull/29114#discussion_r456207338



##
File path: LICENSE
##
@@ -229,7 +229,7 @@ BSD 3-Clause
 
 
 python/lib/py4j-*-src.zip
-python/pyspark/cloudpickle.py
+python/pyspark/cloudpickle/*.py

Review comment:
   I think we ported `__init__.py` and `compat.py` here too and it includes 
all.
   
   Looks like their license wasn't changed from the very first release 
(https://github.com/cloudpipe/cloudpickle/blob/master/LICENSE). I guess it 
might be fine..








[GitHub] [spark] SparkQA commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


SparkQA commented on pull request #29032:
URL: https://github.com/apache/spark/pull/29032#issuecomment-659833600


   **[Test build #126011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126011/testReport)** for PR 29032 at commit [`9356fac`](https://github.com/apache/spark/commit/9356facb887328a2e781f46dc533f41eb6751392).
* This patch **fails PySpark pip packaging tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)`
 * `  case class DecommissionExecutor(executorId: String, decommissionInfo: ExecutorDecommissionInfo)`






[GitHub] [spark] AmplabJenkins commented on pull request #29135: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #29135:
URL: https://github.com/apache/spark/pull/29135#issuecomment-659833081










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659833113










[GitHub] [spark] viirya commented on a change in pull request #29067: [SPARK-32274][SQL] Make SQL cache serialization pluggable

2020-07-16 Thread GitBox


viirya commented on a change in pull request #29067:
URL: https://github.com/apache/spark/pull/29067#discussion_r456206490



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
##
@@ -19,84 +19,301 @@ package org.apache.spark.sql.execution.columnar
 
 import org.apache.commons.lang3.StringUtils
 
+import org.apache.spark.TaskContext
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.internal.Logging
 import org.apache.spark.network.util.JavaUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
+import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.QueryPlan
-import org.apache.spark.sql.catalyst.plans.logical
+import org.apache.spark.sql.catalyst.plans.{logical, QueryPlan}
 import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector, WritableColumnVector}
+import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
+import org.apache.spark.sql.types.{AtomicType, BinaryType, StructType, UserDefinedType}
+import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}
 import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.LongAccumulator
+import org.apache.spark.util.{LongAccumulator, Utils}
 
+/**
+ * Basic interface that all cached batches of data must support. This is primarily to allow
+ * for metrics to be handled outside of the encoding and decoding steps in a standard way.
+ */
+@DeveloperApi
+@Since("3.1.0")
+trait CachedBatch {
+  def numRows: Int
+  def sizeInBytes: Long
+}
 
 /**
- * CachedBatch is a cached batch of rows.
- *
- * @param numRows The total number of rows in this batch
- * @param buffers The buffers for serialized columns
- * @param stats The stat of columns
+ * Provides APIs for compressing, filtering, and decompressing SQL data that will be
+ * persisted/cached.
  */
-private[columnar]
-case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: InternalRow)
+@DeveloperApi
+@Since("3.1.0")
+trait CachedBatchSerializer extends Serializable {
+  /**
+   * Run the given plan and convert its output to an implementation of [[CachedBatch]].
+   * @param cachedPlan the plan to run.
+   * @return the RDD containing the batches of data to cache.
+   */
+  def convertForCache(cachedPlan: SparkPlan): RDD[CachedBatch]
+
+  /**
+   * Builds a function that can be used to filter which batches are loaded.
+   * In most cases extending [[SimpleMetricsCachedBatchSerializer]] will provide what
+   * you need with the added expense of calculating the min and max value for some
+   * data columns, depending on their data type. Note that this is intended to skip batches
+   * that are not needed, and the actual filtering of individual rows is handled later.
+   * @param predicates the set of expressions to use for filtering.
+   * @param cachedAttributes the schema/attributes of the data that is cached. This can be helpful
+   * if you don't store it with the data.
+   * @return a function that takes the partition id and the iterator of batches in the partition.
+   * It returns an iterator of batches that should be loaded.
+   */
+  def buildFilter(predicates: Seq[Expression],
+  cachedAttributes: Seq[Attribute]): (Int, Iterator[CachedBatch]) => Iterator[CachedBatch]
+
+  /**
+   * Decompress the cached data into a ColumnarBatch. This currently is only used for basic types
+   * BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType
+   * That may change in the future.
+   * @param input the cached batches that should be decompressed.
+   * @param cacheAttributes the attributes of the data in the batch.
+   * @param selectedAttributes the field that should be loaded from the data, and the order they
+   *   should appear in the output batch.
+   * @param conf the configuration for the job.
+   * @return an RDD of the input cached batches transformed into the ColumnarBatch format.
+   */
+  def decompressColumnar(
+  input: RDD[CachedBatch],
+  cacheAttributes: Seq[Attribute],
+  selectedAttributes: Seq[Attribute],
+  conf: SQLConf): RDD[ColumnarBatch]
+
+  /**
+   * Decompress the cached batch into [[InternalRow]]. If you want this to be performant, code
+   * generation is advised.
+   * @param input the cached batches that should be decompressed.
+   * @param cacheAttributes the attributes of the data in the batch.
+   * @param selectedAttributes the field that should be loaded from the data, and the order they
+   

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29135: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29135:
URL: https://github.com/apache/spark/pull/29135#issuecomment-659833081


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-07-16 Thread GitBox


AmplabJenkins commented on pull request #28840:
URL: https://github.com/apache/spark/pull/28840#issuecomment-659833113










[GitHub] [spark] SparkQA commented on pull request #29020: [SPARK-23431][CORE] Expose stage level peak executor metrics via REST API

2020-07-16 Thread GitBox


SparkQA commented on pull request #29020:
URL: https://github.com/apache/spark/pull/29020#issuecomment-659833193


   **[Test build #126024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126024/testReport)** for PR 29020 at commit [`895f5fd`](https://github.com/apache/spark/commit/895f5fdaf21e36d5023f159f926c58b13d136021).






[GitHub] [spark] SparkQA removed a comment on pull request #29135: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

2020-07-16 Thread GitBox


SparkQA removed a comment on pull request #29135:
URL: https://github.com/apache/spark/pull/29135#issuecomment-659810216


   **[Test build #126016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126016/testReport)** for PR 29135 at commit [`fefbce0`](https://github.com/apache/spark/commit/fefbce04af1c62f02870a79686a54e7669584a69).






[GitHub] [spark] SparkQA commented on pull request #29135: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

2020-07-16 Thread GitBox


SparkQA commented on pull request #29135:
URL: https://github.com/apache/spark/pull/29135#issuecomment-659832907


   **[Test build #126016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126016/testReport)** for PR 29135 at commit [`fefbce0`](https://github.com/apache/spark/commit/fefbce04af1c62f02870a79686a54e7669584a69).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0

2020-07-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29114:
URL: https://github.com/apache/spark/pull/29114#discussion_r456206191



##
File path: LICENSE
##
@@ -229,7 +229,7 @@ BSD 3-Clause
 
 
 python/lib/py4j-*-src.zip
-python/pyspark/cloudpickle.py
+python/pyspark/cloudpickle/*.py

Review comment:
   Oh








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0

2020-07-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29114:
URL: https://github.com/apache/spark/pull/29114#discussion_r456188555



##
File path: LICENSE
##
@@ -229,7 +229,7 @@ BSD 3-Clause
 
 
 python/lib/py4j-*-src.zip
-python/pyspark/cloudpickle.py
+python/pyspark/cloudpickle/*.py

Review comment:
   It catches the directory `python/pyspark/cloudpickle/` :-).








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29117: [WIP] Debug flaky pip installation test failure

2020-07-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29117:
URL: https://github.com/apache/spark/pull/29117#issuecomment-659830392









