[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning

2020-06-26 Thread GitBox


imback82 commented on a change in pull request #28676:
URL: https://github.com/apache/spark/pull/28676#discussion_r446489214



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##
@@ -554,7 +554,7 @@ class AdaptiveQueryExecSuite
   val smj = findTopLevelSortMergeJoin(plan)
   assert(smj.size == 2)
   val smj2 = findTopLevelSortMergeJoin(adaptivePlan)
-  assert(smj2.size == 2, origPlan.toString)
+  assert(smj2.size == 1, origPlan.toString)

Review comment:
   Simply changing it to an outer join may not work. For example,
   ```
   SELECT * FROM t1 LEFT JOIN t2 ON t1.a = t2.c LEFT JOIN t2 AS t3 ON t2.c = t3.c
   ```
   For the left outer join between `t1` and `t2`, only the right side (`t2`) can be built, but the output partitioning of this join comes from the left side (`t1`). Thus, the join between `t2` and `t3` will always introduce a shuffle, and this will not help produce the higher cost.
   
   (On top of this, adding any `WHERE` clause would convert the outer join to an inner join. :))
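   The asymmetry described above can be sketched in plain Scala (a toy model with made-up names, not Spark's actual `Partitioning` classes): for an inner join the output can advertise both sides' hash partitionings, while for a left outer join the build-side keys may be null-padded, so only the streamed (left) side's partitioning survives.
   ```scala
   // Toy model of join output partitioning (hypothetical names, not Spark's API).
   sealed trait Partitioning
   case class HashPartitioning(keys: Seq[String]) extends Partitioning

   sealed trait JoinType
   case object Inner extends JoinType
   case object LeftOuter extends JoinType

   // Inner join: rows from both sides survive intact, so both the streamed-side
   // and build-side key partitionings are valid on the output.
   // Left outer join: only the streamed side's partitioning remains valid.
   def outputPartitionings(
       joinType: JoinType,
       streamed: HashPartitioning,
       buildKeys: Seq[String]): Seq[Partitioning] = joinType match {
     case Inner => Seq(streamed, HashPartitioning(buildKeys))
     case LeftOuter => Seq(streamed)
   }

   // t1 LEFT JOIN t2 ON t1.a = t2.c: the output is only partitioned by t1.a,
   // so the subsequent join on t2.c still needs a shuffle.
   val outer = outputPartitionings(LeftOuter, HashPartitioning(Seq("t1.a")), Seq("t2.c"))
   assert(!outer.contains(HashPartitioning(Seq("t2.c"))))
   ```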





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28866: [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing completeAndCheckAnswer and using completeNextStageWithFetc

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650495643


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124556/
   Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #28866: [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing completeAndCheckAnswer and using completeNextStageWithFetchFailure

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650495636










[GitHub] [spark] AngersZhuuuu commented on a change in pull request #27983: [SPARK-32105][SQL]Implement ScriptTransformation in sql/core

2020-06-26 Thread GitBox


AngersZhuuuu commented on a change in pull request #27983:
URL: https://github.com/apache/spark/pull/27983#discussion_r446486779



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/script/ScriptTransformationExec.scala
##
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.script
+
+import java.io._
+import java.nio.charset.StandardCharsets
+import java.sql.Date
+
+import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.ScriptInputOutputSchema
+import org.apache.spark.sql.catalyst.plans.physical.Partitioning
+import org.apache.spark.sql.catalyst.util.{DateFormatter, DateTimeUtils, TimestampFormatter}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLTimestamp
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.types.{DataType, DateType, TimestampType}
+import org.apache.spark.util.{CircularBuffer, RedirectThread}
+
+/**
+ * Transforms the input by forking and running the specified script.
+ *
+ * @param input the set of expressions that should be passed to the script.
+ * @param script the command that should be executed.
+ * @param output the attributes that are produced by the script.
+ */
+case class ScriptTransformationExec(
+    input: Seq[Expression],
+    script: String,
+    output: Seq[Attribute],
+    child: SparkPlan,
+    ioschema: ScriptTransformIOSchema)
+  extends ScriptTransformBase {
+
+  override def producedAttributes: AttributeSet = outputSet -- inputSet
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def processIterator(inputIterator: Iterator[InternalRow], hadoopConf: Configuration)
+  : Iterator[InternalRow] = {
+    val cmd = List("/bin/bash", "-c", script)
+    val builder = new ProcessBuilder(cmd.asJava)
+
+    val proc = builder.start()
+    val inputStream = proc.getInputStream
+    val outputStream = proc.getOutputStream
+    val errorStream = proc.getErrorStream
+
+    // In order to avoid deadlocks, we need to consume the error output of the child process.
+    // To avoid issues caused by large error output, we use a circular buffer to limit the amount
+    // of error output that we retain. See SPARK-7862 for more discussion of the deadlock / hang
+    // that motivates this.
+    val stderrBuffer = new CircularBuffer(2048)
+    new RedirectThread(
+      errorStream,
+      stderrBuffer,
+      "Thread-ScriptTransformation-STDERR-Consumer").start()
+
+    val outputProjection = new InterpretedProjection(input, child.output)
+
+    // This new thread will consume the ScriptTransformation's input rows and write them to the
+    // external process. That process's output will be read by this current thread.
+    val writerThread = new ScriptTransformationWriterThread(
+      inputIterator.map(outputProjection),
+      input.map(_.dataType),
+      ioschema,
+      outputStream,
+      proc,
+      stderrBuffer,
+      TaskContext.get(),
+      hadoopConf
+    )
+
+    val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
+    val outputIterator: Iterator[InternalRow] = new Iterator[InternalRow] {
+      var curLine: String = null
+      val mutableRow = new SpecificInternalRow(output.map(_.dataType))

Review comment:
   > It seems unused.
   
   Removed it now; I will address this in the next PR.
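   The stderr-handling pattern in the diff above (drain the child's error stream on a separate thread into a bounded buffer so the child never blocks on a full stderr pipe) can be sketched in plain Scala. This is a minimal toy model with hypothetical names, not Spark's `CircularBuffer`/`RedirectThread`:
   ```scala
   import java.io.{ByteArrayInputStream, InputStream}

   // Bounded ring buffer: retains only the last `capacity` bytes written.
   class RingBuffer(capacity: Int) {
     private val buf = new Array[Byte](capacity)
     private var pos = 0
     private var filled = false
     def write(b: Int): Unit = synchronized {
       buf(pos) = b.toByte
       pos = (pos + 1) % capacity
       if (pos == 0) filled = true
     }
     override def toString: String = synchronized {
       // Oldest byte sits at `pos` once the buffer has wrapped around.
       val live = if (filled) buf.drop(pos) ++ buf.take(pos) else buf.take(pos)
       new String(live.map(_.toChar))
     }
   }

   // Consume the stream on a daemon thread so the producer is never blocked
   // by an unread stderr pipe (the deadlock SPARK-7862 describes).
   def drain(in: InputStream, sink: RingBuffer): Thread = {
     val t = new Thread(new Runnable {
       override def run(): Unit = {
         var b = in.read()
         while (b != -1) { sink.write(b); b = in.read() }
       }
     }, "stderr-drainer")
     t.setDaemon(true)
     t.start()
     t
   }

   val sink = new RingBuffer(8)
   val t = drain(new ByteArrayInputStream("hello, world".getBytes("UTF-8")), sink)
   t.join()
   assert(sink.toString == "o, world") // only the last 8 bytes are retained
   ```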








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28866: [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing completeAndCheckAnswer and using completeNextStageWithFetc

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650495636


   Merged build finished. Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #28866: [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing completeAndCheckAnswer and using completeNextStageWithFetchFailu

2020-06-26 Thread GitBox


SparkQA removed a comment on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650475411


   **[Test build #124556 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124556/testReport)**
 for PR 28866 at commit 
[`d1742fe`](https://github.com/apache/spark/commit/d1742fe43be3f1f66a557f56ff15299215a59684).






[GitHub] [spark] SparkQA commented on pull request #28866: [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing completeAndCheckAnswer and using completeNextStageWithFetchFailure

2020-06-26 Thread GitBox


SparkQA commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650495387


   **[Test build #124556 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124556/testReport)**
 for PR 28866 at commit 
[`d1742fe`](https://github.com/apache/spark/commit/d1742fe43be3f1f66a557f56ff15299215a59684).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] viirya commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


viirya commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446485414



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
       options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is ruled by 'hive.exec.scratchdir' including file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+    // Here it uses session_path unless it's empty, otherwise uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   `oldVersionExternalTempPath` will use the first `Path` parameter's 
scheme to create the temp path.
   
   Does it guarantee `Path(sessionPath)` always uses HDFS scheme here?
   
   If `sessionPath` is a HDFS path, it is ok. But if we use `scratchDir` as 
`sessionPath` here, it will be something like `Path("/tmp/hive")`, is it still 
HDFS scheme?
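   One way to see the concern, using plain `java.net.URI` rather than Hadoop's `Path` (the namenode address below is made up): a bare directory string carries no scheme at all, so Hadoop would resolve it against the configured default filesystem (`fs.defaultFS`) at runtime, which may or may not be HDFS.
   ```scala
   import java.net.URI

   // A fully qualified session path carries an explicit scheme.
   val sessionPath = new URI("hdfs://nn:8020/tmp/hive/session")
   // A bare scratch dir like "/tmp/hive" has no scheme; which filesystem it
   // lands on depends entirely on configuration.
   val scratchDir = new URI("/tmp/hive")

   assert(sessionPath.getScheme == "hdfs")
   assert(scratchDir.getScheme == null) // scheme-less: not necessarily HDFS
   ```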








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650492074


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124555/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650492070


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650492070










[GitHub] [spark] SparkQA commented on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


SparkQA commented on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650491867


   **[Test build #124555 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124555/testReport)**
 for PR 28895 at commit 
[`2994050`](https://github.com/apache/spark/commit/29940501f8b5ffb845ad24c24d53234d71020881).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


SparkQA removed a comment on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650475412


   **[Test build #124555 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124555/testReport)**
 for PR 28895 at commit 
[`2994050`](https://github.com/apache/spark/commit/29940501f8b5ffb845ad24c24d53234d71020881).






[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning

2020-06-26 Thread GitBox


imback82 commented on a change in pull request #28676:
URL: https://github.com/apache/spark/pull/28676#discussion_r446484892



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##
@@ -60,6 +60,26 @@ case class BroadcastHashJoinExec(
 }
   }
 
+  override def outputPartitioning: Partitioning = {
+    def buildKeys: Seq[Expression] = buildSide match {
+      case BuildLeft => leftKeys
+      case BuildRight => rightKeys
+    }
+
+    joinType match {
+      case _: InnerLike =>
+        streamedPlan.outputPartitioning match {
+          case h: HashPartitioning =>
+            PartitioningCollection(Seq(h, HashPartitioning(buildKeys, h.numPartitions)))

Review comment:
   I added this check, but by the time `BroadcastHashJoinExec` executes, the streamed side (`t1`) should already have been shuffled by `t1.a`, even if `t1` was originally partitioned by an unrelated column `c`, no?








[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-650488630










[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-650488630










[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


SparkQA commented on pull request #27690:
URL: https://github.com/apache/spark/pull/27690#issuecomment-650488481


   **[Test build #124560 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124560/testReport)**
 for PR 27690 at commit 
[`1cec7a4`](https://github.com/apache/spark/commit/1cec7a44eabe69794a1636d7090a8a58859b01c3).






[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446482526



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+    logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+      s"scratch dir '$scratchDir' are used")
 
     if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
       oldVersionExternalTempPath(path, hadoopConf, scratchDir)
     } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-      newVersionExternalTempPath(path, hadoopConf, stagingDir)
+      // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3
+      // Copied from Context.java#getTempDirForPath of Hive 2.3.
+      if (supportSchemeToUseNonBlobStore(path)) {
+        // Hive sets session_path as HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+        val HDFS_SESSION_PATH_KEY = "_hive.hdfs.session.path"
+        val sessionScratchDir = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+          .client.getConf(HDFS_SESSION_PATH_KEY, "")
+        logDebug(s"session scratch dir '$sessionScratchDir' is used")
+        getMRTmpPath(hadoopConf, sessionScratchDir, scratchDir)

Review comment:
   Yes, Hive's `scratchDir` is still used in Hive 2.0 or later.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446482475



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+    logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+      s"scratch dir '$scratchDir' are used")
 
     if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
       oldVersionExternalTempPath(path, hadoopConf, scratchDir)
     } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-      newVersionExternalTempPath(path, hadoopConf, stagingDir)
+      // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3

Review comment:
   Added version check using `hiveVersionsSupportingNonBlobstoreTempPath`.
   
   ```
   val hiveVersionsSupportingNonBlobstoreTempPath: Set[HiveVersion] =
     Set(v2_0, v2_1, v2_2, v2_3, v3_0, v3_1)
   ```








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446482392



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,7 +99,34 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getMRTmpPath(

Review comment:
   Renamed this method from `getMRTmpPath` to `getNonBlobTmpPath`.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487591


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124559/
   Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


SparkQA removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487372


   **[Test build #124559 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124559/testReport)**
 for PR 28880 at commit 
[`b42c3ed`](https://github.com/apache/spark/commit/b42c3ed7498bad6d3607978c0c19979828da9ef0).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487589


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487589










[GitHub] [spark] SparkQA commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


SparkQA commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487582


   **[Test build #124559 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124559/testReport)**
 for PR 28880 at commit 
[`b42c3ed`](https://github.com/apache/spark/commit/b42c3ed7498bad6d3607978c0c19979828da9ef0).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487532










[GitHub] [spark] AmplabJenkins commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487532










[GitHub] [spark] SparkQA commented on pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-26 Thread GitBox


SparkQA commented on pull request #28880:
URL: https://github.com/apache/spark/pull/28880#issuecomment-650487372


   **[Test build #124559 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124559/testReport)** for PR 28880 at commit [`b42c3ed`](https://github.com/apache/spark/commit/b42c3ed7498bad6d3607978c0c19979828da9ef0).






[GitHub] [spark] AmplabJenkins commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-650481570










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-650481570


   Merged build finished. Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650480608


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124558/
   Test PASSed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650480605


   Merged build finished. Test PASSed.






[GitHub] [spark] AmplabJenkins commented on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650480605










[GitHub] [spark] SparkQA removed a comment on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


SparkQA removed a comment on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650478040


   **[Test build #124558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124558/testReport)** for PR 28931 at commit [`3e2eb3a`](https://github.com/apache/spark/commit/3e2eb3a11ee993605ca94779e4b8642a9cad2f3f).






[GitHub] [spark] SparkQA commented on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


SparkQA commented on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650480574


   **[Test build #124558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124558/testReport)** for PR 28931 at commit [`3e2eb3a`](https://github.com/apache/spark/commit/3e2eb3a11ee993605ca94779e4b8642a9cad2f3f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476737



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Like the `getReader` description, `endMapIndex == Int.MaxValue` seems worth mentioning here as well, because `getMapSizesByExecutorId` is a different function.
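
   For context on the range convention under discussion, here is a small, hypothetical Scala sketch (not the actual `MapOutputTracker` API) of the half-open semantics the doc comment describes — start indices included, end indices excluded — plus an `Int.MaxValue` sentinel for `endMapIndex` that resolves to the actual number of map outputs. The helper and its names are illustrative assumptions only:

   ```scala
   object MapRangeSemantics {
     // Hypothetical helper: resolves the [startMapIndex, endMapIndex) range,
     // treating Int.MaxValue as "all remaining mappers".
     def resolveMapRange(startMapIndex: Int, endMapIndex: Int, numMapOutputs: Int): Range = {
       val actualEndMapIndex =
         if (endMapIndex == Int.MaxValue) numMapOutputs else endMapIndex
       // `until` is end-exclusive, matching "endMapIndex is excluded".
       startMapIndex until actualEndMapIndex
     }

     def main(args: Array[String]): Unit = {
       // Full range via the sentinel: covers all 5 map outputs.
       println(resolveMapRange(0, Int.MaxValue, numMapOutputs = 5).mkString(","))  // 0,1,2,3,4
       // Explicit sub-range: mappers 2 and 3 only (4 is excluded).
       println(resolveMapRange(2, 4, numMapOutputs = 5).mkString(","))             // 2,3
     }
   }
   ```

   Documenting the sentinel this way would make the inclusive/exclusive behavior explicit without callers having to read the implementation.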








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476737



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Specifically, `endMapIndex == Int.MaxValue` seems worth mentioning here.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476096



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Hi, @Ngone51. This should be the function description of the unified `getMapSizesByExecutorId`. Did I understand correctly? If so, could you add a comment about `startMapIndex` and `endMapIndex`, and about the case where they are effectively ignored because of `actualEndMapIndex`? (You don't need to mention that variable name specifically.)








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476096



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Hi, @Ngone51. This should be the function description of the unified `getMapSizesByExecutorId`. Did I understand correctly? If so, could you add a comment about the relationship between `startMapIndex`, `endMapIndex`, and `actualEndMapIndex`?








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476096



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Hi, @Ngone51. This should be the function description of the unified `getMapSizesByExecutorId`. Did I understand correctly? If so, could you add a comment about `startMapIndex` and `endMapIndex`?








[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

2020-06-26 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-650478832


   > The PR has lots of changed lines whose only change is indentation. Indentation is part of the style guide, and the original lines didn't seem to violate it (that said, these changed lines do).
   > 
   > https://github.com/databricks/scala-style-guide
   > 
   > Could you please go through the PR changes and revert these whitespace-only changes? Thanks in advance!
   
   Very good point.  I will correct.






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476096



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Hi, @Ngone51. This should be the function description of the unified `getMapSizesByExecutorId`. Did I understand correctly?








[GitHub] [spark] HeartSaVioR edited a comment on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

2020-06-26 Thread GitBox


HeartSaVioR edited a comment on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-650477921


   The PR has lots of changed lines whose only change is indentation. Indentation is part of the style guide, and the original lines didn't seem to violate it (that said, these changed lines do).
   
   https://github.com/databricks/scala-style-guide
   
   Could you please go through the PR changes and revert these whitespace-only changes? Thanks in advance!






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28895:
URL: https://github.com/apache/spark/pull/28895#discussion_r446476096



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -335,23 +335,6 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
 * tuples describing the shuffle blocks that are stored at that block manager.
 */
   def getMapSizesByExecutorId(
-  shuffleId: Int,
-  startPartition: Int,
-  endPartition: Int)
-  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]
-
-  /**
-   * Called from executors to get the server URIs and output sizes for each shuffle block that
-   * needs to be read from a given range of map output partitions (startPartition is included but
-   * endPartition is excluded from the range) and is produced by
-   * a range of mappers (startMapIndex, endMapIndex, startMapIndex is included and
-   * the endMapIndex is excluded).

Review comment:
   Hi, @Ngone51. This should be the function description of the unified `getMapSizesByExecutorId`.








[GitHub] [spark] AmplabJenkins commented on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650478156










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650478156










[GitHub] [spark] SparkQA commented on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


SparkQA commented on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650478040


   **[Test build #124558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124558/testReport)** for PR 28931 at commit [`3e2eb3a`](https://github.com/apache/spark/commit/3e2eb3a11ee993605ca94779e4b8642a9cad2f3f).






[GitHub] [spark] HeartSaVioR commented on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

2020-06-26 Thread GitBox


HeartSaVioR commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-650477921


   The PR has lots of changed lines whose only change is indentation. Indentation is part of the style guide, and the original lines didn't seem to violate it (that said, these changed lines do).
   
   Could you please go through the PR changes and revert these whitespace-only changes? Thanks in advance!






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28931:
URL: https://github.com/apache/spark/pull/28931#discussion_r446475568



##
File path: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala
##
@@ -107,7 +107,7 @@ private[spark] class YarnRMClient extends Logging {
 // so not all stable releases have it.
 val prefix = WebAppUtils.getHttpSchemePrefix(conf)
 val proxies = WebAppUtils.getProxyHostsAndPortsForAmFilter(conf)
-val hosts = proxies.asScala.map(_.split(":").head)
-+val hosts = proxies.asScala.map(proxy => proxy.splitAt(proxy.lastIndexOf(":"))._1)

Review comment:
   Hi, @PavithraRamachandran. If possible, could you check the code base to see whether this is the last such instance?
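
   The difference between the two lines in the diff can be seen with a quick sketch. The proxy addresses below are made-up examples; the point is that `split(":").head` truncates a bracketed IPv6 literal at its first colon, while splitting at the last colon keeps the host intact:

   ```scala
   object HostPortSplit {
     // Mirrors the fix in the diff: keep everything before the LAST colon.
     def hostOf(proxy: String): String = proxy.splitAt(proxy.lastIndexOf(":"))._1

     def main(args: Array[String]): Unit = {
       val ipv4Proxy = "rm1.example.com:8088"  // made-up hostname:port
       val ipv6Proxy = "[2001:db8::1]:8088"    // made-up bracketed IPv6 host:port

       // Old logic: splits at the FIRST colon, which mangles IPv6 literals.
       println(ipv4Proxy.split(":").head)  // rm1.example.com
       println(ipv6Proxy.split(":").head)  // [2001  (host is truncated)

       // New logic: splits at the LAST colon, correct for both forms.
       println(hostOf(ipv4Proxy))          // rm1.example.com
       println(hostOf(ipv6Proxy))          // [2001:db8::1]
     }
   }
   ```

   Note that `hostOf` assumes a colon is present; in the PR's context the YARN proxy entries are always `host:port` pairs, so `lastIndexOf(":")` is well-defined.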








[GitHub] [spark] dongjoon-hyun commented on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650477700


   ok to test






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28931: [SPARK-32103] Handle host/port split in IPV6 in YarnRMClient

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28931:
URL: https://github.com/apache/spark/pull/28931#issuecomment-650026318


   Can one of the admins verify this patch?






[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


holdenk commented on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-650477676


   Just an FYI to folks: I'm not as active on this PR as I would normally be, as I'm waiting to see where the SPIP discussions go. I'll circle back to this next week.






[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


holdenk commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r446475360



##
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##
@@ -57,7 +57,7 @@ import org.apache.spark.resource._
 import org.apache.spark.resource.ResourceUtils._
 import org.apache.spark.rpc.RpcEndpointRef
 import org.apache.spark.scheduler._
-import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend
+import org.apache.spark.scheduler.cluster.{CoarseGrainedSchedulerBackend, StandaloneSchedulerBackend}

Review comment:
   Good point, yeah I'll take this change out.








[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


SparkQA commented on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-650477379


   **[Test build #124557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124557/testReport)** for PR 28708 at commit [`ac096f4`](https://github.com/apache/spark/commit/ac096f46aa6e658704d7726efe7d66ece280b83e).






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r446475030



##
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##
@@ -57,7 +57,7 @@ import org.apache.spark.resource._
 import org.apache.spark.resource.ResourceUtils._
 import org.apache.spark.rpc.RpcEndpointRef
 import org.apache.spark.scheduler._
-import org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend
+import org.apache.spark.scheduler.cluster.{CoarseGrainedSchedulerBackend, StandaloneSchedulerBackend}

Review comment:
   If this is not required, shall we revert this file?








[GitHub] [spark] dongjoon-hyun commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-650477142


   Retest this please.






[GitHub] [spark] dongjoon-hyun closed pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


dongjoon-hyun closed pull request #28897:
URL: https://github.com/apache/spark/pull/28897


   






[GitHub] [spark] dongjoon-hyun commented on pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28897:
URL: https://github.com/apache/spark/pull/28897#issuecomment-650476825


   Thank you all. Merged to master.






[GitHub] [spark] dongjoon-hyun closed pull request #28866: [SPARK-31845][CORE][TESTS] Refactor DAGSchedulerSuite by introducing completeAndCheckAnswer and using completeNextStageWithFetchFailure

2020-06-26 Thread GitBox


dongjoon-hyun closed pull request #28866:
URL: https://github.com/apache/spark/pull/28866


   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28866: [SPARK-31845][SPARK-31843][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure and DAGSchedulerSuite: For t

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650475583










[GitHub] [spark] AmplabJenkins commented on pull request #28866: [SPARK-31845][SPARK-31843][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure and DAGSchedulerSuite: For the patte

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650475583










[GitHub] [spark] AmplabJenkins commented on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28919:
URL: https://github.com/apache/spark/pull/28919#issuecomment-650475578










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650475560










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28919:
URL: https://github.com/apache/spark/pull/28919#issuecomment-650475578










[GitHub] [spark] AmplabJenkins commented on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650475560










[GitHub] [spark] SparkQA commented on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization

2020-06-26 Thread GitBox


SparkQA commented on pull request #28919:
URL: https://github.com/apache/spark/pull/28919#issuecomment-650475410


   **[Test build #124554 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124554/testReport)**
 for PR 28919 at commit 
[`c1d4321`](https://github.com/apache/spark/commit/c1d4321f0f361aa1b1d33957a945fcbc55cac154).






[GitHub] [spark] SparkQA commented on pull request #28866: [SPARK-31845][SPARK-31843][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure and DAGSchedulerSuite: For the pattern of

2020-06-26 Thread GitBox


SparkQA commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650475411


   **[Test build #124556 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124556/testReport)**
 for PR 28866 at commit 
[`d1742fe`](https://github.com/apache/spark/commit/d1742fe43be3f1f66a557f56ff15299215a59684).






[GitHub] [spark] SparkQA commented on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


SparkQA commented on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650475412


   **[Test build #124555 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124555/testReport)**
 for PR 28895 at commit 
[`2994050`](https://github.com/apache/spark/commit/29940501f8b5ffb845ad24c24d53234d71020881).






[GitHub] [spark] dongjoon-hyun commented on pull request #28866: [SPARK-31845][SPARK-31843][CORE][TESTS] DAGSchedulerSuite: Reuse completeNextStageWithFetchFailure and DAGSchedulerSuite: For the patte

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28866:
URL: https://github.com/apache/spark/pull/28866#issuecomment-650475206


   Retest this please.






[GitHub] [spark] dongjoon-hyun commented on pull request #28895: [SPARK-32055][CORE][SQL] Unify getReader and getReaderForRange in ShuffleManager

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28895:
URL: https://github.com/apache/spark/pull/28895#issuecomment-650475112


   Retest this please.






[GitHub] [spark] dongjoon-hyun commented on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28919:
URL: https://github.com/apache/spark/pull/28919#issuecomment-650474740


   Retest this please.






[GitHub] [spark] dongjoon-hyun closed pull request #28927: [SPARK-32099][DOCS] Remove broken link in cloud integration documentation

2020-06-26 Thread GitBox


dongjoon-hyun closed pull request #28927:
URL: https://github.com/apache/spark/pull/28927


   






[GitHub] [spark] dongjoon-hyun closed pull request #28932: [SPARK-32088][PYTHON] Pin the timezone in timestamp_seconds doctest

2020-06-26 Thread GitBox


dongjoon-hyun closed pull request #28932:
URL: https://github.com/apache/spark/pull/28932


   






[GitHub] [spark] dongjoon-hyun closed pull request #28857: [SPARK-32023][Streaming]Generate spark streaming test jar with maven plugin maven-jar-plugin

2020-06-26 Thread GitBox


dongjoon-hyun closed pull request #28857:
URL: https://github.com/apache/spark/pull/28857


   






[GitHub] [spark] dongjoon-hyun commented on pull request #28857: [SPARK-32023][Streaming]Generate spark streaming test jar with maven plugin maven-jar-plugin

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28857:
URL: https://github.com/apache/spark/pull/28857#issuecomment-650473151


   Oh, thank you for the confirmation, @wankunde . Then, I will close this PR and 
SPARK-32023.
   In any case, thank you for your contribution on this issue. Please don't 
hesitate to report a weird situation like this; we can solve it together! Thanks 
again.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is l

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650472551










[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650472551










[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446036


   **[Test build #124553 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124553/testReport)**
 for PR 28848 at commit 
[`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).






[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650472356


   **[Test build #124553 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124553/testReport)**
 for PR 28848 at commit 
[`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] github-actions[bot] closed pull request #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9

2020-06-26 Thread GitBox


github-actions[bot] closed pull request #25721:
URL: https://github.com/apache/spark/pull/25721


   






[GitHub] [spark] github-actions[bot] closed pull request #27377: [SPARK-30666][Core][WIP] Reliable single-stage accumulators

2020-06-26 Thread GitBox


github-actions[bot] closed pull request #27377:
URL: https://github.com/apache/spark/pull/27377


   






[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

2020-06-26 Thread GitBox


HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-650460286


   I've been running sustained tests (still running); here are some observations:
   
   - SPARK-30946 + SPARK-30462 just wrote compact batch 7309 (73,100,000 entries), whose size is 3.5G, and RES has stayed below 2G (it increases a bit; it may also be better to rerun with a smaller heap). It took 169517 ms (~2.8 mins) to build compact batch 7309, which is similar to the time to build compact batch 479 (4,800,000 entries) without SPARK-30946, i.e. about 15x the entries processed.
   - SPARK-30462 by itself just wrote compact batch 1739 (17,390,000 entries), whose size is 3.1G, and RES has stayed below 2G.
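   A back-of-the-envelope check of those entry counts, assuming the test appends roughly 10,000 new file entries per batch (an assumption; the per-batch rate is not stated in the thread) and that each compact batch rewrites the entire accumulated history, which is why compaction time grows with the batch id:

   ```scala
   // Hedged sketch: assumes ~10,000 entries appended per batch, and that a
   // compact batch rewrites everything accumulated so far.
   object CompactGrowthSketch {
     val entriesPerBatch = 10000L

     // Total entries rewritten when compacting up to and including `batchId`.
     def totalEntries(batchId: Long): Long = (batchId + 1) * entriesPerBatch

     def main(args: Array[String]): Unit = {
       println(totalEntries(7309)) // 73100000, matching the 73,100,000 above
       println(totalEntries(479))  // 4800000, matching the 4,800,000 above
     }
   }
   ```

   Under that assumption, each compact batch rewrites all prior entries, so the cost of compaction grows linearly with the batch id, consistent with the sustained-test numbers quoted above.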






[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

2020-06-26 Thread GitBox


HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-650460286


   I've been running sustained tests (still running); here are some observations:
   
   - SPARK-30946 + SPARK-30462 just wrote compact batch 7309 (73,100,000 entries), whose size is 3.5G, and RES has stayed below 2G. It took 169517 ms (~2.8 mins) to build compact batch 7309, which is similar to the time to build compact batch 479 (4,800,000 entries) without SPARK-30946, i.e. about 15x the entries processed.
   - SPARK-30462 by itself just wrote compact batch 1739 (17,390,000 entries), whose size is 3.1G, and RES has stayed below 2G.






[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446374










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is l

2020-06-26 Thread GitBox


AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446374










[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446036


   **[Test build #124553 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124553/testReport)**
 for PR 28848 at commit 
[`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).






[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446447794



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
   hostToUnregisterOutputs: Option[String],
   maybeEpoch: Option[Long] = None): Unit = {
 val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+logDebug(s"Considering removal of executor $execId; " +
+  s"fileLost: $fileLost, currentEpoch: $currentEpoch")
 if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
   failedEpoch(execId) = currentEpoch
-  logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+  logInfo(s"Executor lost: $execId (epoch $currentEpoch)")

Review comment:
   @dongjoon-hyun based on Attila's and Imran's comments, I'm inclined to 
stick with the string interpolation here and below, since it is all within the 
method I'm actively modifying.
   Are you ok with resolving this then?








[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446446304



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -764,6 +825,7 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
   (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, 
"ignored"), null)))
 // this will get called
 // blockManagerMaster.removeExecutor("hostA-exec")

Review comment:
   I left the original comment as an explanation for the verify. But I can 
remove it.








[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446446147



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -1933,29 +1946,43 @@ private[spark] class DAGScheduler(
   maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as 
unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * The fileLost parameter being true indicates that we assume we've lost all 
shuffle blocks

Review comment:
   Added parameters to doc comment.








[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446445975



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -170,13 +170,29 @@ private[spark] class DAGScheduler(
*/
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, 
which is sent with
-  // every task. When we detect a node failing, we note the current epoch 
number and failed
-  // executor, increment it for new tasks, and use this to ignore stray 
ShuffleMapTask results.
+  // Tracks the latest epoch of a fully processed error related to the given 
executor. (We use
+  // the MapOutputTracker's epoch number, which is sent with every task.)
   //
-  // TODO: Garbage collect information about failure epochs when we know there 
are no more
-  //   stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  // When an executor fails, it can affect the results of many tasks, and we 
have to deal with
+  // all of them consistently. We don't simply ignore all future results from 
that executor,
+  // as the failures may have been transient; but we also don't want to 
"overreact" to follow-
+  // on errors we receive. Furthermore, we might receive notification of a 
task success, after
+  // we find out the executor has actually failed; we'll assume those 
successes are, in fact,
+  // simply delayed notifications and the results have been lost, if they come 
from the same
+  // epoch. In particular, we use this to control when we tell the 
BlockManagerMaster that the
+  // BlockManager has been lost.

Review comment:
   Adopted.
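   The epoch gating described in that comment can be illustrated with a minimal sketch (a simplified illustration, not the actual DAGScheduler code; `EpochGateSketch` and `shouldProcessFailure` are hypothetical names introduced here):

   ```scala
   import scala.collection.mutable

   // Simplified sketch of the epoch-gating idea: an executor failure is fully
   // processed at most once per epoch, so stray or delayed messages carrying
   // an older (or equal) epoch are ignored.
   object EpochGateSketch {
     // Latest epoch of a fully processed error, keyed by executor id.
     private val executorFailureEpoch = new mutable.HashMap[String, Long]

     def shouldProcessFailure(execId: String, currentEpoch: Long): Boolean = {
       if (!executorFailureEpoch.contains(execId) ||
           executorFailureEpoch(execId) < currentEpoch) {
         executorFailureEpoch(execId) = currentEpoch // record processed epoch
         true
       } else {
         false // already handled a failure for this executor at this epoch
       }
     }

     def main(args: Array[String]): Unit = {
       println(shouldProcessFailure("exec-1", 3)) // true: first failure seen
       println(shouldProcessFailure("exec-1", 3)) // false: stray, same epoch
       println(shouldProcessFailure("exec-1", 5)) // true: genuinely newer failure
     }
   }
   ```

   Splitting the tracking into `executorFailureEpoch` and a separate map for lost shuffle files, as the PR does, lets the scheduler apply this gate independently to "executor lost" handling and to "shuffle output lost" handling.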








[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

2020-06-26 Thread GitBox


wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446445843



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -170,13 +170,29 @@ private[spark] class DAGScheduler(
*/
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with
-  // every task. When we detect a node failing, we note the current epoch number and failed
-  // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results.
+  // Tracks the latest epoch of a fully processed error related to the given executor. (We use
+  // the MapOutputTracker's epoch number, which is sent with every task.)
   //
-  // TODO: Garbage collect information about failure epochs when we know there are no more
-  //   stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  // When an executor fails, it can affect the results of many tasks, and we have to deal with
+  // all of them consistently. We don't simply ignore all future results from that executor,
+  // as the failures may have been transient; but we also don't want to "overreact" to follow-
+  // on errors we receive. Furthermore, we might receive notification of a task success, after
+  // we find out the executor has actually failed; we'll assume those successes are, in fact,
+  // simply delayed notifications and the results have been lost, if they come from the same
+  // epoch. In particular, we use this to control when we tell the BlockManagerMaster that the
+  // BlockManager has been lost.
+  private val executorFailureEpoch = new HashMap[String, Long]
+  // Tracks the latest epoch of a fully processed error where shuffle files have been lost from
Review comment:
   Done.
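
   The epoch bookkeeping described in the diff comment above can be sketched as follows. This is a simplified, hypothetical model: `handleExecutorLost` and `shouldIgnore` are stand-ins for the real DAGScheduler logic, which also tracks shuffle-file loss in a separate map.

   ```scala
   import scala.collection.mutable

   // executorFailureEpoch and currentEpoch mirror the names in the diff; the
   // surrounding scheduler machinery is omitted to keep the sketch small.
   val executorFailureEpoch = mutable.HashMap.empty[String, Long]
   var currentEpoch: Long = 0L

   // On a fully processed executor failure: record the current epoch for that
   // executor, then bump the epoch so newly launched tasks carry a newer one.
   def handleExecutorLost(execId: String): Unit = {
     if (!executorFailureEpoch.get(execId).exists(_ >= currentEpoch)) {
       executorFailureEpoch(execId) = currentEpoch
       currentEpoch += 1
     }
   }

   // A (possibly delayed) task result from execId is ignored when its epoch is
   // not newer than the recorded failure epoch -- its output is assumed lost.
   def shouldIgnore(execId: String, taskEpoch: Long): Boolean =
     executorFailureEpoch.get(execId).exists(taskEpoch <= _)

   val preFailure = currentEpoch          // epoch stamped on an earlier task
   handleExecutorLost("exec-1")
   assert(shouldIgnore("exec-1", preFailure))    // stale success: treat as lost
   assert(!shouldIgnore("exec-1", currentEpoch)) // fresh task's result: keep
   println("ok")
   ```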





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [spark] maropu commented on a change in pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector

2020-06-26 Thread GitBox


maropu commented on a change in pull request #28863:
URL: https://github.com/apache/spark/pull/28863#discussion_r446438925



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/OracleConnectionProvider.scala
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.jdbc.connection
+
+import java.security.PrivilegedExceptionAction
+import java.sql.{Connection, Driver}
+import java.util.Properties
+
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
+
+private[sql] class OracleConnectionProvider(driver: Driver, options: JDBCOptions)
+    extends SecureConnectionProvider(driver, options) {
+  override val appEntry: String = "kprb5module"
+
+  override def getConnection(): Connection = {
+    setAuthenticationConfigIfNeeded()
+    UserGroupInformation.loginUserFromKeytabAndReturnUGI(options.principal, options.keytab).doAs(
+      new PrivilegedExceptionAction[Connection]() {
+        override def run(): Connection = {
+          OracleConnectionProvider.super.getConnection()
+        }
+      }
+    )
+  }
+
+  override def getAdditionalProperties(): Properties = {
+    val result = new Properties()
+    // This prop needed to turn on kerberos authentication in the JDBC driver. Please see:
+    // scalastyle:off line.size.limit
+    // https://docs.oracle.com/en/database/oracle/oracle-database/19/jajdb/oracle/jdbc/OracleConnection.html#CONNECTION_PROPERTY_THIN_NET_AUTHENTICATION_SERVICES
+    // scalastyle:on line.size.limit
+    result.put("oracle.net.authentication_services", "(KERBEROS5)");

Review comment:
   Thanks for the investigation! +1
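
   The login-and-run-as pattern in the `getConnection()` override above can be illustrated with a self-contained sketch. `FakeUgi` is a stand-in for Hadoop's `UserGroupInformation` (the real class performs an actual keytab login before running the action); only the `doAs` shape is the point here.

   ```scala
   import java.security.PrivilegedExceptionAction

   // FakeUgi is a hypothetical stub: the real UserGroupInformation authenticates
   // against the KDC before invoking the action; this stub only runs it.
   class FakeUgi(principal: String) {
     def doAs[T](action: PrivilegedExceptionAction[T]): T = action.run()
   }

   // Mirrors the structure of the override above: log in from the keytab, then
   // run the parent connection logic as the authenticated principal.
   def getConnection(principal: String, keytab: String): String = {
     // real code: UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
     val ugi = new FakeUgi(principal)
     ugi.doAs(new PrivilegedExceptionAction[String]() {
       override def run(): String = s"connection as $principal"
     })
   }

   println(getConnection("someuser@EXAMPLE.COM", "/etc/security/user.keytab"))
   // prints "connection as someuser@EXAMPLE.COM"
   ```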








[GitHub] [spark] dongjoon-hyun commented on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28919:
URL: https://github.com/apache/spark/pull/28919#issuecomment-650433094


   Retest this please.
   I guess we are waiting for @HyukjinKwon 's request https://github.com/apache/spark/pull/28919#issuecomment-649213796 .






[GitHub] [spark] dongjoon-hyun commented on pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28897:
URL: https://github.com/apache/spark/pull/28897#issuecomment-650428500


   Thank you so much, @holdenk ! Yes, we can discuss and improve it separately later.






[GitHub] [spark] holdenk commented on pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


holdenk commented on pull request #28897:
URL: https://github.com/apache/spark/pull/28897#issuecomment-650418721


   LGTM, we can continue the PyPI discussion separately.






[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-26 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r446422811



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment:
   I see that `withSQLConf` is overridden here https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala#L59-L68 and it works with `StaticSQLConf`: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala#L59-L68

   It also works if I implement something similar in my test, but it still feels hacky. Do you think we can go with this plan? In this case, the test in the `HiveMetadataCacheSuite` should be in its own suite / class.








[GitHub] [spark] dongjoon-hyun commented on pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28897:
URL: https://github.com/apache/spark/pull/28897#issuecomment-650408514


   Thank you so much, @dbtsai !






[GitHub] [spark] dbtsai commented on pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


dbtsai commented on pull request #28897:
URL: https://github.com/apache/spark/pull/28897#issuecomment-650403512


   +1 from me. Users still have the option to use Hadoop 2.7, so I feel it's safe.






[GitHub] [spark] dongjoon-hyun commented on pull request #28897: [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default

2020-06-26 Thread GitBox


dongjoon-hyun commented on pull request #28897:
URL: https://github.com/apache/spark/pull/28897#issuecomment-650401065


   Gentle ping once again.






[GitHub] [spark] mccheah commented on a change in pull request #28618: [SPARK-31801][API][SHUFFLE] Register map output metadata

2020-06-26 Thread GitBox


mccheah commented on a change in pull request #28618:
URL: https://github.com/apache/spark/pull/28618#discussion_r446401053



##
File path: core/src/main/scala/org/apache/spark/shuffle/MemoizingShuffleDataIO.scala
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.shuffle.api.{ShuffleDataIO, ShuffleDriverComponents, ShuffleExecutorComponents}
+
+/**
+ * Thin wrapper around {@link ShuffleDataIO} that ensures the given components are
+ * only initialized once and providing the same instance each time.
+ * 
+ * Used to ensure the SparkEnv only instantiates the given components once lazily
+ * and then reuses them throughout the lifetime of the SparkEnv.
+ */
+class MemoizingShuffleDataIO(delegate: ShuffleDataIO) {
+  private var _driver: ShuffleDriverComponents = _
+  private var _executor: ShuffleExecutorComponents = _
+
+  def getOrCreateDriverComponents(): ShuffleDriverComponents = synchronized {
+    if (_driver == null) {
+      _driver = delegate.initializeShuffleDriverComponents()
+    }
+    _driver
+  }
+
+  def getOrCreateExecutorComponents(
+      appId: String,
+      execId: String,
+      extraConfigs: Map[String, String]): ShuffleExecutorComponents = synchronized {
+    if (_executor == null) {
+      _executor = delegate.initializeShuffleExecutorComponents(appId, execId, extraConfigs.asJava)
+    }
+    _executor
+  }

Review comment:
   LGTM, I'll make those changes. Thanks!
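
   The memoization under review boils down to a call-twice-safe lazy initializer. A minimal sketch (the `Memoized` class below is hypothetical, not part of the PR); an explicit `synchronized` null-check is used in the PR rather than a plain `lazy val`, presumably because the executor components take runtime arguments (`appId`, `execId`, configs) on the first call:

   ```scala
   // Memoizes the result of create(): the factory runs at most once, and every
   // later call returns the same instance. synchronized guards the check-then-set.
   class Memoized[T](create: () => T) {
     private var instance: Option[T] = None

     def getOrCreate(): T = synchronized {
       if (instance.isEmpty) {
         instance = Some(create())
       }
       instance.get
     }
   }

   var initializations = 0
   val memo = new Memoized(() => { initializations += 1; "driver-components" })
   memo.getOrCreate()
   memo.getOrCreate()
   assert(initializations == 1) // created exactly once despite two calls
   println("ok")
   ```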







