[GitHub] [spark] LuciferYang commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


LuciferYang commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674156875


   @dongjoon-hyun The test "abort the job if total size of results is too large" in TaskSetManagerSuite failed, which is a little strange.






[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160674










[GitHub] [spark] LuciferYang commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


LuciferYang commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160116


   Manual test:
   ```
   mvn clean install -pl core -am -DwildcardSuites=org.apache.spark.scheduler.TaskSetManagerSuite -Dtest=none
   
   Run completed in 46 seconds, 135 milliseconds.
   Total number of tests run: 53
   Suites: completed 2, aborted 0
   Tests: succeeded 53, failed 0, canceled 0, ignored 0, pending 0
   ```
   






[GitHub] [spark] srowen commented on a change in pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


srowen commented on a change in pull request #29434:
URL: https://github.com/apache/spark/pull/29434#discussion_r470793047



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala
##
@@ -351,6 +351,18 @@ class StarJoinCostBasedReorderSuite extends PlanTest with 
StatsEstimationTestBas
 .join(t5.join(t6, Inner, Some(nameToAttr("t5_c2") === 
nameToAttr("t6_c2"))), Inner,
   Some(nameToAttr("d2_c2") === nameToAttr("t5_c1")))
 .select(outputsOf(d1, t3, t4, f1, d2, t5, t6, d3, t1, t2): _*)
+} else {

Review comment:
   Why does this differ? We'd rather not allow different user-visible behavior.

##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
##
@@ -314,7 +314,7 @@ trait Row extends Serializable {
*
* @throws ClassCastException when data type does not match.
*/
-  def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
+  def getSeq[T](i: Int): scala.collection.Seq[T] = getAs[scala.collection.Seq[T]](i)

Review comment:
   Can you explain this fix a bit more? I don't doubt it, but do we need to 
promise a `scala.collection.Seq`?
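
   For background, a minimal sketch of the Scala 2.12 vs 2.13 difference behind this question (toy values, not from the PR):
   ```scala
   import scala.collection.mutable.ArrayBuffer

   // Scala 2.12: the default `Seq` is an alias for scala.collection.Seq.
   // Scala 2.13: the default `Seq` is scala.collection.immutable.Seq.
   // A mutable buffer is a collection.Seq but not an immutable Seq, so a
   // signature returning plain `Seq` changes meaning between versions.
   val loose: scala.collection.Seq[Int] = ArrayBuffer(1, 2, 3) // compiles on 2.12 and 2.13
   // val strict: Seq[Int] = ArrayBuffer(1, 2, 3)              // fails to compile on 2.13
   ```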








[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


SparkQA commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674229394


   **[Test build #127464 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127464/testReport)**
 for PR 29434 at commit 
[`d71064e`](https://github.com/apache/spark/commit/d71064e50fe39454e5720c59d2bfdc5c9079831c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160102


   **[Test build #127464 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127464/testReport)**
 for PR 29434 at commit 
[`d71064e`](https://github.com/apache/spark/commit/d71064e50fe39454e5720c59d2bfdc5c9079831c).






[GitHub] [spark] AmplabJenkins commented on pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29435:
URL: https://github.com/apache/spark/pull/29435#issuecomment-674243886


   Can one of the admins verify this patch?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160674










[GitHub] [spark] holdenk commented on pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


holdenk commented on pull request #29422:
URL: https://github.com/apache/spark/pull/29422#issuecomment-674186603


   Thank you for taking the time to resolve this and make such a clear write-up of the root cause. A question about production (rather than test) behavior: if the executor exits, we also want to eagerly clean up everything and resubmit, right?






[GitHub] [spark] holdenk commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


holdenk commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470765386



##
File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -323,6 +326,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   // move forward.
   lastTaskRunningTime = System.nanoTime()
 }
+Thread.sleep(sleep_time)

Review comment:
   This was moved so the initial sleep time didn't have `sleep_time` added to it on the first pass through, right? Nothing else?
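
   In loop-shape terms, a hypothetical simplification of what is being asked (not the actual `CoarseGrainedExecutorBackend` code):
   ```scala
   // With the sleep at the end of the loop body, the first progress check
   // waits only initialSleepMillis instead of initialSleepMillis + sleepTime.
   val sleepTime = 1000        // per-iteration wait, 1s
   val initialSleepMillis = 10 // configurable initial wait, mainly for tests
   Thread.sleep(initialSleepMillis)
   var done = false
   while (!done) {
     // ... check whether tasks are still running; set done when none remain ...
     done = true // placeholder so this sketch terminates
     Thread.sleep(sleepTime)
   }
   ```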

##
File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -294,10 +294,13 @@ private[spark] class CoarseGrainedExecutorBackend(
 override def run(): Unit = {
   var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
-
+  val initialSleepMillis = env.conf.getInt(

Review comment:
   Maybe just add a comment here that this is for testing only.

##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2012,7 +2013,8 @@ private[spark] class DAGScheduler(
   execId: String,
   fileLost: Boolean,
   hostToUnregisterOutputs: Option[String],
-  maybeEpoch: Option[Long] = None): Unit = {
+  maybeEpoch: Option[Long] = None,
+  ignoreShuffleVersion: Boolean = false): Unit = {

Review comment:
   Please add this to the Javadoc. Also, I'm not completely sure about the name of the variable.

##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -1027,7 +1036,15 @@ private[spark] class TaskSchedulerImpl(
   }
 }
 
-executorsPendingDecommission -= executorId
+
+val decomInfo = executorsPendingDecommission.get(executorId)
+if (decomInfo.isDefined) {
+  val rememberSeconds =
+conf.getInt("spark.decommissioningRememberAfterRemoval.seconds", 60)
+  val gcSecond = TimeUnit.MILLISECONDS.toSeconds(clock.getTimeMillis()) + 
rememberSeconds
+  decommissioningExecutorsToGc.computeIfAbsent(gcSecond, _ => 
mutable.ArrayBuffer.empty) +=
+executorId
+}

Review comment:
   Seems like repeated logic.
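
   A sketch of one way to factor it out (hypothetical helper name; `conf`, `clock`, and `decommissioningExecutorsToGc` are the members shown in the diff above):
   ```scala
   import java.util.concurrent.TimeUnit
   import scala.collection.mutable

   // Hypothetical helper consolidating the repeated scheduling logic.
   private def scheduleDecommissionCleanup(executorId: String): Unit = {
     val rememberSeconds =
       conf.getInt("spark.decommissioningRememberAfterRemoval.seconds", 60)
     val gcSecond = TimeUnit.MILLISECONDS.toSeconds(clock.getTimeMillis()) + rememberSeconds
     decommissioningExecutorsToGc.computeIfAbsent(gcSecond, _ => mutable.ArrayBuffer.empty) +=
       executorId
   }
   ```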

##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2022,16 +2024,25 @@ private[spark] class DAGScheduler(
   blockManagerMaster.removeExecutor(execId)
   clearCacheLocs()
 }
-if (fileLost &&
-(!shuffleFileLostEpoch.contains(execId) || 
shuffleFileLostEpoch(execId) < currentEpoch)) {
-  shuffleFileLostEpoch(execId) = currentEpoch
-  hostToUnregisterOutputs match {
-case Some(host) =>
-  logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
-  mapOutputTracker.removeOutputsOnHost(host)
-case None =>
-  logInfo(s"Shuffle files lost for executor: $execId (epoch 
$currentEpoch)")
-  mapOutputTracker.removeOutputsOnExecutor(execId)
+if (fileLost) {

Review comment:
   Can we have a comment here clarifying the reasoning behind this logic?

##
File path: 
core/src/test/scala/org/apache/spark/deploy/DecommissionWorkerSuite.scala
##
@@ -212,22 +226,27 @@ class DecommissionWorkerSuite
   override def handleRootTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
 val taskInfo = taskEnd.taskInfo
 if (taskInfo.executorId == executorToDecom && taskInfo.attemptNumber 
== 0 &&
-  taskEnd.stageAttemptId == 0) {
+  taskEnd.stageAttemptId == 0 && taskEnd.stageId == 0) {
   decommissionWorkerOnMaster(workerToDecom,
 "decommission worker after task on it is done")
 }
   }
 }
-TestUtils.withListener(sc, listener) { _ =>
+withListener(sc, listener) { _ =>
   val jobResult = sc.parallelize(1 to 2, 2).mapPartitionsWithIndex((_, _) 
=> {
 val executorId = SparkEnv.get.executorId
-val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
-Thread.sleep(sleepTimeSeconds * 1000L)
+val context = TaskContext.get()
+if (context.attemptNumber() == 0 && context.stageAttemptNumber() == 0) 
{
+  val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
+  Thread.sleep(sleepTimeSeconds * 1000L)
+}

Review comment:
   I assume this is for speedup, right?

##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -136,7 +137,9 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]
 
-  private val executorsPendingDecommission = new HashMap[String, 
ExecutorDecommissionInfo]
+  val executorsPendingDecommission = new HashMap[String, 
ExecutorDecommissionInfo]
+  // map of second to list of executors to 

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674201907










[GitHub] [spark] AmplabJenkins commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674201907










[GitHub] [spark] AmplabJenkins commented on pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29435:
URL: https://github.com/apache/spark/pull/29435#issuecomment-674241724


   Can one of the admins verify this patch?






[GitHub] [spark] viirya commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470728557



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScan.scala
##
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.connector.read.V1Scan
+import org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation
+import org.apache.spark.sql.sources.{BaseRelation, Filter, TableScan}
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScan(
+relation: JDBCRelation,
+prunedSchema: StructType,
+pushedFilters: Array[Filter]) extends V1Scan {

Review comment:
   Do we have other choices than `V1Scan`? Based on its description,
   
   > This interface is designed to provide Spark DataSources time to migrate to 
DataSource V2 and will be removed in a future Spark release.
   
   It seems `V1Scan` is only meant for temporary migration.
   
   








[GitHub] [spark] viirya commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470732952



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, 
SupportsPushDownFilters, SupportsPushDownRequiredColumns}
+import org.apache.spark.sql.execution.datasources.PartitioningUtils
+import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JDBCRDD, 
JDBCRelation}
+import org.apache.spark.sql.jdbc.JdbcDialects
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScanBuilder(
+session: SparkSession,
+schema: StructType,
+jdbcOptions: JDBCOptions)
+  extends ScanBuilder with SupportsPushDownFilters with 
SupportsPushDownRequiredColumns {
+
+  private val isCaseSensitive = session.sessionState.conf.caseSensitiveAnalysis
+
+  private var pushedFilter = Array.empty[Filter]
+
+  private var prunedSchema = schema
+
+  override def pushFilters(filters: Array[Filter]): Array[Filter] = {

Review comment:
   Does JDBC support nested column pruning? The V2 interface is supposed to support it. If it doesn't, maybe we need to filter out nested column predicates here. Does `JDBCRDD.compileFilter` filter them out?
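
   If nested predicates turn out not to be handled, a hedged sketch of pre-filtering them (assuming a dotted reference means a nested column; not the actual JDBC code):
   ```scala
   import org.apache.spark.sql.sources.Filter

   // Drop predicates whose referenced column names look nested, leaving
   // them for Spark to evaluate after the scan.
   def excludeNestedPredicates(filters: Array[Filter]): Array[Filter] =
     filters.filterNot(_.references.exists(_.contains(".")))
   ```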








[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


SparkQA commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160102


   **[Test build #127464 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127464/testReport)**
 for PR 29434 at commit 
[`d71064e`](https://github.com/apache/spark/commit/d71064e50fe39454e5720c59d2bfdc5c9079831c).






[GitHub] [spark] mingjialiu commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674230500


   > This is against 2.4. Could you also check whether the master branch still 
has such an issue?
   
   I cannot reproduce the issue on the master branch (3.0) in a unit test.
   Besides, the filter comparison is included in `BatchScanExec.equals`, which does `this.batch == other.batch`.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674230204










[GitHub] [spark] viirya commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470732059



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, 
SupportsPushDownFilters, SupportsPushDownRequiredColumns}
+import org.apache.spark.sql.execution.datasources.PartitioningUtils
+import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JDBCRDD, 
JDBCRelation}
+import org.apache.spark.sql.jdbc.JdbcDialects
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScanBuilder(
+session: SparkSession,
+schema: StructType,
+jdbcOptions: JDBCOptions)
+  extends ScanBuilder with SupportsPushDownFilters with 
SupportsPushDownRequiredColumns {
+
+  private val isCaseSensitive = session.sessionState.conf.caseSensitiveAnalysis
+
+  private var pushedFilter = Array.empty[Filter]
+
+  private var prunedSchema = schema
+
+  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
+if (jdbcOptions.pushDownPredicate) {
+  val dialect = JdbcDialects.get(jdbcOptions.url)
+  val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, 
dialect).isDefined)
+  this.pushedFilter = pushed
+  unSupported

Review comment:
   To be safer, we should return the original filters, like the ORC and Parquet data sources do.
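
   A sketch of the suggested behavior against the diff above (keep tracking what was pushed, but hand every original filter back to Spark):
   ```scala
   override def pushFilters(filters: Array[Filter]): Array[Filter] = {
     if (jdbcOptions.pushDownPredicate) {
       val dialect = JdbcDialects.get(jdbcOptions.url)
       val (pushed, _) = filters.partition(JDBCRDD.compileFilter(_, dialect).isDefined)
       this.pushedFilter = pushed
     }
     // Return all original filters so Spark still evaluates every predicate
     // after the scan, mirroring the ORC/Parquet data sources.
     filters
   }
   ```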








[GitHub] [spark] SparkQA removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674085136


   **[Test build #127460 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127460/testReport)**
 for PR 29431 at commit 
[`98c8be3`](https://github.com/apache/spark/commit/98c8be31530d7d146746869acf4cadf0c5d495ee).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674210880










[GitHub] [spark] SparkQA removed a comment on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674101424


   **[Test build #127462 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127462/testReport)**
 for PR 29428 at commit 
[`0a6c574`](https://github.com/apache/spark/commit/0a6c5743a8808b55f399e3298116a0e92bd72d0d).






[GitHub] [spark] mingjialiu opened a new pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu opened a new pull request #29435:
URL: https://github.com/apache/spark/pull/29435


   
   
   ### What changes were proposed in this pull request?
   Copy to the master branch the unit test added for branch-2.4 (https://github.com/apache/spark/pull/29430).
   
   
   
   
   ### Why are the changes needed?
   The unit test passes on the master branch, indicating that the issue reported in https://issues.apache.org/jira/browse/SPARK-32609 is already fixed there. The test is added anyway to catch possible future regressions.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   no.
   
   
   
   ### How was this patch tested?
   Ran the unit test with sbt.
   
   






[GitHub] [spark] dongjoon-hyun commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


dongjoon-hyun commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674157678


   ok to test






[GitHub] [spark] viirya commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470736186



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCWriteBuilder.scala
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.connector.write._
+import org.apache.spark.sql.execution.datasources.jdbc.{JdbcOptionsInWrite, 
JdbcUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.InsertableRelation
+import org.apache.spark.sql.types.StructType
+
+case class JDBCWriteBuilder(schema: StructType, options: JdbcOptionsInWrite) 
extends V1WriteBuilder
+  with SupportsTruncate {
+
+  private var isTruncate = false
+
+  override def truncate(): WriteBuilder = {
+isTruncate = true
+this
+  }
+
+  override def buildForV1Write(): InsertableRelation = new InsertableRelation {
+override def insert(data: DataFrame, overwrite: Boolean): Unit = {

Review comment:
   `overwrite` should be true when `isTruncate` is true, right? Can you add an assert here?
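
   A sketch of the suggested guard (hedged; the exact message is up to the author):
   ```scala
   override def insert(data: DataFrame, overwrite: Boolean): Unit = {
     // truncate() only makes sense for an overwrite, so fail fast otherwise.
     assert(!isTruncate || overwrite, "truncate requested without overwrite mode")
     // ... the existing JDBC write path would follow here ...
   }
   ```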








[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470753712



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, 
buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {

Review comment:
   @cloud-fan - I think with `lazy val` I am addressing [the comment from @viirya](https://github.com/apache/spark/pull/29342#discussion_r467252553): the stream-side NULL row is the same for every row, so we only need to set it once rather than per row. With a plain `val`, `joinRow.withRight(streamNullRow)` would be eagerly evaluated here, which is not right, as `joinRow` is reused later.
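
   A toy illustration of the `lazy val` point (hypothetical stand-in class, not Spark's `JoinedRow`):
   ```scala
   // The join row is a mutable holder reused for every input row; capturing the
   // NULL side eagerly would be clobbered by that reuse, while `lazy val`
   // installs it exactly once, at first use.
   class ToyJoinedRow {
     private var right: Any = _
     def withRight(r: Any): this.type = { right = r; this }
     override def toString: String = s"right=$right"
   }

   val joinRow = new ToyJoinedRow
   lazy val nullJoinRow = joinRow.withRight("streamNull") // deferred, runs once
   joinRow.withRight("row1")  // per-row reuse during the main join phase
   println(nullJoinRow)       // withRight("streamNull") only happens now
   ```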








[GitHub] [spark] SparkQA removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674088406


   **[Test build #127461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127461/testReport)**
 for PR 29431 at commit 
[`09ef17b`](https://github.com/apache/spark/commit/09ef17b914060fe107d61bfd2af1ee69003a6fee).






[GitHub] [spark] AmplabJenkins commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674210880










[GitHub] [spark] AmplabJenkins commented on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674220724










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674220724










[GitHub] [spark] emkornfield commented on a change in pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


emkornfield commented on a change in pull request #29430:
URL: https://github.com/apache/spark/pull/29430#discussion_r470845244



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##
@@ -371,6 +371,25 @@ class DataSourceV2Suite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("SPARK-32609: DataSourceV2 with different pushedfilters should be 
different") {
+def getScanExec(query: DataFrame): DataSourceV2ScanExec = {
+  query.queryExecution.executedPlan.collect {
+case d: DataSourceV2ScanExec => d
+  }.head
+}
+
+Seq(classOf[AdvancedDataSourceV2], 
classOf[JavaAdvancedDataSourceV2]).foreach { cls =>
+  withClue(cls.getName) {
+val df = spark.read.format(cls.getName).load()
+val q1 = df.select('i).filter('i > 6)
+val q2 = df.select('i).filter('i > 5)

Review comment:
   I think it might also be good to verify that two DataFrames with the same filter compare as equal (i.e., we don't break exchange reuse).
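
   A sketch of that extra check, reusing `df` and `getScanExec` from the diff above (hedged; whether the scans compare equal depends on `DataSourceV2ScanExec`'s equality):
   ```scala
   // Same filter twice: the resulting scans should compare equal so the
   // exchange-reuse optimization can still kick in.
   val q3 = df.select('i).filter('i > 6)
   val q4 = df.select('i).filter('i > 6)
   assert(getScanExec(q3) == getScanExec(q4))
   ```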








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674121960


   Can one of the admins verify this patch?






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29427: [SPARK-25557][SQL][TEST][Followup] Add case-sensitivity test for ORC predicate pushdown

2020-08-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #29427:
URL: https://github.com/apache/spark/pull/29427#discussion_r470731273



##
File path: 
sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
##
@@ -513,5 +513,98 @@ class OrcFilterSuite extends OrcTest with 
SharedSparkSession {
   ).get.toString
 }
   }
+
+  test("SPARK-25557: case sensitivity in predicate pushdown") {
+withTempPath { dir =>
+  val count = 10
+  val tableName = "spark_25557"
+  val tableDir1 = dir.getAbsoluteFile + "/table1"
+
+  // Physical ORC files have both `A` and `a` fields.
+  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+spark.range(count).repartition(count).selectExpr("id - 1 as A", "id as 
a")
+  .write.mode("overwrite").orc(tableDir1)
+  }
+
+  // Metastore table has both `A` and `a` fields too.
+  withTable(tableName) {
+withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+  sql(
+s"""
+   |CREATE TABLE $tableName (A LONG, a LONG) USING ORC LOCATION 
'$tableDir1'
+ """.stripMargin)
+
+  checkAnswer(sql(s"select a, A from $tableName"), (0 until 
count).map(c => Row(c, c - 1)))
+
+  val actual1 = stripSparkFilter(sql(s"select A from $tableName where 
A < 0"))
+  assert(actual1.count() == 1)
+
+  val actual2 = stripSparkFilter(sql(s"select A from $tableName where 
a < 0"))
+  assert(actual2.count() == 0)
+}
+
+// Exception thrown for ambiguous case.
+withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
+  val e = intercept[AnalysisException] {
+sql(s"select a from $tableName where a < 0").collect()
+  }
+  assert(e.getMessage.contains(
+"Reference 'a' is ambiguous"))
+}
+  }
+
+  // Metastore table has only `A` field.
+  withTable(tableName) {
+withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
+  sql(
+s"""
+   |CREATE TABLE $tableName (A LONG) USING ORC LOCATION 
'$tableDir1'
+ """.stripMargin)
+
+  val e = intercept[SparkException] {
+sql(s"select A from $tableName where A < 0").collect()
+  }
+  assert(e.getCause.isInstanceOf[RuntimeException] && 
e.getCause.getMessage.contains(
+"""Found duplicate field(s) "A": [A, a] in case-insensitive 
mode"""))
+}
+  }
+
+  // Physical ORC files have only `A` field.
+  val tableDir2 = dir.getAbsoluteFile + "/table2"
+  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+spark.range(count).repartition(count).selectExpr("id - 1 as A")
+  .write.mode("overwrite").orc(tableDir2)
+  }
+
+  withTable(tableName) {
+withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
+  sql(
+s"""
+   |CREATE TABLE $tableName (a LONG) USING ORC LOCATION 
'$tableDir2'
+ """.stripMargin)
+
+  checkAnswer(sql(s"select a from $tableName"), (0 until count).map(c 
=> Row(c - 1)))
+
+  val actual = stripSparkFilter(sql(s"select a from $tableName where a 
< 0"))
+  // TODO: ORC predicate pushdown should work under case-insensitive 
analysis.
+  // assert(actual.count() == 1)

Review comment:
   Yes, please~








[GitHub] [spark] SparkQA commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674201047


   **[Test build #127460 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127460/testReport)**
 for PR 29431 at commit 
[`98c8be3`](https://github.com/apache/spark/commit/98c8be31530d7d146746869acf4cadf0c5d495ee).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674210062


   **[Test build #127461 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127461/testReport)**
 for PR 29431 at commit 
[`09ef17b`](https://github.com/apache/spark/commit/09ef17b914060fe107d61bfd2af1ee69003a6fee).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


SparkQA commented on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674217964


   **[Test build #127462 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127462/testReport)**
 for PR 29428 at commit 
[`0a6c574`](https://github.com/apache/spark/commit/0a6c5743a8808b55f399e3298116a0e92bd72d0d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674230204










[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674238955


   **[Test build #127463 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127463/testReport)**
 for PR 29396 at commit 
[`a80c544`](https://github.com/apache/spark/commit/a80c5441a01c8c516f4fa3288dfddf090ae2b060).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674125075


   **[Test build #127463 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127463/testReport)**
 for PR 29396 at commit 
[`a80c544`](https://github.com/apache/spark/commit/a80c5441a01c8c516f4fa3288dfddf090ae2b060).






[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674239618










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674239618










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29435:
URL: https://github.com/apache/spark/pull/29435#issuecomment-674241724


   Can one of the admins verify this patch?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29378: [SPARK-30069][CORE][YARN] Clean up non-shuffle disk block manager files following executor exists on YARN

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29378:
URL: https://github.com/apache/spark/pull/29378#issuecomment-673917333










[GitHub] [spark] leanken opened a new pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementEx…

2020-08-14 Thread GitBox


leanken opened a new pull request #29431:
URL: https://github.com/apache/spark/pull/29431


   ### What changes were proposed in this pull request?
   Found a java.util.NoSuchElementException in the unit-test log of AdaptiveQueryExecSuite. During AQE, when a sub-plan changes, LiveExecutionData uses the new sub-plan's SQLMetrics to override the old ones, which later causes a NoSuchElementException in the final aggregateMetrics. Instead, the SQLMetrics should be upserted when a new sub-plan arrives.
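
   A toy sketch of the override-vs-upsert distinction (hypothetical names, not the actual listener fields):
   ```scala
   import scala.collection.mutable

   // accumulatorId -> metric type, accumulated across AQE sub-plans
   val liveMetrics = mutable.Map[Long, String]()

   def onSubPlanChanged(newMetrics: Map[Long, String]): Unit = {
     // Override (buggy): liveMetrics.clear(); liveMetrics ++= newMetrics
     // would drop metrics from earlier sub-plans and break aggregateMetrics.
     // Upsert (fix): keep existing entries and add/update the new ones.
     liveMetrics ++= newMetrics
   }
   ```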
   
   ### Why are the changes needed?
   SQL Metrics is not correct for some AQE cases.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Ran AdaptiveQueryExecSuite and verified there is no "java.util.NoSuchElementException".






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29428:
URL: https://github.com/apache/spark/pull/29428#discussion_r470449703



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala
##
@@ -330,4 +331,44 @@ class SparkSqlParserSuite extends AnalysisTest {
 assertEqual("ADD FILE /path with space/abc.txt", AddFileCommand("/path 
with space/abc.txt"))
 assertEqual("ADD JAR /path with space/abc.jar", AddJarCommand("/path with 
space/abc.jar"))
   }
+
+  test("SPARK-32608: script transform with row format delimit") {
+assertEqual(

Review comment:
   > Could you add end-2-end tests, too?
   
   Added in `BasicScriptTransformationExecSuite`








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-673947748










[GitHub] [spark] AmplabJenkins commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-673947748










[GitHub] [spark] SparkQA commented on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


SparkQA commented on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-673947714


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32066/
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-673956114


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/32066/
   Test FAILed.






[GitHub] [spark] sarutak commented on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


sarutak commented on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-673964388


   retest this please.






[GitHub] [spark] SparkQA commented on pull request #26319: [SPARK-29594][SQL] Provide better error message when creating a Dataset from a Sequence of Case class where a field name started with a number

2020-08-14 Thread GitBox


SparkQA commented on pull request #26319:
URL: https://github.com/apache/spark/pull/26319#issuecomment-673991562


   **[Test build #127452 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127452/testReport)**
 for PR 26319 at commit 
[`bf06d0c`](https://github.com/apache/spark/commit/bf06d0ce5131a9afa0e7db8024330dcada71d4e1).






[GitHub] [spark] ulysses-you commented on pull request #29423: [SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl

2020-08-14 Thread GitBox


ulysses-you commented on pull request #29423:
URL: https://github.com/apache/spark/pull/29423#issuecomment-673991291


   thanks for merging!






[GitHub] [spark] cloud-fan commented on a change in pull request #28490: [SPARK-31670][SQL]Resolve Struct Field in Grouping Aggregate with same ExprId

2020-08-14 Thread GitBox


cloud-fan commented on a change in pull request #28490:
URL: https://github.com/apache/spark/pull/28490#discussion_r470522985



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##
@@ -1479,6 +1479,33 @@ class Analyzer(
   // Skip the having clause here, this will be handled in 
ResolveAggregateFunctions.
   case h: UnresolvedHaving => h
 
+  case agg @ (_: Aggregate | _: GroupingSets) =>
+val resolved = agg.mapExpressions(resolveExpressionTopDown(_, agg))
+val hasStructField = resolved.expressions.exists {
+  _.collectFirst { case gsf: GetStructField => gsf }.isDefined
+}
+if (hasStructField) {
+  // For struct field, it will be resolve as Alias(GetStructField, 
name),
+  // In Aggregate/GroupingSets this behavior will cause same struct 
field
+  // in aggExprs/groupExprs/selectedGroupByExprs will be resolved 
divided
+  // with different ExprId of Alias and replace failed when construct
+  // Aggregate in ResolveGroupingAnalytics, so we resolve duplicated 
struct

Review comment:
   ```
   ... will cause the same struct fields in aggExprs/groupExprs/selectedGroupByExprs
   to be treated as different ones due to different ExprIds in Alias, and stops us
   from finding the grouping expressions in aggExprs. Here we resolve ...
   ```
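
   For context, a hypothetical query shape that hits this path (names are 
invented for illustration and are not taken from the PR):
   
   ```scala
   import spark.implicits._
   
   // Hypothetical repro shape: the same struct field `a.b` appears in the
   // select list and in the grouping sets. Each resolution used to mint an
   // Alias(GetStructField, "b") with a fresh ExprId, so the grouping
   // expression could no longer be matched inside the aggregate expressions.
   case class Inner(b: Int)
   case class Outer(a: Inner, v: Long)
   Seq(Outer(Inner(1), 10L), Outer(Inner(1), 5L)).toDS().createOrReplaceTempView("t")
   spark.sql("SELECT a.b, sum(v) FROM t GROUP BY a.b GROUPING SETS ((a.b))").show()
   ```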








[GitHub] [spark] maropu commented on pull request #29432: [SPARK-32616][SQL] Window operators should be added determinedly

2020-08-14 Thread GitBox


maropu commented on pull request #29432:
URL: https://github.com/apache/spark/pull/29432#issuecomment-673995926


   oh.. good catch! LGTM, pending Jenkins.






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


AngersZhuuuu commented on a change in pull request #29428:
URL: https://github.com/apache/spark/pull/29428#discussion_r470529869



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala
##
@@ -311,6 +311,77 @@ abstract class BaseScriptTransformationSuite extends 
SparkPlanTest with SQLTestU
   }
 }
   }
+
+  test("SPARK-32608: Script Transform ROW FORMAT DELIMIT value should format 
value") {
+withTempView("v") {
+  val df = Seq(
+(1, "1", 1.0, BigDecimal(1.0), new Timestamp(1)),
+(2, "2", 2.0, BigDecimal(2.0), new Timestamp(2)),
+(3, "3", 3.0, BigDecimal(3.0), new Timestamp(3))
+  ).toDF("a", "b", "c", "d", "e") // Note column d's data type is 
Decimal(38, 18)
+  df.createTempView("v")
+
+  // input/output same delimit
+  val query1 = sql(

Review comment:
   > 369
   
   Yea and extract `decimalToString`
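
   A sketch of the helper shape implied here (the name `decimalToString` comes 
from the thread; the body below is only a guess at the intent):
   
   ```scala
   // Hypothetical: render a Decimal(38, 18) value for delimited script
   // transform output without scientific notation, so expected and actual
   // rows compare as plain strings.
   def decimalToString(d: java.math.BigDecimal): String = d.toPlainString
   ```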








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-674021539


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127449/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-674021532


   Merged build finished. Test FAILed.






[GitHub] [spark] viirya commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470845507



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends 
KnownSizeEstimation {
 throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]
+
+  /**
+   * Returns key index and matched single row.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex

Review comment:
   Is this for unique key case only?

##
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##
@@ -428,6 +428,68 @@ public MapIterator destructiveIterator() {
 return new MapIterator(numValues, new Location(), true);
   }
 
+  /**
+   * Iterator for the entries of this map. This is to first iterate over key 
index array
+   * `longArray` then accessing values in `dataPages`. NOTE: this is different 
from `MapIterator`
+   * in the sense that key index is preserved here
+   * (See `UnsafeHashedRelation` for example of usage).
+   */
+  public final class MapIteratorWithKeyIndex implements Iterator<Location> {

Review comment:
   Looks like `keyIndex` is not exposed outside this map iterator? Then maybe 
call it `MapIteratorPreserveKeyIndex`?








[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470855625



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends 
KnownSizeEstimation {
 throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]
+
+  /**
+   * Returns key index and matched single row.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex

Review comment:
   @viirya - yes. Similar to the definition of `def getValue(key: InternalRow): 
InternalRow` above. I added a comment saying `Returns key index and matched 
single row.` to be consistent with the comment on `getValue`. Hope this is clear 
enough.
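
   For orientation, a condensed sketch of how these lookups are consumed 
downstream (it mirrors the full outer join code quoted later in this thread; 
`HashedRelation` is internal to Spark, so this is illustrative only):
   
   ```scala
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.execution.joins.HashedRelation
   import org.apache.spark.util.collection.BitSet
   
   // Look up stream-side keys together with their key index, record matches
   // in a BitSet, then emit only the build rows that were never matched.
   def unmatchedBuildRows(
       relation: HashedRelation,
       streamKeys: Iterator[InternalRow]): Iterator[InternalRow] = {
     val matchedKeys = new BitSet(relation.maxNumKeysIndex)
     streamKeys.foreach { key =>
       val matched = relation.getValueWithKeyIndex(key) // null when no match
       if (matched != null) matchedKeys.set(matched.getKeyIndex)
     }
     relation.valuesWithKeyIndex()
       .filterNot(v => matchedKeys.get(v.getKeyIndex))
       .map(_.getValue)
   }
   ```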








[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470883805



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, 
buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {
+  buildSide match {
+case BuildLeft =>
+  joinRow.withRight(streamNullRow)
+  joinRow.withLeft _
+case BuildRight =>
+  joinRow.withLeft(streamNullRow)
+  joinRow.withRight _
+  }
+}
+
+val iter = if (hashedRelation.keyIsUnique) {
+  fullOuterJoinWithUniqueKey(streamIter, hashedRelation, joinKeys, 
joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, 
streamNullRow)
+} else {
+  fullOuterJoinWithNonUniqueKey(streamIter, hashedRelation, joinKeys, 
joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, 
streamNullRow)
 }
+
+val resultProj = UnsafeProjection.create(output, output)
+iter.map { r =>
+  numOutputRows += 1
+  resultProj(r)
+}
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *Mark the matched rows from the build side as looked up.
+   *A `BitSet` is used to track matched rows with key index.
+   * 2. Process rows from build side by iterating hash relation.
+   *Filter out rows from build side being matched already,
+   *by checking key index from `BitSet`.
+   */
+  private def fullOuterJoinWithUniqueKey(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  joinKeys: UnsafeProjection,
+  joinRowWithStream: InternalRow => JoinedRow,
+  joinRowWithBuild: InternalRow => JoinedRow,
+  streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
+  buildNullRow: GenericInternalRow,
+  streamNullRow: GenericInternalRow): Iterator[InternalRow] = {
+val matchedKeys = new BitSet(hashedRelation.maxNumKeysIndex)
+
+// Process stream side with looking up hash relation
+val streamResultIter = streamIter.map { srow =>
+  joinRowWithStream(srow)
+  val keys = joinKeys(srow)
+  if (keys.anyNull) {
+joinRowWithBuild(buildNullRow)
+  } else {
+val matched = hashedRelation.getValueWithKeyIndex(keys)
+if (matched != null) {
+  val keyIndex = matched.getKeyIndex
+  val buildRow = matched.getValue
+  val joinRow = joinRowWithBuild(buildRow)
+  if (boundCondition(joinRow)) {
+matchedKeys.set(keyIndex)
+joinRow
+  } else {
+joinRowWithBuild(buildNullRow)
+  }
+} else {
+  joinRowWithBuild(buildNullRow)
+}
+  }
+}
+
+// Process build side with filtering out the matched rows
+val buildResultIter = hashedRelation.valuesWithKeyIndex().flatMap {
+  valueRowWithKeyIndex =>
+val keyIndex = valueRowWithKeyIndex.getKeyIndex
+val isMatched = matchedKeys.get(keyIndex)
+if (!isMatched) {
+  val buildRow = valueRowWithKeyIndex.getValue
+  Some(streamNullJoinRowWithBuild(buildRow))
+} else {
+  None
+}
+}
+
+streamResultIter ++ buildResultIter
+  }
+
+  /**
+   * Full outer shuffled hash join with non-unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *Mark the matched rows from the build side as looked up.
+   *A `HashSet[Long]` is used to track matched rows with
+   *key index (Int) and value index (Int) together.
+   * 2. Process rows from build side by iterating hash relation.
+   *Filter out rows from build side being matched already,
+   *by checking key index and value index from `HashSet`.
+   *
+   * The "value index" is defined as the index of 

[GitHub] [spark] agrawaldevesh commented on pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on pull request #29422:
URL: https://github.com/apache/spark/pull/29422#issuecomment-674294907


   > Thank you for taking the time to resolve this and make such a clear 
writeup of the root cause. From an in-production not-in-test question: if the 
executor exits we also want to eagerly clean up everything and resubmit right?
   
   Yes for sure. That will happen on its own. I haven't really changed that 
behavior. I have only changed the way fetch failures are handled (stemming from 
a decommissioned host). And the way they lead to a rerun is that 
`org.apache.spark.scheduler.DAGScheduler#resubmitFailedStages` gets invoked on 
a fetch failure asynchronously. The driver will then figure out what stages are 
missing map outputs and rerun them in topological order. 
   
   When an executor exits, it will normally clean up just its shuffle data (it 
does not know that its peer executors on the same host will soon be dying as 
well). It's the incrementing of the `shuffleFileLostEpoch` as part of this 
cleanup that prevents future cleanups when a fetch failure is observed. 






[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470888548



##
File path: 
core/src/test/scala/org/apache/spark/deploy/DecommissionWorkerSuite.scala
##
@@ -212,22 +226,27 @@ class DecommissionWorkerSuite
   override def handleRootTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
 val taskInfo = taskEnd.taskInfo
 if (taskInfo.executorId == executorToDecom && taskInfo.attemptNumber 
== 0 &&
-  taskEnd.stageAttemptId == 0) {
+  taskEnd.stageAttemptId == 0 && taskEnd.stageId == 0) {
   decommissionWorkerOnMaster(workerToDecom,
 "decommission worker after task on it is done")
 }
   }
 }
-TestUtils.withListener(sc, listener) { _ =>
+withListener(sc, listener) { _ =>
   val jobResult = sc.parallelize(1 to 2, 2).mapPartitionsWithIndex((_, _) 
=> {
 val executorId = SparkEnv.get.executorId
-val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
-Thread.sleep(sleepTimeSeconds * 1000L)
+val context = TaskContext.get()
+if (context.attemptNumber() == 0 && context.stageAttemptNumber() == 0) 
{
+  val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
+  Thread.sleep(sleepTimeSeconds * 1000L)
+}

Review comment:
   Exactly. Got tired of waiting for the test to run and trying to cut 
slack.








[GitHub] [spark] rohitmishr1484 commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674294866


   @HyukjinKwon,
   
   Thanks for your helpful comment. I have made the suggested changes, but if 
you still find something that requires modification, please let me know and I 
will update it. 
   
   Thanks!!!






[GitHub] [spark] AmplabJenkins commented on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674325021










[GitHub] [spark] SparkQA commented on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


SparkQA commented on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674324895


   **[Test build #127468 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127468/testReport)**
 for PR 29436 at commit 
[`18cac6a`](https://github.com/apache/spark/commit/18cac6a9f0bf4a6d449393f1ee84004623b3c893).






[GitHub] [spark] SparkQA commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


SparkQA commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674328817


   **[Test build #127466 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127466/testReport)**
 for PR 29430 at commit 
[`5b1b9b3`](https://github.com/apache/spark/commit/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470854938



##
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##
@@ -428,6 +428,68 @@ public MapIterator destructiveIterator() {
 return new MapIterator(numValues, new Location(), true);
   }
 
+  /**
+   * Iterator for the entries of this map. This is to first iterate over key 
index array
+   * `longArray` then accessing values in `dataPages`. NOTE: this is different 
from `MapIterator`
+   * in the sense that key index is preserved here
+   * (See `UnsafeHashedRelation` for example of usage).
+   */
+  public final class MapIteratorWithKeyIndex implements Iterator<Location> {

Review comment:
   @viirya - you can think of the `Location` returned by 
`MapIteratorWithKeyIndex.next()` as indirectly exposing the `keyIndex`. I don't 
have a strong preference here. @cloud-fan, WDYT?








[GitHub] [spark] AmplabJenkins commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274794










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274794










[GitHub] [spark] asfgit closed pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


asfgit closed pull request #28939:
URL: https://github.com/apache/spark/pull/28939


   






[GitHub] [spark] mridulm commented on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


mridulm commented on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-674293037


   Thanks @sarutak !






[GitHub] [spark] AmplabJenkins commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674297234










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674297234










[GitHub] [spark] viirya commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470898482



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends 
KnownSizeEstimation {
 throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]
+
+  /**
+   * Returns key index and matched single row.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex

Review comment:
   Can you add a comment saying this is for the unique key case?








[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470908654



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, 
buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {
+  buildSide match {
+case BuildLeft =>
+  joinRow.withRight(streamNullRow)
+  joinRow.withLeft _
+case BuildRight =>
+  joinRow.withLeft(streamNullRow)
+  joinRow.withRight _
+  }
+}
+
+val iter = if (hashedRelation.keyIsUnique) {
+  fullOuterJoinWithUniqueKey(streamIter, hashedRelation, joinKeys, 
joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, 
streamNullRow)
+} else {
+  fullOuterJoinWithNonUniqueKey(streamIter, hashedRelation, joinKeys, 
joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, 
streamNullRow)
 }
+
+val resultProj = UnsafeProjection.create(output, output)
+iter.map { r =>
+  numOutputRows += 1
+  resultProj(r)
+}
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *Mark the matched rows from the build side as looked up.
+   *A `BitSet` is used to track matched rows with key index.
+   * 2. Process rows from build side by iterating hash relation.
+   *Filter out rows from build side being matched already,
+   *by checking key index from `BitSet`.
+   */
+  private def fullOuterJoinWithUniqueKey(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  joinKeys: UnsafeProjection,
+  joinRowWithStream: InternalRow => JoinedRow,
+  joinRowWithBuild: InternalRow => JoinedRow,
+  streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
+  buildNullRow: GenericInternalRow,
+  streamNullRow: GenericInternalRow): Iterator[InternalRow] = {
+val matchedKeys = new BitSet(hashedRelation.maxNumKeysIndex)
+
+// Process stream side with looking up hash relation
+val streamResultIter = streamIter.map { srow =>
+  joinRowWithStream(srow)
+  val keys = joinKeys(srow)
+  if (keys.anyNull) {
+joinRowWithBuild(buildNullRow)
+  } else {
+val matched = hashedRelation.getValueWithKeyIndex(keys)
+if (matched != null) {
+  val keyIndex = matched.getKeyIndex
+  val buildRow = matched.getValue
+  val joinRow = joinRowWithBuild(buildRow)
+  if (boundCondition(joinRow)) {
+matchedKeys.set(keyIndex)
+joinRow
+  } else {
+joinRowWithBuild(buildNullRow)
+  }
+} else {
+  joinRowWithBuild(buildNullRow)
+}
+  }
+}
+
+// Process build side with filtering out the matched rows
+val buildResultIter = hashedRelation.valuesWithKeyIndex().flatMap {
+  valueRowWithKeyIndex =>
+val keyIndex = valueRowWithKeyIndex.getKeyIndex
+val isMatched = matchedKeys.get(keyIndex)
+if (!isMatched) {
+  val buildRow = valueRowWithKeyIndex.getValue
+  Some(streamNullJoinRowWithBuild(buildRow))
+} else {
+  None
+}
+}
+
+streamResultIter ++ buildResultIter
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:

Review comment:
   @cloud-fan - my bad, updated.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends 
KnownSizeEstimation {
 throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674315733










[GitHub] [spark] AmplabJenkins commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674315733










[GitHub] [spark] agrawaldevesh commented on pull request #28818: [WIP][SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling

2020-08-14 Thread GitBox


agrawaldevesh commented on pull request #28818:
URL: https://github.com/apache/spark/pull/28818#issuecomment-674318466


   @holdenk can this PR be abandoned/closed now, since this is finally in?






[GitHub] [spark] wangyum opened a new pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


wangyum opened a new pull request #29436:
URL: https://github.com/apache/spark/pull/29436


   ### What changes were proposed in this pull request?
   
   This PR resets the `numPartitions` metric when DPP is enabled.
   
   
   ### Why are the changes needed?
   
   Fix an incorrect `numPartitions` metric when dynamic partition pruning is enabled.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test and manual test
   
   For [this test 
case](https://github.com/apache/spark/blob/18cac6a9f0bf4a6d449393f1ee84004623b3c893/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala#L252-L280).
   
   Before this PR:
   
![image](https://user-images.githubusercontent.com/5399861/90301798-9310b480-ded4-11ea-9294-49bcaba46f83.png)
   
   After this PR:
   
![image](https://user-images.githubusercontent.com/5399861/90301709-0fef5e80-ded4-11ea-942d-4d45d1dd15bc.png)
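
   For reference, a minimal sketch of the driver-side metric reset pattern 
(a generic illustration, not the PR's actual call site):
   
   ```scala
   import org.apache.spark.sql.execution.metric.SQLMetrics
   
   // Generic sketch: create a driver-side metric and reset it before the node
   // is re-executed, so a stale value is not carried over and summed twice.
   val numPartitions = SQLMetrics.createMetric(sparkContext, "number of partitions")
   numPartitions.set(0L) // reset before re-execution when DPP re-plans the scan
   ```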
   
   






[GitHub] [spark] SparkQA removed a comment on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274381


   **[Test build #127465 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127465/testReport)**
 for PR 29410 at commit 
[`5df88b4`](https://github.com/apache/spark/commit/5df88b4cd441116cb023d932a8023cd8ef068307).






[GitHub] [spark] AmplabJenkins commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674282565










[GitHub] [spark] SparkQA commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


SparkQA commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674282273


   **[Test build #127465 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127465/testReport)**
 for PR 29410 at commit 
[`5df88b4`](https://github.com/apache/spark/commit/5df88b4cd441116cb023d932a8023cd8ef068307).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674282565










[GitHub] [spark] mingjialiu commented on a change in pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu commented on a change in pull request #29430:
URL: https://github.com/apache/spark/pull/29430#discussion_r470889123



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##
@@ -371,6 +371,25 @@ class DataSourceV2Suite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("SPARK-32609: DataSourceV2 with different pushedfilters should be 
different") {
+def getScanExec(query: DataFrame): DataSourceV2ScanExec = {
+  query.queryExecution.executedPlan.collect {
+case d: DataSourceV2ScanExec => d
+  }.head
+}
+
+Seq(classOf[AdvancedDataSourceV2], 
classOf[JavaAdvancedDataSourceV2]).foreach { cls =>
+  withClue(cls.getName) {
+val df = spark.read.format(cls.getName).load()
+val q1 = df.select('i).filter('i > 6)
+val q2 = df.select('i).filter('i > 5)

Review comment:
   Done








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470889426



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -136,7 +137,9 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]
 
-  private val executorsPendingDecommission = new HashMap[String, 
ExecutorDecommissionInfo]
+  val executorsPendingDecommission = new HashMap[String, 
ExecutorDecommissionInfo]
+  // map of second to list of executors to clear from the above map
+  val decommissioningExecutorsToGc = new util.TreeMap[Long, 
mutable.ArrayBuffer[String]]()

Review comment:
   Sure. Any structure that lets me GC by time will do. I just wanted 
something lightweight and custom to this use case. 
   
   I expect the treemap to contain no more than 60 seconds' worth of entries, 
since things are keyed by the second and they are also cleaned up on every 
check. The check happens on every executor loss and fetch failure. But yeah, it 
is possible that if there are no failures then the entries could just sit there 
:-P. 
   
   I will change it to a Cache. Good idea. 
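
   A sketch of the time-based eviction idea with a Guava `Cache` (Guava ships 
with Spark; the 60-second window and the value type are placeholders):
   
   ```scala
   import java.util.concurrent.TimeUnit
   import com.google.common.cache.CacheBuilder
   
   // Entries silently expire 60 seconds after insertion; no manual sweep is
   // needed on executor-loss or fetch-failure events.
   val decommissioned = CacheBuilder.newBuilder()
     .expireAfterWrite(60L, TimeUnit.SECONDS)
     .build[String, String]()
   
   decommissioned.put("exec-1", "decommissioned")
   assert(decommissioned.getIfPresent("exec-1") != null) // null once expired
   ```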








[GitHub] [spark] mingjialiu commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674302449


   > I think it might make sense to add a test with a self join between the two 
dfs yields the correct results to mirror the issue observed
   
   The observed issue cannot necessarily be mirrored this way. It only happens 
when 1. an exchange exists in the optimized physical plan, and 2. the reuse 
exchange rule is applied. A sketch of such a plan shape follows.
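
   A hypothetical plan shape that satisfies both conditions (illustration 
only, not a test from this PR):
   
   ```scala
   import spark.implicits._
   
   // Two aggregates over the same v2 source each introduce an Exchange; when
   // the two sub-plans canonicalize as equal, ReuseExchange swaps one side for
   // a ReusedExchange node. The bug: scans whose pushed filters differ
   // ('i > 6 vs 'i > 5) still canonicalized as equal.
   val df = spark.read.format(classOf[AdvancedDataSourceV2].getName).load()
   val q1 = df.filter('i > 6).groupBy('i).count()
   val q2 = df.filter('i > 5).groupBy('i).count()
   q1.union(q2).explain() // inspect the plan for an (incorrect) ReusedExchange
   ```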






[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470898996



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScan.scala
##
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.connector.read.V1Scan
+import org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation
+import org.apache.spark.sql.sources.{BaseRelation, Filter, TableScan}
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScan(
+relation: JDBCRelation,
+prunedSchema: StructType,
+pushedFilters: Array[Filter]) extends V1Scan {

Review comment:
   Using ```V1Scan``` here for now because we are still calling V1 
```JDBCRDD.scanTable``` underneath. Migrating step by step.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, 
SupportsPushDownFilters, SupportsPushDownRequiredColumns}
+import org.apache.spark.sql.execution.datasources.PartitioningUtils
+import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JDBCRDD, 
JDBCRelation}
+import org.apache.spark.sql.jdbc.JdbcDialects
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScanBuilder(
+session: SparkSession,
+schema: StructType,
+jdbcOptions: JDBCOptions)
+  extends ScanBuilder with SupportsPushDownFilters with 
SupportsPushDownRequiredColumns {
+
+  private val isCaseSensitive = session.sessionState.conf.caseSensitiveAnalysis
+
+  private var pushedFilter = Array.empty[Filter]
+
+  private var prunedSchema = schema
+
+  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
+if (jdbcOptions.pushDownPredicate) {
+  val dialect = JdbcDialects.get(jdbcOptions.url)
+  val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, 
dialect).isDefined)
+  this.pushedFilter = pushed
+  unSupported

Review comment:
   Thanks for your comment. I agree that it's safer to return the original 
filters, but it seems to me that we want to push filters down to the underlying 
data source for better performance, so I guess we don't want to return the 
original filters and re-evaluate filters that the data source has already 
evaluated. 

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */

[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470866807



##
File path: python/docs/source/getting_started/index.rst
##
@@ -20,3 +20,13 @@
 Getting Started
 ===
 
+**PySpark** is the Python API for Spark.

Review comment:
   Done








[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470866885



##
File path: python/docs/source/getting_started/index.rst
##
@@ -20,3 +20,13 @@
 Getting Started
 ===
 
+**PySpark** is the Python API for Spark.
+
+This page gives an overview of the basic steps required to set up and get 
started with PySpark.
+
+.. toctree::
+   :maxdepth: 2
+
+   installation
+   package_overview

Review comment:
   Done








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470873077



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, 
buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {
+  buildSide match {
+case BuildLeft =>
+  joinRow.withRight(streamNullRow)
+  joinRow.withLeft _
+case BuildRight =>
+  joinRow.withLeft(streamNullRow)
+  joinRow.withRight _
+  }
+}
+
+val iter = if (hashedRelation.keyIsUnique) {
+  fullOuterJoinWithUniqueKey(streamIter, hashedRelation, joinKeys, 
joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, 
streamNullRow)
+} else {
+  fullOuterJoinWithNonUniqueKey(streamIter, hashedRelation, joinKeys, 
joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, 
streamNullRow)
 }
+
+val resultProj = UnsafeProjection.create(output, output)
+iter.map { r =>
+  numOutputRows += 1
+  resultProj(r)
+}
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *Mark the matched rows from the build side as looked up.
+   *A `BitSet` is used to track matched rows with key index.
+   * 2. Process rows from build side by iterating hash relation.
+   *Filter out rows from build side being matched already,
+   *by checking key index from `BitSet`.
+   */
+  private def fullOuterJoinWithUniqueKey(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  joinKeys: UnsafeProjection,
+  joinRowWithStream: InternalRow => JoinedRow,
+  joinRowWithBuild: InternalRow => JoinedRow,
+  streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
+  buildNullRow: GenericInternalRow,
+  streamNullRow: GenericInternalRow): Iterator[InternalRow] = {
+val matchedKeys = new BitSet(hashedRelation.maxNumKeysIndex)
+
+// Process stream side with looking up hash relation
+val streamResultIter = streamIter.map { srow =>
+  joinRowWithStream(srow)
+  val keys = joinKeys(srow)
+  if (keys.anyNull) {
+joinRowWithBuild(buildNullRow)
+  } else {
+val matched = hashedRelation.getValueWithKeyIndex(keys)
+if (matched != null) {
+  val keyIndex = matched.getKeyIndex
+  val buildRow = matched.getValue
+  val joinRow = joinRowWithBuild(buildRow)
+  if (boundCondition(joinRow)) {
+matchedKeys.set(keyIndex)
+joinRow
+  } else {
+joinRowWithBuild(buildNullRow)
+  }
+} else {
+  joinRowWithBuild(buildNullRow)
+}
+  }
+}
+
+// Process build side with filtering out the matched rows
+val buildResultIter = hashedRelation.valuesWithKeyIndex().flatMap {
+  valueRowWithKeyIndex =>
+val keyIndex = valueRowWithKeyIndex.getKeyIndex
+val isMatched = matchedKeys.get(keyIndex)
+if (!isMatched) {
+  val buildRow = valueRowWithKeyIndex.getValue
+  Some(streamNullJoinRowWithBuild(buildRow))
+} else {
+  None
+}
+}
+
+streamResultIter ++ buildResultIter
+  }
+
+  /**
+   * Full outer shuffled hash join with non-unique join keys:
+   * 1. Process rows from the stream side by looking up the hash relation.
+   *    Mark the matched build-side rows as looked up.
+   *    A `HashSet[Long]` is used to track matched rows, packing the
+   *    key index (Int) and value index (Int) together.
+   * 2. Process rows from the build side by iterating the hash relation.
+   *    Filter out build-side rows that were already matched,
+   *    by checking their key index and value index against the `HashSet`.
+   *
+   * The "value index" is defined as 
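
The two-phase shape described in these comments can be illustrated outside Spark. Below is a minimal, self-contained sketch of the unique-key path using plain Scala collections and hypothetical row data (a sketch of the idea, not the PR's implementation): probe the build map from the stream side while recording matches in a `BitSet`, then sweep the build side and null-pad whatever was never matched.

```
import scala.collection.mutable

object FullOuterHashJoinSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical (key, value) rows; build-side keys are unique.
    val build  = Array((1, "b1"), (2, "b2"), (4, "b4"))
    val stream = Array((1, "s1"), (3, "s3"))

    // Build the hash relation, remembering each key's slot (its "key index").
    val hashed  = build.zipWithIndex.map { case ((k, v), i) => k -> (i, v) }.toMap
    val matched = new mutable.BitSet(build.length)

    // Phase 1: probe from the stream side; null-pad when nothing matches.
    val streamResults = stream.iterator.map { case (k, sv) =>
      hashed.get(k) match {
        case Some((i, bv)) => matched += i; (Option(sv), Option(bv))
        case None          => (Option(sv), Option.empty[String])
      }
    }

    // Phase 2: sweep the build side, emitting only rows never matched above.
    val buildResults = build.iterator.zipWithIndex.collect {
      case ((_, bv), i) if !matched(i) => (Option.empty[String], Option(bv))
    }

    // Prints (Some(s1),Some(b1)), (Some(s3),None), (None,Some(b2)), (None,Some(b4))
    (streamResults ++ buildResults).foreach(println)
  }
}
```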

[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470884722



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {
+  buildSide match {
+case BuildLeft =>
+  joinRow.withRight(streamNullRow)
+  joinRow.withLeft _
+case BuildRight =>
+  joinRow.withLeft(streamNullRow)
+  joinRow.withRight _
+  }
+}
+
+val iter = if (hashedRelation.keyIsUnique) {
+  fullOuterJoinWithUniqueKey(streamIter, hashedRelation, joinKeys, joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, streamNullRow)
+} else {
+  fullOuterJoinWithNonUniqueKey(streamIter, hashedRelation, joinKeys, joinRowWithStream,
+joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, streamNullRow)
 }
+
+val resultProj = UnsafeProjection.create(output, output)
+iter.map { r =>
+  numOutputRows += 1
+  resultProj(r)
+}
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:
+   * 1. Process rows from the stream side by looking up the hash relation.
+   *    Mark the matched build-side rows as looked up.
+   *    A `BitSet` keyed by key index is used to track the matched rows.
+   * 2. Process rows from the build side by iterating the hash relation.
+   *    Filter out build-side rows that were already matched,
+   *    by checking their key index against the `BitSet`.
+   */
+  private def fullOuterJoinWithUniqueKey(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  joinKeys: UnsafeProjection,
+  joinRowWithStream: InternalRow => JoinedRow,
+  joinRowWithBuild: InternalRow => JoinedRow,
+  streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
+  buildNullRow: GenericInternalRow,
+  streamNullRow: GenericInternalRow): Iterator[InternalRow] = {
+val matchedKeys = new BitSet(hashedRelation.maxNumKeysIndex)
+
+// Process stream side with looking up hash relation
+val streamResultIter = streamIter.map { srow =>
+  joinRowWithStream(srow)
+  val keys = joinKeys(srow)
+  if (keys.anyNull) {
+joinRowWithBuild(buildNullRow)
+  } else {
+val matched = hashedRelation.getValueWithKeyIndex(keys)
+if (matched != null) {
+  val keyIndex = matched.getKeyIndex
+  val buildRow = matched.getValue
+  val joinRow = joinRowWithBuild(buildRow)
+  if (boundCondition(joinRow)) {
+matchedKeys.set(keyIndex)
+joinRow
+  } else {
+joinRowWithBuild(buildNullRow)
+  }
+} else {
+  joinRowWithBuild(buildNullRow)
+}
+  }
+}
+
+// Process build side with filtering out the matched rows
+val buildResultIter = hashedRelation.valuesWithKeyIndex().flatMap {
+  valueRowWithKeyIndex =>
+val keyIndex = valueRowWithKeyIndex.getKeyIndex
+val isMatched = matchedKeys.get(keyIndex)
+if (!isMatched) {
+  val buildRow = valueRowWithKeyIndex.getValue
+  Some(streamNullJoinRowWithBuild(buildRow))
+} else {
+  None
+}
+}
+
+streamResultIter ++ buildResultIter
+  }
+
+  /**
+   * Full outer shuffled hash join with non-unique join keys:
+   * 1. Process rows from the stream side by looking up the hash relation.
+   *    Mark the matched build-side rows as looked up.
+   *    A `HashSet[Long]` is used to track matched rows, packing the
+   *    key index (Int) and value index (Int) together.
+   * 2. Process rows from the build side by iterating the hash relation.
+   *    Filter out build-side rows that were already matched,
+   *    by checking their key index and value index against the `HashSet`.
+   *
+   * The "value index" is defined as the index of 
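
Since the non-unique path must remember both which key and which row in that key's chain matched, one `Long` can carry both indexes. Below is a minimal sketch of such an encoding, with hypothetical helper names (the PR's actual packing is not shown in the excerpt above):

```
import scala.collection.mutable

object KeyValueIndexPacking {
  // Pack (keyIndex, valueIndex) Ints into one Long so a single
  // mutable.HashSet[Long] can track which (key, value) slots matched.
  def pack(keyIndex: Int, valueIndex: Int): Long =
    (keyIndex.toLong << 32) | (valueIndex & 0xFFFFFFFFL)

  def keyIndexOf(packed: Long): Int   = (packed >>> 32).toInt
  def valueIndexOf(packed: Long): Int = packed.toInt

  def main(args: Array[String]): Unit = {
    val matched = mutable.HashSet.empty[Long]
    matched += pack(7, 2) // key index 7, third value in its chain
    assert(matched.contains(pack(7, 2)))
    assert(!matched.contains(pack(7, 1)))
    assert(keyIndexOf(pack(7, 2)) == 7 && valueIndexOf(pack(7, 2)) == 2)
    println("packing round-trips correctly")
  }
}
```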

[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470890156



##
File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -323,6 +326,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   // move forward.
   lastTaskRunningTime = System.nanoTime()
 }
+Thread.sleep(sleep_time)

Review comment:
   Yeah, no semantic change. By default we still wait for `sleep_time` on the 
first iteration and on every one after it (it is an infinite while loop that 
can only exit via `exit(1)`, i.e. process death). I just wanted the first 
sleep interval to be configurable for testing; there is no production change 
to the shutdown thread.
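
A sketch of the pattern being described, with hypothetical names rather than the PR's code: the watcher thread takes a configurable first sleep (so tests can shrink it), then settles into the fixed 1s polling loop that only ends when the process itself dies.

```
object ShutdownWatcherSketch {
  // Stand-in for the real "is any task still running?" check.
  private def taskStillRunning(): Boolean = true

  def main(args: Array[String]): Unit = {
    val sleepTimeMillis = 1000
    // Hypothetical test-only knob; production keeps the 1s default.
    val initialSleepMillis =
      sys.props.get("watcher.initialSleepMillis").map(_.toInt).getOrElse(sleepTimeMillis)

    val watcher = new Thread {
      override def run(): Unit = {
        Thread.sleep(initialSleepMillis) // the first wait, shortened in tests
        while (true) {                   // exits only with the process
          if (!taskStillRunning()) sys.exit(1)
          Thread.sleep(sleepTimeMillis)
        }
      }
    }
    watcher.setDaemon(true)
    watcher.start()
  }
}
```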








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470890206



##
File path: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -294,10 +294,13 @@ private[spark] class CoarseGrainedExecutorBackend(
 override def run(): Unit = {
   var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
-
+  val initialSleepMillis = env.conf.getInt(

Review comment:
   sure
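
For reference, `SparkConf.getInt(key, defaultValue)` is the usual way to read such a knob. The key below is a placeholder, since the actual name is cut off in the diff:

```
import org.apache.spark.SparkConf

object InitialSleepConfig {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Placeholder key; falls back to the existing 1s interval when unset.
    val initialSleepMillis =
      conf.getInt("spark.executor.decommission.initialSleepMillis", 1000)
    println(s"first shutdown-watcher sleep: $initialSleepMillis ms")
  }
}
```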








[GitHub] [spark] SparkQA commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


SparkQA commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674296775


   **[Test build #127466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127466/testReport)** for PR 29430 at commit [`5b1b9b3`](https://github.com/apache/spark/commit/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4).






[GitHub] [spark] SparkQA commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


SparkQA commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674316955


   **[Test build #127467 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127467/testReport)** for PR 29342 at commit [`cf04e2f`](https://github.com/apache/spark/commit/cf04e2f9edeb0364ffed180d49b64d5b6969ef36).






[GitHub] [spark] AmplabJenkins commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674328932










[GitHub] [spark] SparkQA removed a comment on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674296775


   **[Test build #127466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127466/testReport)** for PR 29430 at commit [`5b1b9b3`](https://github.com/apache/spark/commit/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674328932










[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470866980



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+===

Review comment:
   Done

##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+===
+Installation
+===
+
+Using Conda 
+~~

Review comment:
   Done

##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+===
+Installation
+===
+
+Using Conda 

Review comment:
   Added this line

##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+===
+Installation
+===
+
+Using Conda 
+~~
+PySpark installation using `Conda `_ can be performed with the command below::
+
+conda install -c conda-forge pyspark
+   
+Using PyPI
+~~
+PySpark installation using `PyPI `_::
+
+pip install pyspark
+
+Official release channel
+
+
+Different flavors of PySpark are available in `the official release channel `__.
+Any suitable version can be downloaded and extracted as below::
+
+tar xzvf spark-3.0.0-bin-hadoop2.7.tgz
+
+An important step is to ensure that the ``SPARK_HOME`` environment variable points to the directory where the code has been extracted. The next step is to properly define ``PYTHONPATH`` such that 

[GitHub] [spark] SparkQA commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


SparkQA commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274381


   **[Test build #127465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127465/testReport)** for PR 29410 at commit [`5df88b4`](https://github.com/apache/spark/commit/5df88b4cd441116cb023d932a8023cd8ef068307).





