[GitHub] spark issue #11207: [SPARK-12583][Mesos] Mesos shuffle service: Don't delete...

2018-02-22 Thread IgorBerman
Github user IgorBerman commented on the issue:

https://github.com/apache/spark/pull/11207
  
@bbossy So my workaround: I've disabled the external shuffle service's own cleanup, and I remove shuffle files with a cron job that finds files that were not accessed in the last X hours.

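A minimal sketch of that cleanup logic (the directory name and retention window are placeholders, and accurate atime tracking depends on the filesystem mount options; a plain `find <dir> -type f -amin +N -delete` cron entry achieves the same):

    import java.nio.file.{Files, Paths}
    import java.nio.file.attribute.BasicFileAttributes
    import java.util.concurrent.TimeUnit
    import scala.collection.JavaConverters._

    // Delete files under `dir` whose last access time is older than `hours`.
    def cleanOldShuffleFiles(dir: String, hours: Long): Unit = {
      val cutoff = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(hours)
      val stream = Files.walk(Paths.get(dir))
      try {
        stream.iterator().asScala
          .filter(p => Files.isRegularFile(p))
          .filter { p =>
            Files.readAttributes(p, classOf[BasicFileAttributes])
              .lastAccessTime.toMillis < cutoff
          }
          .foreach(p => Files.deleteIfExists(p))
      } finally {
        stream.close()
      }
    }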

---




[GitHub] spark pull request #20382: [SPARK-23097][SQL][SS] Migrate text socket source...

2018-02-22 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20382#discussion_r170178735
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketStreamSuite.scala ---
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.sources
+
+import java.io.{IOException, OutputStreamWriter}
+import java.net.ServerSocket
+import java.sql.Timestamp
+import java.util.Optional
+import java.util.concurrent.LinkedBlockingQueue
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.execution.streaming.LongOffset
+import org.apache.spark.sql.sources.v2.DataSourceOptions
+import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
+import org.apache.spark.sql.streaming.StreamTest
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
+
+class TextSocketStreamSuite extends StreamTest with SharedSQLContext with BeforeAndAfterEach {
+
+  override def afterEach() {
+    sqlContext.streams.active.foreach(_.stop())
+    if (serverThread != null) {
+      serverThread.interrupt()
+      serverThread.join()
+      serverThread = null
+    }
+    if (batchReader != null) {
+      batchReader.stop()
+      batchReader = null
+    }
+  }
+
+  private var serverThread: ServerThread = null
+  private var batchReader: MicroBatchReader = null
+
+  test("V2 basic usage") {
+    serverThread = new ServerThread()
+    serverThread.start()
+
+    val provider = new TextSocketSourceProvider
+    val options = new DataSourceOptions(
+      Map("host" -> "localhost", "port" -> serverThread.port.toString).asJava)
+    batchReader = provider.createMicroBatchReader(Optional.empty(), "", options)
+
+    val schema = batchReader.readSchema()
+    assert(schema === StructType(StructField("value", StringType) :: Nil))
+
+    failAfter(streamingTimeout) {
+      serverThread.enqueue("hello")
+      batchReader.setOffsetRange(Optional.empty(), Optional.empty())
+      while (batchReader.getEndOffset.asInstanceOf[LongOffset].offset == -1L) {
+        batchReader.setOffsetRange(Optional.empty(), Optional.empty())
+        Thread.sleep(10)
+      }
+      withSQLConf("spark.sql.streaming.unsupportedOperationCheck" -> "false") {
+        val offset1 = batchReader.getEndOffset
+        val batch1 = new ListBuffer[Row]
+        batchReader.createDataReaderFactories().asScala.map(_.createDataReader()).foreach { r =>
+          while (r.next()) {
+            batch1.append(r.get())
+          }
+        }
+        assert(batch1.map(_.getAs[String](0)) === Seq("hello"))
+
+        serverThread.enqueue("world")
+        while (batchReader.getEndOffset === offset1) {
+          batchReader.setOffsetRange(Optional.of(offset1), Optional.empty())
+          Thread.sleep(10)
+        }
+        val offset2 = batchReader.getEndOffset
+        val batch2 = new ListBuffer[Row]
+        batchReader.createDataReaderFactories().asScala.map(_.createDataReader()).foreach { r =>
+          while (r.next()) {
+            batch2.append(r.get())
+          }
+        }
+        assert(batch2.map(_.getAs[String](0)) === Seq("world"))
+
+        batchReader.setOffsetRange(Optional.empty(), Optional.of(offset2))
+        val both = new ListBuffer[Row]
+        batchReader.createDataReaderFactories().asScala.map(_.createDataReader()).foreach { r =>
+          while (r.next()) {
+            both.append(r.get())
+          }
+        }
+        as

[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2

2018-02-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20382
  
Sorry @tdas for the delay. I'm working on this, will push new changes soon.


---




[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring

2018-02-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20624#discussion_r170171868
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -387,6 +390,101 @@ case class CatalogStatistics(
   }
 }
 
+/**
+ * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore.
+ */
+case class CatalogColumnStat(
+  distinctCount: Option[BigInt] = None,
+  min: Option[String] = None,
+  max: Option[String] = None,
+  nullCount: Option[BigInt] = None,
+  avgLen: Option[Long] = None,
+  maxLen: Option[Long] = None,
+  histogram: Option[Histogram] = None) {
+
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
+   * and the value is the string representation for the value.
+   * min/max values are stored as Strings. They can be deserialized using
+   * [[ColumnStat.fromExternalString]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   * In the case min/max values are null (None), they won't appear in the map.
+   */
+  def toMap(colName: String): Map[String, String] = {
+    val map = new scala.collection.mutable.HashMap[String, String]
+    map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1")
+    distinctCount.foreach { v =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", v.toString)
+    }
+    nullCount.foreach { v =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", v.toString)
+    }
+    avgLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) }
+    maxLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) }
+    min.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) }
+    max.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) }
+    histogram.foreach { h =>
+      map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", HistogramSerializer.serialize(h))
+    }
+    map.toMap
+  }
+
+  /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */
+  def toPlanStat(
+      colName: String,
+      dataType: DataType): ColumnStat =
+    ColumnStat(
+      distinctCount = distinctCount,
+      min = min.map(ColumnStat.fromExternalString(_, colName, dataType)),
+      max = max.map(ColumnStat.fromExternalString(_, colName, dataType)),
+      nullCount = nullCount,
+      avgLen = avgLen,
+      maxLen = maxLen,
+      histogram = histogram)
+}
+
+object CatalogColumnStat extends Logging {
+
+  // List of string keys used to serialize CatalogColumnStat
+  val KEY_VERSION = "version"
+  private val KEY_DISTINCT_COUNT = "distinctCount"
+  private val KEY_MIN_VALUE = "min"
+  private val KEY_MAX_VALUE = "max"
+  private val KEY_NULL_COUNT = "nullCount"
+  private val KEY_AVG_LEN = "avgLen"
+  private val KEY_MAX_LEN = "maxLen"
+  private val KEY_HISTOGRAM = "histogram"
+
+  /**
+   * Creates a [[CatalogColumnStat]] object from the given map.
+   * This is used to deserialize column stats from some external storage.
+   * The serialization side is defined in [[CatalogColumnStat.toMap]].
+   */
+  def fromMap(
+    table: String,
+    colName: String,
+    map: Map[String, String]): Option[CatalogColumnStat] = {
+
+    try {
+      Some(CatalogColumnStat(
+        distinctCount = map.get(s"${colName}.${KEY_DISTINCT_COUNT}").map(v => BigInt(v.toLong)),
--- End diff --

IIUC the format doesn't change; we just change the way we save/restore stats in the metastore, which looks cleaner.

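For reference, a rough sketch of the roundtrip this enables, with key names taken from the diff above (illustrative only; map ordering is not significant):

    val stat = CatalogColumnStat(
      distinctCount = Some(BigInt(10)), min = Some("1"), max = Some("100"))
    stat.toMap("age")
    // => Map("age.version" -> "1", "age.distinctCount" -> "10",
    //        "age.min" -> "1", "age.max" -> "100")

    // fromMap("tbl", "age", thatMap) rebuilds the CatalogColumnStat, and
    // toPlanStat("age", IntegerType) turns the min/max strings back into
    // typed values for the planner.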

---




[GitHub] spark issue #20618: [SPARK-23329][SQL] Fix documentation of trigonometric fu...

2018-02-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20618
  
Also cc @srinathshankar 


---




[GitHub] spark pull request #20660: [SPARK-23490][SQL]Check storage.locationUri with ...

2018-02-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20660


---




[GitHub] spark issue #20660: [SPARK-23490][SQL]Check storage.locationUri with existin...

2018-02-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20660
  
Thanks! Merged to master.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1009/



---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
**[Test build #87626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87626/testReport)** for PR 20553 at commit [`8f457ce`](https://github.com/apache/spark/commit/8f457cee17ffdb478fca3d3d2ff05c343217aef4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87626/
Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1009/



---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1015/
Test PASSed.


---




[GitHub] spark issue #10942: [SPARK-12850] [SQL] Support Bucket Pruning (Predicate Pu...

2018-02-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/10942
  
FYI, this feature will be re-designed when migrating file-based data sources to data source v2. From a quick look, it seems this feature is no longer in the current code base.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1008/



---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87625/
Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
**[Test build #87625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87625/testReport)** for PR 20553 at commit [`11f94e2`](https://github.com/apache/spark/commit/11f94e2f57b6af331ec0d9e58708e03ac5a44e2a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1008/



---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1014/
Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
**[Test build #87626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87626/testReport)** for PR 20553 at commit [`8f457ce`](https://github.com/apache/spark/commit/8f457cee17ffdb478fca3d3d2ff05c343217aef4).


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r170169357
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -576,14 +576,21 @@ specific to Spark on Kubernetes.
   spark.kubernetes.driver.limit.cores
   (none)
   
-Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod.
+Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod.
--- End diff --

Done.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
**[Test build #87625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87625/testReport)** for PR 20553 at commit [`11f94e2`](https://github.com/apache/spark/commit/11f94e2f57b6af331ec0d9e58708e03ac5a44e2a).


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r170168795
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -576,14 +576,21 @@ specific to Spark on Kubernetes.
   spark.kubernetes.driver.limit.cores
   (none)
   
-Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod.
+Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod.
+  
+
+
+  spark.kubernetes.executor.cores
+  (none)
+  
+Specify the amount of CPU cores to request for each executor pod. Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu).
--- End diff --

Done.


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r170168480
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala ---
@@ -144,7 +149,7 @@ private[spark] class ExecutorPodFactory(
     val executorEnv = (Seq(
       (ENV_DRIVER_URL, driverUrl),
       // Executor backend expects integral value for executor cores, so round it up to an int.
-      (ENV_EXECUTOR_CORES, math.ceil(executorCores).toInt.toString),
+      (ENV_EXECUTOR_CORES, executorCores.toString),
--- End diff --

Done.


---




[GitHub] spark issue #20628: [SPARK-23449][K8S] Preserve extraJavaOptions ordering

2018-02-22 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/20628
  
also cc @vanzin to verify this change.


---




[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20647
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20647
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1013/
Test PASSed.


---




[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20647
  
**[Test build #87624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87624/testReport)** for PR 20647 at commit [`dbee281`](https://github.com/apache/spark/commit/dbee2813accd4c8f5937b28eb9142cc6a50f8c6a).


---




[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...

2018-02-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20648
  
Yes, thanks @HyukjinKwon for checking the behavior. If we look at the code of the JSON parser, we find many places indicating that partial results are expected to be available.

For example, `BadRecordException` has a `partialResult` field that is supposed to hold the partial result of parsing a bad record. But we never actually use it to return a partial result; we just pass `None` for it.

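For reference, a simplified sketch of the shape in question (paraphrased; the actual class in catalyst may differ slightly):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.unsafe.types.UTF8String

    case class BadRecordException(
        record: () => UTF8String,                  // raw record, for the corrupt-record column
        partialResult: () => Option[InternalRow],  // meant to carry partially parsed fields
        cause: Throwable) extends Exception(cause)

    // Today the parser effectively always throws it with partialResult = () => None.
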
Note: if we don't want to return partial results at all, we should refactor this part of the code to make that clear. If we decide not to change the current behavior, I can submit another PR to do the refactoring.



---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170162261
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -35,15 +35,14 @@ case class DataSourceV2Relation(
     options: Map[String, String],
     projection: Seq[AttributeReference],
     filters: Option[Seq[Expression]] = None,
-    userSpecifiedSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+    userSpecifiedSchema: Option[StructType] = None)
+  extends LeafNode with MultiInstanceRelation with DataSourceV2QueryPlan {
--- End diff --

Sorry, I didn't realize you removed the extending of `DataSourceReaderHolder`, otherwise I would have pointed it out in your PR. In general the question is: how should we define the equality of a data source relation? It was defined by `output, reader.getClass, filters` before, and your PR silently changed it to the default equality of `DataSourceV2Relation`.

The major difference is whether `options` should take part in the equality. The answer is obviously yes (like `path`), so I'll add `options` to the equality.

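To make that concrete, a sketch of the intended equality (illustrative only, not the actual Spark code):

    // Two relations describe the same scan only if they read the same source
    // class with the same output, the same options (e.g. `path`), and the
    // same pushed filters.
    override def equals(other: Any): Boolean = other match {
      case o: DataSourceV2Relation =>
        output == o.output &&
          source.getClass == o.source.getClass &&
          options == o.options &&
          filters == o.filters
      case _ => false
    }
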
BTW `DataSourceV2QueryPlan` is needed, as there are three plans that need to implement explain: `DataSourceV2Relation`, `StreamingDataSourceV2Relation`, and `DataSourceV2ScanExec`.

> This doesn't use the source name if it is named

I missed that, let me add it back.

> doesn't indicate that the source is v2

I'll improve it.

> it doesn't show the most important part of the scan

Do you mean the path option? First, I don't think showing all the options is a good idea, as there can be a lot of them. My future plan is to show the standard options like path, table, etc. Again, this was added silently; there is no consensus yet on how to explain a data source v2 relation, and my PR tries to focus attention on this part so we can reach one.


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170160050
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -77,31 +79,32 @@ class MicroBatchExecution(
       sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",")
 
     val _logicalPlan = analyzedPlan.transform {
-      case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) =>
-        toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+      case s @ StreamingRelation(dsV1, sourceName, output) =>
--- End diff --

This is to make the naming consistent, see https://github.com/apache/spark/pull/20647/files#diff-c2959c723f334c32806217216014362eL89

In general we don't forbid contributors from fixing code style issues in related PRs; otherwise we would need code-style-fix-only PRs, which is not the common case.


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170159543
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -107,17 +106,24 @@ case class DataSourceV2Relation(
 }
 
 /**
- * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwise identical
- * to the non-streaming relation.
+ * A specialization of [[DataSourceV2Relation]] with the streaming bit set to true.
+ *
+ * Note that, this plan has a mutable reader, so Spark won't apply operator push-down for this plan,
--- End diff --

Are you suggesting we should open a JIRA to fix the internal documentation? We usually fix it in related PRs that touch the class...


---




[GitHub] spark pull request #20658: [SPARK-23488][python] Add missing catalog methods...

2018-02-22 Thread drboyer
Github user drboyer commented on a diff in the pull request:

https://github.com/apache/spark/pull/20658#discussion_r170158422
  
--- Diff: python/pyspark/sql/catalog.py ---
@@ -28,7 +28,7 @@
 Database = namedtuple("Database", "name description locationUri")
 Table = namedtuple("Table", "name database description tableType isTemporary")
 Column = namedtuple("Column", "name description dataType nullable isPartition isBucket")
-Function = namedtuple("Function", "name description className isTemporary")
+Function = namedtuple("Function", "name database description className isTemporary")
--- End diff --

Ah yes, `database` is in the [Scala API](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function), so I added it for the sake of completeness, but I'm happy to remove it if there's a concern.


---




[GitHub] spark pull request #20519: [Spark-23240][python] Don't let python site custo...

2018-02-22 Thread bersprockets
Github user bersprockets closed the pull request at:

https://github.com/apache/spark/pull/20519


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-22 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r170140738
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -362,10 +365,10 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     assert(topNode.isLeaf === false)
     assert(topNode.stats === null)
 
-    val nodesForGroup = Map((0, Array(topNode)))
-    val treeToNodeToIndexInfo = Map((0, Map(
-      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
-    )))
+    val nodesForGroup = Map(0 -> Array(topNode))
--- End diff --

Two tests previously moved here have now been moved back. There is still _"Use soft prediction for binary classification with ordered categorical features"_, to which I have applied @srowen's comment, so the consistency argument still holds (even if weakened a bit).


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-22 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r170140280
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -359,29 +339,6 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
     assert(rootNode.stats.isEmpty)
   }
 
-  test("do not choose split that does not satisfy min instance per node requirements") {
-    // if a split does not satisfy min instances per node requirements,
-    // this split is invalid, even though the information gain of split is large.
-    val arr = Array(
-      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
--- End diff --

That's true. I have modified the input data for both tests as suggested, and "moved back" the two tests from _.../ml/tree/impl/RandomForestSuite.scala_ to _.../mllib/tree/DecisionTreeSuite.scala_, where they originally were. The whole mllib test suite passes.

As a recap, two tests have been adapted by slightly changing the input data:
- _"Multiclass classification stump with 10-ary (ordered) categorical features"_
- _"do not choose split that does not satisfy min instance per node requirements"_

_"Use soft prediction for binary classification with ordered categorical features"_ was present in two files:

1. _.../ml/classification/DecisionTreeClassifierSuite.scala_
2. _.../ml/tree/impl/RandomForestSuite.scala_

The one in 1. has been removed because it had to be adapted and was redundant, while the one in 2. has been adapted following the same principle as other tests in that file, such as _"Avoid aggregation on the last level"_.


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-22 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r170139957
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -402,20 +405,40 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
       LabeledPoint(1.0, Vectors.dense(2.0)))
     val input = sc.parallelize(arr)
 
+    val seed = 42
+    val numTrees = 1
+
     // Must set maxBins s.t. the feature will be treated as an ordered categorical feature.
     val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
       numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3), maxBins = 3)
 
-    val model = RandomForest.run(input, strategy, numTrees = 1, featureSubsetStrategy = "all",
-      seed = 42, instr = None).head
-    model.rootNode match {
-      case n: InternalNode => n.split match {
-        case s: CategoricalSplit =>
-          assert(s.leftCategories === Array(1.0))
-        case _ => throw new AssertionError("model.rootNode.split was not a CategoricalSplit")
-      }
-      case _ => throw new AssertionError("model.rootNode was not an InternalNode")
-    }
+    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy, numTrees = numTrees,
+      featureSubsetStrategy = "all")
+    val splits = RandomForest.findSplits(input, metadata, seed = seed)
+
+    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
+    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput,
+      strategy.subsamplingRate, numTrees, false, seed = seed)
+
+    val topNode = LearningNode.emptyNode(nodeIndex = 1)
+    assert(topNode.isLeaf === false)
+    assert(topNode.stats === null)
+
+    val nodesForGroup = Map(0 -> Array(topNode))
+    val treeToNodeToIndexInfo = Map(0 -> Map(
+      topNode.id -> new RandomForest.NodeIndexInfo(0, None)
+    ))
+    val nodeStack = new mutable.ArrayStack[(Int, LearningNode)]
+    val bestSplit = RandomForest.findBestSplits(baggedInput, metadata, Map(0 -> topNode),
+      nodesForGroup, treeToNodeToIndexInfo, splits, nodeStack)
+
+    assert(topNode.split.isDefined, "rootNode does not have a split")
--- End diff --

It is true, but I need to call these internal methods to initialise the structure correctly, including _rootNode_.

I have removed the only lines that did not look necessary to me:
`assert(topNode.isLeaf === false)`
`assert(topNode.stats === null)`

What do you think?


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-22 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r170138424
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -576,14 +576,21 @@ specific to Spark on Kubernetes.
   spark.kubernetes.driver.limit.cores
   (none)
   
-Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod.
+Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod.
+  
+
+
+  spark.kubernetes.executor.cores
+  (none)
+  
+Specify the amount of CPU cores to request for each executor pod. Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu).
--- End diff --

Should we mention that this value overrides `spark.executor.cores`?


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-22 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r170137631
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -576,14 +576,21 @@ specific to Spark on Kubernetes.
   spark.kubernetes.driver.limit.cores
   (none)
   
-Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod.
+Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod.
--- End diff --

I think it reads better without "the amount of", i.e. "Specify a hard limit on CPU cores for the driver pod". Same comment for the section below as well.


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-22 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r170138218
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala ---
@@ -144,7 +149,7 @@ private[spark] class ExecutorPodFactory(
     val executorEnv = (Seq(
       (ENV_DRIVER_URL, driverUrl),
       // Executor backend expects integral value for executor cores, so round it up to an int.
-      (ENV_EXECUTOR_CORES, math.ceil(executorCores).toInt.toString),
+      (ENV_EXECUTOR_CORES, executorCores.toString),
--- End diff --

The comment above is now outdated and can be removed.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4129/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-22 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/20553
  
Any comment or concern on this? Is this good to merge?


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4134/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4132/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4131/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4133 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4133/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4130/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4127/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20622
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20622
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87623/
Test PASSed.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #87623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87623/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20660: [SPARK-23490][SQL]Check storage.locationUri with existin...

2018-02-22 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/20660
  
@gatorsmile 


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4135/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4128/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20650: [SPARK-23408][SS] Synchronize successive AddData actions...

2018-02-22 Thread jose-torres
Github user jose-torres commented on the issue:

https://github.com/apache/spark/pull/20650
  
LGTM


---




[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...

2018-02-22 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/20645
  
> A quick question: after this change, extraJavaOptions is still able to cleanly override whatever's set in defaultJavaOptions, is that right?

No, the intent is for both sets of options to be passed. How the JVM interprets the options is not up to Spark. This is intended to give administrators a way to set default options so they are not accidentally overridden when a user adds `--driver-java-options`. Users can still override `defaultJavaOptions` if they need to deviate from administrator defaults.

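For illustration, a hypothetical administrator setup (property names as proposed in this PR):

    # spark-defaults.conf, maintained by the administrator:
    spark.driver.defaultJavaOptions    -XX:+UseG1GC -verbose:gc

    # A user running
    #   spark-submit --driver-java-options "-XX:MaxGCPauseMillis=500" ...
    # gets a driver JVM launched with both sets, defaults first:
    #   -XX:+UseG1GC -verbose:gc -XX:MaxGCPauseMillis=500
    # so user flags win wherever the JVM honors the last occurrence.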

---




[GitHub] spark issue #20657: [SPARK-23361][yarn] Allow AM to restart after initial to...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20657
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20657: [SPARK-23361][yarn] Allow AM to restart after initial to...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20657
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87621/
Test PASSed.


---




[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20653
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20657: [SPARK-23361][yarn] Allow AM to restart after initial to...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20657
  
**[Test build #87621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87621/testReport)** for PR 20657 at commit [`2c3448d`](https://github.com/apache/spark/commit/2c3448dd3aa4071234a65a1c9317b1a3c4fe8d24).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20653
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87622/
Test PASSed.


---




[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20653
  
**[Test build #87622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87622/testReport)** for PR 20653 at commit [`d6ac338`](https://github.com/apache/spark/commit/d6ac338a594d823a8eca0356102c6e9bf2bf09d6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170115965
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
                 case v1: SerializedOffset => reader.deserializeOffset(v1.json)
                 case v2: OffsetV2 => v2
               }
-              reader.setOffsetRange(
-                toJava(current),
-                Optional.of(availableV2))
+              reader.setOffsetRange(toJava(current), Optional.of(availableV2))
               logDebug(s"Retrieving data from $reader: $current -> $availableV2")
-              Some(reader ->
-                new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
+              Some(reader -> StreamingDataSourceV2Relation(
--- End diff --

@rdblue There was a doc as part of this SPIP: https://issues.apache.org/jira/browse/SPARK-20928, but it has definitely evolved enough since then that we should update it and send it to the dev list again.

Things like the logical plan requirement in execution will likely be significantly easier to remove once we have a full V2 API and can remove the legacy internal API for streaming.


---




[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...

2018-02-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20645
  
A quick question: after this change, `extraJavaOptions` is still able to cleanly override whatever's set in `defaultJavaOptions`, is that right?

^^ Just making sure I understood the intent correctly and not the other way around. There may well be another side of administrative needs, namely "forcing" options (e.g. forcing `-XX:-DisableExplicitGC` so that NIO direct memory doesn't get into trouble), but that'd be out of scope for this PR.


---




[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170106104
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
                 case v1: SerializedOffset => reader.deserializeOffset(v1.json)
                 case v2: OffsetV2 => v2
               }
-              reader.setOffsetRange(
-                toJava(current),
-                Optional.of(availableV2))
+              reader.setOffsetRange(toJava(current), Optional.of(availableV2))
               logDebug(s"Retrieving data from $reader: $current -> $availableV2")
-              Some(reader ->
-                new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
+              Some(reader -> StreamingDataSourceV2Relation(
--- End diff --

When it comes to the streaming execution code, the basic problem is that it 
was more evolved than designed. For example, there's no particular reason to 
use a logical plan; the map is only ever used in order to construct another map 
of source -> physical plan stats. Untangling StreamExecution is definitely 
something we need to do, but that's going to be annoying and I think it's 
sufficiently orthogonal to the V2 migration to put off.

There's currently no design doc for the streaming aspects of DataSourceV2. 
We kinda rushed an experimental version out the door, because it was coupled 
with the experimental ContinuousExecution streaming mode. I'm working on going 
back and cleaning things up; I'll send docs to the dev list and make sure to @ 
you on the changes.


---




[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...

2018-02-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20604#discussion_r170103841
  
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala ---
@@ -55,18 +55,18 @@ private[spark] trait ExecutorAllocationClient {
   /**
    * Request that the cluster manager kill the specified executors.
    *
-   * When asking the executor to be replaced, the executor loss is considered a failure, and
-   * killed tasks that are running on the executor will count towards the failure limits. If no
-   * replacement is being requested, then the tasks will not count towards the limit.
-   *
    * @param executorIds identifiers of executors to kill
-   * @param replace whether to replace the killed executors with new ones, default false
+   * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
+   *   after these executors have been killed
+   * @param countFailures if there are tasks running on the executors when they are killed, whether
--- End diff --

I'm still a little confused about this parameter.

If `force = false`, it's a no-op. And all the call sites I've seen seem to set 
this parameter to `false`. So is there something I'm missing?
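
To make the question concrete, here is an illustrative sketch (not the actual 
implementation; `executorsWithRunningTasks` is an invented stand-in for the 
scheduler's state) of how `force` gates `countFailures`:

    // Sketch: countFailures can only matter for executors that are killed
    // while still running tasks, and those are only killed when force = true.
    object KillSemanticsSketch {
      val executorsWithRunningTasks = Set("exec-1")  // hypothetical state

      def killExecutors(
          executorIds: Seq[String],
          adjustTargetNumExecutors: Boolean,
          countFailures: Boolean,
          force: Boolean = false): Seq[String] = {
        // busy executors are only killable when force = true
        val killable =
          if (force) executorIds
          else executorIds.filterNot(executorsWithRunningTasks.contains)
        // tasks lost on `killable` would count toward failure limits
        // only when countFailures = true
        killable
      }
    }

If `force` is always `false`, nothing with running tasks is ever killed, so 
`countFailures` never has an effect -- which is the point of the question.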


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...

2018-02-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20604#discussion_r170102063
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -334,6 +336,11 @@ private[spark] class ExecutorAllocationManager(
 
   // If the new target has not changed, avoid sending a message to the 
cluster manager
   if (numExecutorsTarget < oldNumExecutorsTarget) {
+// We lower the target number of executors but don't actively kill 
any yet.  Killing is
+// controlled separately by an idle timeout.  Its still helpful to 
reduce the target number
--- End diff --

nit: it's


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...

2018-02-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20604#discussion_r170102582
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -334,6 +336,11 @@ private[spark] class ExecutorAllocationManager(
 
   // If the new target has not changed, avoid sending a message to the 
cluster manager
   if (numExecutorsTarget < oldNumExecutorsTarget) {
+// We lower the target number of executors but don't actively kill 
any yet.  Killing is
+// controlled separately by an idle timeout.  Its still helpful to 
reduce the target number
+// in case an executor just happens to get lost (eg., bad 
hardware, or the cluster manager
+// preempts it) -- in that case, there is no point in trying to 
immediately  get a new
+// executor, since we couldn't even use it yet.
--- End diff --

s/couldn't/wouldn't


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...

2018-02-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20604#discussion_r170102363
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala ---
@@ -55,18 +55,18 @@ private[spark] trait ExecutorAllocationClient {
   /**
* Request that the cluster manager kill the specified executors.
*
-   * When asking the executor to be replaced, the executor loss is 
considered a failure, and
-   * killed tasks that are running on the executor will count towards the 
failure limits. If no
-   * replacement is being requested, then the tasks will not count towards 
the limit.
-   *
* @param executorIds identifiers of executors to kill
-   * @param replace whether to replace the killed executors with new ones, 
default false
+   * @param adjustTargetNumExecutors whether the target number of 
executors will be adjusted down
+   * after these executors have been killed
+   * @param countFailures if there are tasks running on the executors when 
they are killed, whether
+   *  those failures be counted to task failure limits?
--- End diff --

nit: "whether to count those failures toward task failure limits"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring

2018-02-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20624#discussion_r170102064
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -387,6 +390,101 @@ case class CatalogStatistics(
   }
 }
 
+/**
+ * This class of statistics for a column is used in [[CatalogTable]] to 
interact with metastore.
+ */
+case class CatalogColumnStat(
+  distinctCount: Option[BigInt] = None,
+  min: Option[String] = None,
+  max: Option[String] = None,
+  nullCount: Option[BigInt] = None,
+  avgLen: Option[Long] = None,
+  maxLen: Option[Long] = None,
+  histogram: Option[Histogram] = None) {
+
+  /**
+   * Returns a map from string to string that can be used to serialize the 
column stats.
+   * The key is the name of the column and name of the field (e.g. 
"colName.distinctCount"),
+   * and the value is the string representation for the value.
+   * min/max values are stored as Strings. They can be deserialized using
+   * [[ColumnStat.fromExternalString]].
+   *
+   * As part of the protocol, the returned map always contains a key 
called "version".
+   * In the case min/max values are null (None), they won't appear in the 
map.
+   */
+  def toMap(colName: String): Map[String, String] = {
+val map = new scala.collection.mutable.HashMap[String, String]
+map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1")
+distinctCount.foreach { v =>
+  map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", 
v.toString)
+}
+nullCount.foreach { v =>
+  map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", 
v.toString)
+}
+avgLen.foreach { v => 
map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) }
+maxLen.foreach { v => 
map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) }
+min.foreach { v => 
map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) }
+max.foreach { v => 
map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) }
+histogram.foreach { h =>
+  map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", 
HistogramSerializer.serialize(h))
+}
+map.toMap
+  }
+
+  /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */
+  def toPlanStat(
+  colName: String,
+  dataType: DataType): ColumnStat =
+ColumnStat(
+  distinctCount = distinctCount,
+  min = min.map(ColumnStat.fromExternalString(_, colName, dataType)),
+  max = max.map(ColumnStat.fromExternalString(_, colName, dataType)),
+  nullCount = nullCount,
+  avgLen = avgLen,
+  maxLen = maxLen,
+  histogram = histogram)
+}
+
+object CatalogColumnStat extends Logging {
+
+  // List of string keys used to serialize CatalogColumnStat
+  val KEY_VERSION = "version"
+  private val KEY_DISTINCT_COUNT = "distinctCount"
+  private val KEY_MIN_VALUE = "min"
+  private val KEY_MAX_VALUE = "max"
+  private val KEY_NULL_COUNT = "nullCount"
+  private val KEY_AVG_LEN = "avgLen"
+  private val KEY_MAX_LEN = "maxLen"
+  private val KEY_HISTOGRAM = "histogram"
+
+  /**
+   * Creates a [[CatalogColumnStat]] object from the given map.
+   * This is used to deserialize column stats from some external storage.
+   * The serialization side is defined in [[CatalogColumnStat.toMap]].
+   */
+  def fromMap(
+table: String,
+colName: String,
+map: Map[String, String]): Option[CatalogColumnStat] = {
+
+try {
+  Some(CatalogColumnStat(
+distinctCount = map.get(s"${colName}.${KEY_DISTINCT_COUNT}").map(v 
=> BigInt(v.toLong)),
--- End diff --

Could you add a test case? BTW, forward compatibility is also needed, since the 
Hive metastore is shared by different Spark versions.
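
A minimal sketch of the kind of round-trip test being asked for, using only 
the `toMap`/`fromMap` API shown in the diff above (the stats values are 
arbitrary, and the unknown-key check is one way to probe forward 
compatibility):

    test("CatalogColumnStat round-trips through its map representation") {
      val original = CatalogColumnStat(
        distinctCount = Some(BigInt(100)),
        min = Some("1"),
        max = Some("100"),
        nullCount = Some(BigInt(0)),
        avgLen = Some(4L),
        maxLen = Some(4L))
      val serialized = original.toMap("col1")
      // a map written by this version must deserialize back unchanged
      assert(CatalogColumnStat.fromMap("tbl", "col1", serialized) === Some(original))
      // forward compatibility: keys written by a newer Spark should be ignored
      val withUnknownKey = serialized + ("col1.someFutureField" -> "x")
      assert(CatalogColumnStat.fromMap("tbl", "col1", withUnknownKey) === Some(original))
    }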


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-02-22 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/20640
  
@susanxhuynh I agree that if blacklisting is not enabled we can log failures as 
usual, but not as blacklisting events: with blacklisting disabled that wouldn't 
make sense, and the user should have the option not to care about it.

> In the case where an executor fails before entering Spark code (for 
example, Mesos agent failed to create the sandbox), would it be detected?

Good question, I forgot to mention this. In this scenario a task status update 
will be given, e.g. 
[REASON_CONTAINER_LAUNCH_FAILED](https://github.com/apache/mesos/blob/5e5a8102c3281db25a37157dac123b0ca546e030/docs/task-state-reasons.md).

This is handled implicitly 
[here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L658)
 in the status update, which calls [removeExecutor 
](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L728)
, which sends a message to the driver endpoint to remove the executor. Then this 
[line](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L330)
 is called, which calls another helper 
[method](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L536),
 which in turn calls this 
[one](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L595),
 where the blacklist info is updated.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170097795
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
 case v1: SerializedOffset => reader.deserializeOffset(v1.json)
 case v2: OffsetV2 => v2
   }
-  reader.setOffsetRange(
-toJava(current),
-Optional.of(availableV2))
+  reader.setOffsetRange(toJava(current), Optional.of(availableV2))
   logDebug(s"Retrieving data from $reader: $current -> 
$availableV2")
-  Some(reader ->
-new 
StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
+  Some(reader -> StreamingDataSourceV2Relation(
--- End diff --

Sounds like this isn't something that should hold up this commit then.

Is there a design doc for what you're describing that I can read to 
familiarize myself with the issues here? I'd like to participate more on the 
streaming side as it relates to the new data source API.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4135 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4135/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4134 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4134/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4133 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4133/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4132 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4132/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4131 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4131/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4130 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4130/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4129 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4129/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4128 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4128/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #4127 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4127/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170095169
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
 case v1: SerializedOffset => reader.deserializeOffset(v1.json)
 case v2: OffsetV2 => v2
   }
-  reader.setOffsetRange(
-toJava(current),
-Optional.of(availableV2))
+  reader.setOffsetRange(toJava(current), Optional.of(availableV2))
   logDebug(s"Retrieving data from $reader: $current -> 
$availableV2")
-  Some(reader ->
-new 
StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
+  Some(reader -> StreamingDataSourceV2Relation(
--- End diff --

It's an artifact of the current implementation of streaming progress 
reporting, which assumes at a deep and hard-to-untangle level that new data is 
represented by a map of source -> logical plan.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-02-22 Thread susanxhuynh
Github user susanxhuynh commented on the issue:

https://github.com/apache/spark/pull/20640
  
@skonto We should not remove the logging. The logging 
[here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L194)
 is only available if blacklisting is enabled, but blacklisting is disabled by 
default: the BlacklistTracker object [is not 
created](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L748)
 when blacklisting is disabled. But we might still want to see a log of the 
executor failure in this case.
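
A minimal sketch of the creation pattern just described, under the assumption 
that it mirrors the linked TaskSchedulerImpl code (the names here are invented 
for illustration):

    // Simplified: the tracker -- and any logging inside it -- only exists
    // when blacklisting is enabled, which is not the default.
    class BlacklistTrackerSketch {
      def logExecutorFailure(execId: String): Unit =
        println(s"executor $execId failed; updating blacklist state")
    }

    def maybeCreateTracker(blacklistEnabled: Boolean): Option[BlacklistTrackerSketch] =
      if (blacklistEnabled) Some(new BlacklistTrackerSketch) else None

    // With the default configuration there is no tracker, hence no failure log:
    assert(maybeCreateTracker(blacklistEnabled = false).isEmpty)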

In the case where an executor fails before entering Spark code (for 
example, Mesos agent failed to create the sandbox), would it be detected?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170092263
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala
 ---
@@ -17,15 +17,12 @@
 
 package org.apache.spark.sql.streaming.continuous
 
-import java.util.UUID
--- End diff --

This is another cosmetic change that could cause conflicts.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170092091
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
@@ -492,16 +492,16 @@ class StreamSuite extends StreamTest {
 
   val explainWithoutExtended = q.explainInternal(false)
   // `extended = false` only displays the physical plan.
-  
assert("StreamingDataSourceV2Relation".r.findAllMatchIn(explainWithoutExtended).size
 === 0)
-  
assert("DataSourceV2Scan".r.findAllMatchIn(explainWithoutExtended).size === 1)
+  assert("Streaming 
Relation".r.findAllMatchIn(explainWithoutExtended).size === 0)
+  assert("Scan 
FakeDataSourceV2".r.findAllMatchIn(explainWithoutExtended).size === 1)
--- End diff --

Why is this using the fake?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170091954
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
 case v1: SerializedOffset => reader.deserializeOffset(v1.json)
 case v2: OffsetV2 => v2
   }
-  reader.setOffsetRange(
-toJava(current),
-Optional.of(availableV2))
+  reader.setOffsetRange(toJava(current), Optional.of(availableV2))
   logDebug(s"Retrieving data from $reader: $current -> 
$availableV2")
-  Some(reader ->
-new 
StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
+  Some(reader -> StreamingDataSourceV2Relation(
--- End diff --

I realize that this is a pre-existing problem, but why is it necessary to 
create a relation from a reader here? The addition of `FakeDataSourceV2` and 
the `readerToDataSourceMap` aren't unreasonable, because the relation should have 
a reference to the `DataSourceV2` instance, but I doubt that the relation 
should be created here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #87623 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87623/testReport)**
 for PR 20622 at commit 
[`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20622
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20622
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87620/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...

2018-02-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20622
  
**[Test build #87620 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87620/testReport)**
 for PR 20622 at commit 
[`0e5e52f`](https://github.com/apache/spark/commit/0e5e52f2c6b934a372d098a0d7780da18d3f99e0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170090758
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
 case v1: SerializedOffset => reader.deserializeOffset(v1.json)
 case v2: OffsetV2 => v2
   }
-  reader.setOffsetRange(
-toJava(current),
-Optional.of(availableV2))
+  reader.setOffsetRange(toJava(current), Optional.of(availableV2))
--- End diff --

This is another style-only change that will cause conflicts.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170090323
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -77,31 +79,32 @@ class MicroBatchExecution(
   
sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",")
 
 val _logicalPlan = analyzedPlan.transform {
-  case streamingRelation@StreamingRelation(dataSourceV1, sourceName, 
output) =>
-toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
+  case s @ StreamingRelation(dsV1, sourceName, output) =>
--- End diff --

These renames are cosmetic, and unnecessary changes like this cause commit 
conflicts. I think these changes should be reverted.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170089913
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala
 ---
@@ -26,7 +26,7 @@ object PushDownOperatorsToDataSource extends 
Rule[LogicalPlan] {
   override def apply(
   plan: LogicalPlan): LogicalPlan = plan transformUp {
 // PhysicalOperation guarantees that filters are deterministic; no 
need to check
-case PhysicalOperation(project, newFilters, relation : 
DataSourceV2Relation) =>
+case PhysicalOperation(project, newFilters, relation: 
DataSourceV2Relation) =>
--- End diff --

While I'd rather not have this space there either (it looks like an accident), 
there are no other changes to this file, so this "fix" is not necessary. The 
risk of causing commit conflicts outweighs the benefit of conforming to 
style, so it should be removed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170089465
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
 ---
@@ -23,11 +23,11 @@ import org.apache.spark.sql.execution.SparkPlan
 
 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-case relation: DataSourceV2Relation =>
-  DataSourceV2ScanExec(relation.output, relation.reader) :: Nil
+case r: DataSourceV2Relation =>
--- End diff --

Why rename this variable?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170089368
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -107,17 +106,24 @@ case class DataSourceV2Relation(
 }
 
 /**
- * A specialization of DataSourceV2Relation with the streaming bit set to 
true. Otherwise identical
- * to the non-streaming relation.
+ * A specialization of [[DataSourceV2Relation]] with the streaming bit set 
to true.
+ *
+ * Note that, this plan has a mutable reader, so Spark won't apply 
operator push-down for this plan,
--- End diff --

I'm -0 on including this in a PR to improve explain results: it could 
needlessly cause commit conflicts when maintaining a branch. This change is 
small and needs to go in somewhere, but I would still remove it from this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20650: [SPARK-23408][SS] Synchronize successive AddData actions...

2018-02-22 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20650
  
All the failures above can be attributed to other flakiness, unrelated to 
the flakiness this PR is trying to address.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-02-22 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20647#discussion_r170088798
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -35,15 +35,14 @@ case class DataSourceV2Relation(
 options: Map[String, String],
 projection: Seq[AttributeReference],
 filters: Option[Seq[Expression]] = None,
-userSpecifiedSchema: Option[StructType] = None) extends LeafNode with 
MultiInstanceRelation {
+userSpecifiedSchema: Option[StructType] = None)
+  extends LeafNode with MultiInstanceRelation with DataSourceV2QueryPlan {
--- End diff --

Extending `DataSourceV2QueryPlan` changes the definition of equality, which 
I don't think is correct. At a minimum, I would say that two relations are 
equal only if they produce the same sequence of records. Equality as implemented 
in this PR would allow completely different folders of data to be considered 
equal as long as they produce the same schema and have the same filters, as the 
sketch below illustrates.
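
A hypothetical reduction of that equality pitfall (all names are invented; 
this is not the PR's code, just the shape of the problem):

    // A relation whose equality ignores where the data lives, mirroring
    // equality based only on schema and filters.
    case class SketchRelation(path: String, schema: Seq[String], filters: Seq[String]) {
      override def equals(other: Any): Boolean = other match {
        case SketchRelation(_, s, f) => s == schema && f == filters  // path ignored
        case _ => false
      }
      override def hashCode(): Int = (schema, filters).hashCode()
    }

    val clicks      = SketchRelation("/data/clicks", Seq("id: int"), Seq("id > 0"))
    val impressions = SketchRelation("/data/impressions", Seq("id: int"), Seq("id > 0"))
    assert(clicks == impressions)  // different data, yet considered "equal"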

In general, I don't see the utility of `DataSourceV2QueryPlan` for the 
purpose of this PR, which is to improve explain results. This doesn't use the 
source name if it is named, which is a regression in the explain results. It 
also doesn't indicate that the source is v2. And finally, it doesn't show the 
most important part of the scan, which is where the data comes from.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


