[GitHub] spark issue #11207: [SPARK-12583][Mesos] Mesos shuffle service: Don't delete...
Github user IgorBerman commented on the issue: https://github.com/apache/spark/pull/11207 @bbossy so my work around: ive disabled cleanup of external shuffle service and removing shuffle files by cron that finds files that were not accessed in last X hours. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20382: [SPARK-23097][SQL][SS] Migrate text socket source...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20382#discussion_r170178735 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketStreamSuite.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.sources + +import java.io.{IOException, OutputStreamWriter} +import java.net.ServerSocket +import java.sql.Timestamp +import java.util.Optional +import java.util.concurrent.LinkedBlockingQueue + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ListBuffer + +import org.scalatest.BeforeAndAfterEach + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{AnalysisException, Row} +import org.apache.spark.sql.execution.streaming.LongOffset +import org.apache.spark.sql.sources.v2.DataSourceOptions +import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader +import org.apache.spark.sql.streaming.StreamTest +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType} + +class TextSocketStreamSuite extends StreamTest with SharedSQLContext with BeforeAndAfterEach { + + override def afterEach() { +sqlContext.streams.active.foreach(_.stop()) +if (serverThread != null) { + serverThread.interrupt() + serverThread.join() + serverThread = null +} +if (batchReader != null) { + batchReader.stop() + batchReader = null +} + } + + private var serverThread: ServerThread = null + private var batchReader: MicroBatchReader = null + + test("V2 basic usage") { +serverThread = new ServerThread() +serverThread.start() + +val provider = new TextSocketSourceProvider +val options = new DataSourceOptions( + Map("host" -> "localhost", "port" -> serverThread.port.toString).asJava) +batchReader = provider.createMicroBatchReader(Optional.empty(), "", options) + +val schema = batchReader.readSchema() +assert(schema === StructType(StructField("value", StringType) :: Nil)) + +failAfter(streamingTimeout) { + serverThread.enqueue("hello") + batchReader.setOffsetRange(Optional.empty(), Optional.empty()) + while (batchReader.getEndOffset.asInstanceOf[LongOffset].offset == -1L) { +batchReader.setOffsetRange(Optional.empty(), Optional.empty()) +Thread.sleep(10) + } + withSQLConf("spark.sql.streaming.unsupportedOperationCheck" -> "false") { +val offset1 = batchReader.getEndOffset +val batch1 = new ListBuffer[Row] + batchReader.createDataReaderFactories().asScala.map(_.createDataReader()).foreach { r => + while (r.next()) { +batch1.append(r.get()) + } +} +assert(batch1.map(_.getAs[String](0)) === Seq("hello")) + +serverThread.enqueue("world") +while (batchReader.getEndOffset === offset1) { + batchReader.setOffsetRange(Optional.of(offset1), Optional.empty()) + Thread.sleep(10) +} +val offset2 = batchReader.getEndOffset +val batch2 = new ListBuffer[Row] + batchReader.createDataReaderFactories().asScala.map(_.createDataReader()).foreach { r => + while (r.next()) { +batch2.append(r.get()) + } +} +assert(batch2.map(_.getAs[String](0)) === Seq("world")) + +batchReader.setOffsetRange(Optional.empty(), Optional.of(offset2)) +val both = new ListBuffer[Row] + batchReader.createDataReaderFactories().asScala.map(_.createDataReader()).foreach { r => + while (r.next()) { +both.append(r.get()) + } +} +as
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20382 Sorry @tdas for the delay. I'm working on this, will push new changes soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20624#discussion_r170171868 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -387,6 +390,101 @@ case class CatalogStatistics( } } +/** + * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore. + */ +case class CatalogColumnStat( + distinctCount: Option[BigInt] = None, + min: Option[String] = None, + max: Option[String] = None, + nullCount: Option[BigInt] = None, + avgLen: Option[Long] = None, + maxLen: Option[Long] = None, + histogram: Option[Histogram] = None) { + + /** + * Returns a map from string to string that can be used to serialize the column stats. + * The key is the name of the column and name of the field (e.g. "colName.distinctCount"), + * and the value is the string representation for the value. + * min/max values are stored as Strings. They can be deserialized using + * [[ColumnStat.fromExternalString]]. + * + * As part of the protocol, the returned map always contains a key called "version". + * In the case min/max values are null (None), they won't appear in the map. + */ + def toMap(colName: String): Map[String, String] = { +val map = new scala.collection.mutable.HashMap[String, String] +map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1") +distinctCount.foreach { v => + map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", v.toString) +} +nullCount.foreach { v => + map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", v.toString) +} +avgLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) } +maxLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) } +min.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) } +max.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) } +histogram.foreach { h => + map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", HistogramSerializer.serialize(h)) +} +map.toMap + } + + /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */ + def toPlanStat( + colName: String, + dataType: DataType): ColumnStat = +ColumnStat( + distinctCount = distinctCount, + min = min.map(ColumnStat.fromExternalString(_, colName, dataType)), + max = max.map(ColumnStat.fromExternalString(_, colName, dataType)), + nullCount = nullCount, + avgLen = avgLen, + maxLen = maxLen, + histogram = histogram) +} + +object CatalogColumnStat extends Logging { + + // List of string keys used to serialize CatalogColumnStat + val KEY_VERSION = "version" + private val KEY_DISTINCT_COUNT = "distinctCount" + private val KEY_MIN_VALUE = "min" + private val KEY_MAX_VALUE = "max" + private val KEY_NULL_COUNT = "nullCount" + private val KEY_AVG_LEN = "avgLen" + private val KEY_MAX_LEN = "maxLen" + private val KEY_HISTOGRAM = "histogram" + + /** + * Creates a [[CatalogColumnStat]] object from the given map. + * This is used to deserialize column stats from some external storage. + * The serialization side is defined in [[CatalogColumnStat.toMap]]. + */ + def fromMap( +table: String, +colName: String, +map: Map[String, String]): Option[CatalogColumnStat] = { + +try { + Some(CatalogColumnStat( +distinctCount = map.get(s"${colName}.${KEY_DISTINCT_COUNT}").map(v => BigInt(v.toLong)), --- End diff -- IIUC the format doesn't change, we just change the way to save/restore stats in metastore, which looks cleaner. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20618: [SPARK-23329][SQL] Fix documentation of trigonometric fu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20618 Also cc @srinathshankar --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20660: [SPARK-23490][SQL]Check storage.locationUri with ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20660 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20660: [SPARK-23490][SQL]Check storage.locationUri with existin...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20660 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1009/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 **[Test build #87626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87626/testReport)** for PR 20553 at commit [`8f457ce`](https://github.com/apache/spark/commit/8f457cee17ffdb478fca3d3d2ff05c343217aef4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87626/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1009/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1015/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #10942: [SPARK-12850] [SQL] Support Bucket Pruning (Predicate Pu...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/10942 FYI this feature will be re-designed when migrating file-based data sources to data source v2. From a quick look seems this feature is not in the current code base anymore. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1008/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87625/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 **[Test build #87625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87625/testReport)** for PR 20553 at commit [`11f94e2`](https://github.com/apache/spark/commit/11f94e2f57b6af331ec0d9e58708e03ac5a44e2a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1008/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1014/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20553 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 **[Test build #87626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87626/testReport)** for PR 20553 at commit [`8f457ce`](https://github.com/apache/spark/commit/8f457cee17ffdb478fca3d3d2ff05c343217aef4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20553#discussion_r170169357 --- Diff: docs/running-on-kubernetes.md --- @@ -576,14 +576,21 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod. --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20553 **[Test build #87625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87625/testReport)** for PR 20553 at commit [`11f94e2`](https://github.com/apache/spark/commit/11f94e2f57b6af331ec0d9e58708e03ac5a44e2a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20553#discussion_r170168795 --- Diff: docs/running-on-kubernetes.md --- @@ -576,14 +576,21 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod. + + + + spark.kubernetes.executor.cores + (none) + +Specify the amount of CPU cores to request for each executor pod. Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu). --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20553#discussion_r170168480 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala --- @@ -144,7 +149,7 @@ private[spark] class ExecutorPodFactory( val executorEnv = (Seq( (ENV_DRIVER_URL, driverUrl), // Executor backend expects integral value for executor cores, so round it up to an int. - (ENV_EXECUTOR_CORES, math.ceil(executorCores).toInt.toString), + (ENV_EXECUTOR_CORES, executorCores.toString), --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20628: [SPARK-23449][K8S] Preserve extraJavaOptions ordering
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20628 also cc @vanzin to verify this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1013/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87624/testReport)** for PR 20647 at commit [`dbee281`](https://github.com/apache/spark/commit/dbee2813accd4c8f5937b28eb9142cc6a50f8c6a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20648: [SPARK-23448][SQL] JSON parser should return partial row...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20648 Yes, thanks @HyukjinKwon for checking the behavior. If we look at the codes of JSON parser, we will find many places indicating the expectation of availability of partial results. For example in `BadRecordException`, there is `partialResult` which is supposed to hold partial result of parsing a bad record. But we never really use it to return partial result but just use `None` for it. Note: If we don't want to return partial result at all, we should refactor this part of code to make it clear. If we decide not to change current behavior, I can submit another PR to do refactoring. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170162261 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -35,15 +35,14 @@ case class DataSourceV2Relation( options: Map[String, String], projection: Seq[AttributeReference], filters: Option[Seq[Expression]] = None, -userSpecifiedSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { +userSpecifiedSchema: Option[StructType] = None) + extends LeafNode with MultiInstanceRelation with DataSourceV2QueryPlan { --- End diff -- Sorry I didn't realize you removed the extending of `DataSourceReaderHolder`, otherwise I would point it out in your PR. In general the question is, how should we define the equality of data source relation? It was defined by `output, reader.getClass, filters` before, and your PR changed it to the default equality of `DataSourceV2Relation` silently. The major difference is, should `options` take part in the equality? The answer is obviously yes, like `path`, so I'll add `options` to the equality. BTW `DataSourceV2QueryPlan` is needed, as there are 3 plans need to implement explain: `DataSourceV2Relation`, `StreamingDataSourceV2Relation`, `DataSourceV2ScanExec`. > This doesn't use the source name if it is named I missed that, let me add back. > doesn't indicate that the source is v2 I'll improve it > it doesn't show the most important part of the scan Do you mean the path option? First I don't think showing all the options is a good idea, as it can be a lot. My future plan is to show these standard options like path, table, etc. Again this is something added silently, there is no consensus about how to explain a data source v2 relation, my PR tries to have people focus on this part and have a consensus. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170160050 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -77,31 +79,32 @@ class MicroBatchExecution( sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",") val _logicalPlan = analyzedPlan.transform { - case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) => -toExecutionRelationMap.getOrElseUpdate(streamingRelation, { + case s @ StreamingRelation(dsV1, sourceName, output) => --- End diff -- This is to make the naming consistent, see https://github.com/apache/spark/pull/20647/files#diff-c2959c723f334c32806217216014362eL89 In general we don't forbid users to fix some code style issue in related PRs, otherwise we need to have code-style-fix-only PRs, which is not the common case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170159543 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -107,17 +106,24 @@ case class DataSourceV2Relation( } /** - * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwise identical - * to the non-streaming relation. + * A specialization of [[DataSourceV2Relation]] with the streaming bit set to true. + * + * Note that, this plan has a mutable reader, so Spark won't apply operator push-down for this plan, --- End diff -- are you suggesting we should open a JIRA to fix the internal document? We usually fix them in related PRs that touch the class... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20658: [SPARK-23488][python] Add missing catalog methods...
Github user drboyer commented on a diff in the pull request: https://github.com/apache/spark/pull/20658#discussion_r170158422 --- Diff: python/pyspark/sql/catalog.py --- @@ -28,7 +28,7 @@ Database = namedtuple("Database", "name description locationUri") Table = namedtuple("Table", "name database description tableType isTemporary") Column = namedtuple("Column", "name description dataType nullable isPartition isBucket") -Function = namedtuple("Function", "name description className isTemporary") +Function = namedtuple("Function", "name database description className isTemporary") --- End diff -- Ah yes, `database` is in the [Scala api](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function) so I added it in for the sake of completeness, but I'm happy to remove it if there's a concern. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20519: [Spark-23240][python] Don't let python site custo...
Github user bersprockets closed the pull request at: https://github.com/apache/spark/pull/20519 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user asolimando commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r170140738 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -362,10 +365,10 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { assert(topNode.isLeaf === false) assert(topNode.stats === null) -val nodesForGroup = Map((0, Array(topNode))) -val treeToNodeToIndexInfo = Map((0, Map( - (topNode.id, new RandomForest.NodeIndexInfo(0, None)) -))) +val nodesForGroup = Map(0 -> Array(topNode)) --- End diff -- Two tests previously moved here have now been moved back, there is still _"Use soft prediction for binary classification with ordered categorical features"_ to which I have applied @srowen 's comment, so the consistency argument still holds (even if weakened a bit). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user asolimando commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r170140280 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala --- @@ -359,29 +339,6 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext { assert(rootNode.stats.isEmpty) } - test("do not choose split that does not satisfy min instance per node requirements") { -// if a split does not satisfy min instances per node requirements, -// this split is invalid, even though the information gain of split is large. -val arr = Array( - LabeledPoint(0.0, Vectors.dense(0.0, 1.0)), --- End diff -- That's true , I have modified the input data for both tests as suggested, and "moved back" the two tests from _.../ml/tree/impl/RandomForestSuite.scala_ to _.../mllib/tree/DecisionTreeSuite.scala_ where they originally were. The whole suite of tests for mllib passes. As a recap, 2 tests have been adapted by slightly changing the input data: - _"Multiclass classification stump with 10-ary (ordered) categorical features"_ - _"do not choose split that does not satisfy min instance per node requirements"_ _"Use soft prediction for binary classification with ordered categorical features"_ was present in two files: 1. _.../ml/classification/DecisionTreeClassifierSuite.scala_ 2. _.../ml/tree/impl/RandomForestSuite.scala_ The one in 1. has been removed because it had to be adapted and it was redundant, while the one in 2. has been adapted following the same principle of other tests in that file such as _"Avoid aggregation on the last level" test, for instance"_. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user asolimando commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r170139957 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -402,20 +405,40 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { LabeledPoint(1.0, Vectors.dense(2.0))) val input = sc.parallelize(arr) +val seed = 42 +val numTrees = 1 + // Must set maxBins s.t. the feature will be treated as an ordered categorical feature. val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1, numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3), maxBins = 3) -val model = RandomForest.run(input, strategy, numTrees = 1, featureSubsetStrategy = "all", - seed = 42, instr = None).head -model.rootNode match { - case n: InternalNode => n.split match { -case s: CategoricalSplit => - assert(s.leftCategories === Array(1.0)) -case _ => throw new AssertionError("model.rootNode.split was not a CategoricalSplit") - } - case _ => throw new AssertionError("model.rootNode was not an InternalNode") -} +val metadata = DecisionTreeMetadata.buildMetadata(input, strategy, numTrees = numTrees, + featureSubsetStrategy = "all") +val splits = RandomForest.findSplits(input, metadata, seed = seed) + +val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata) +val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, + strategy.subsamplingRate, numTrees, false, seed = seed) + +val topNode = LearningNode.emptyNode(nodeIndex = 1) +assert(topNode.isLeaf === false) +assert(topNode.stats === null) + +val nodesForGroup = Map(0 -> Array(topNode)) +val treeToNodeToIndexInfo = Map(0 -> Map( + topNode.id -> new RandomForest.NodeIndexInfo(0, None) +)) +val nodeStack = new mutable.ArrayStack[(Int, LearningNode)] +val bestSplit = RandomForest.findBestSplits(baggedInput, metadata, Map(0 -> topNode), + nodesForGroup, treeToNodeToIndexInfo, splits, nodeStack) + +assert(topNode.split.isDefined, "rootNode does not have a split") --- End diff -- It is true but I need to call these internal methods to initialise the structure correctly, including _rootNode_. I have removed the only lines that did not look necessary to me: ` assert(topNode.isLeaf === false)` ` assert(topNode.stats === null)` What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20553#discussion_r170138424 --- Diff: docs/running-on-kubernetes.md --- @@ -576,14 +576,21 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod. + + + + spark.kubernetes.executor.cores + (none) + +Specify the amount of CPU cores to request for each executor pod. Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu). --- End diff -- Should we mention that this value overrides `spark.executor.cores`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20553#discussion_r170137631 --- Diff: docs/running-on-kubernetes.md --- @@ -576,14 +576,21 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) on the amount of CPU cores for the driver pod. --- End diff -- I think it reads better without "the amount of", i.e. "Specify a hard limit on CPU cores for the driver pod". Same comment for the below section as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20553#discussion_r170138218 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala --- @@ -144,7 +149,7 @@ private[spark] class ExecutorPodFactory( val executorEnv = (Seq( (ENV_DRIVER_URL, driverUrl), // Executor backend expects integral value for executor cores, so round it up to an int. - (ENV_EXECUTOR_CORES, math.ceil(executorCores).toInt.toString), + (ENV_EXECUTOR_CORES, executorCores.toString), --- End diff -- The comment above is now outdated and can be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4129/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/20553 Any comment or concern on this? Is this good to merge? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4134/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4132/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4131/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4133 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4133/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4130/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4127/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20622 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20622 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87623/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #87623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87623/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20660: [SPARK-23490][SQL]Check storage.locationUri with existin...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20660 @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4135/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4128/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20650: [SPARK-23408][SS] Synchronize successive AddData actions...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20650 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20645 > A quick question: after this change, extraJavaOptions is still able to cleanly override whatever's set in defaultJavaOptions, is that right? No, the intent is for both sets of options to be passed. How the JVM interprets the options is not up to Spark. This is intended to provide a way for administrators to default properties so they are not accidentally overridden when a user adds `--driver-java-options`. Users can still override `defaultJavaOptions` if they need to deviate from adminstrator defaults. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20657: [SPARK-23361][yarn] Allow AM to restart after initial to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20657 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20657: [SPARK-23361][yarn] Allow AM to restart after initial to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87621/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20653 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20657: [SPARK-23361][yarn] Allow AM to restart after initial to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20657 **[Test build #87621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87621/testReport)** for PR 20657 at commit [`2c3448d`](https://github.com/apache/spark/commit/2c3448dd3aa4071234a65a1c9317b1a3c4fe8d24). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20653 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87622/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20653: [SPARK-23459][SQL] Improve the error message when unknow...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20653 **[Test build #87622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87622/testReport)** for PR 20653 at commit [`d6ac338`](https://github.com/apache/spark/commit/d6ac338a594d823a8eca0356102c6e9bf2bf09d6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170115965 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -415,12 +418,14 @@ class MicroBatchExecution( case v1: SerializedOffset => reader.deserializeOffset(v1.json) case v2: OffsetV2 => v2 } - reader.setOffsetRange( -toJava(current), -Optional.of(availableV2)) + reader.setOffsetRange(toJava(current), Optional.of(availableV2)) logDebug(s"Retrieving data from $reader: $current -> $availableV2") - Some(reader -> -new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader)) + Some(reader -> StreamingDataSourceV2Relation( --- End diff -- @rdblue There was a doc as part of this SPIP: https://issues.apache.org/jira/browse/SPARK-20928, but it has definitely evolved enough past that we should update and send to the dev list again. Things like the logical plan requirement in execution will likely be significantly easier to remove once we have a full V2 API and can remove the legacy internal API for streaming. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20645 A quick question: after this change, `extraJavaOptions` is still able to cleanly override whatever's set in `defaultJavaOptions`, is that right? ^^ Just making sure I understood the intent correctly and not the other way around. There may well be the other side of administrative needs which is to "force options", e.g. force `-XX:-DisableExplicitGC` so that NIO direct memory doesn't get into trouble, but that'd be out of scope of this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170106104 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -415,12 +418,14 @@ class MicroBatchExecution( case v1: SerializedOffset => reader.deserializeOffset(v1.json) case v2: OffsetV2 => v2 } - reader.setOffsetRange( -toJava(current), -Optional.of(availableV2)) + reader.setOffsetRange(toJava(current), Optional.of(availableV2)) logDebug(s"Retrieving data from $reader: $current -> $availableV2") - Some(reader -> -new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader)) + Some(reader -> StreamingDataSourceV2Relation( --- End diff -- When it comes to the streaming execution code, the basic problem is that it was more evolved than designed. For example, there's no particular reason to use a logical plan; the map is only ever used in order to construct another map of source -> physical plan stats. Untangling StreamExecution is definitely something we need to do, but that's going to be annoying and I think it's sufficiently orthogonal to the V2 migration to put off. There's currently no design doc for the streaming aspects of DataSourceV2. We kinda rushed an experimental version out the door, because it was coupled with the experimental ContinuousExecution streaming mode. I'm working on going back and cleaning things up; I'll send docs to the dev list and make sure to @ you on the changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20604#discussion_r170103841 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala --- @@ -55,18 +55,18 @@ private[spark] trait ExecutorAllocationClient { /** * Request that the cluster manager kill the specified executors. * - * When asking the executor to be replaced, the executor loss is considered a failure, and - * killed tasks that are running on the executor will count towards the failure limits. If no - * replacement is being requested, then the tasks will not count towards the limit. - * * @param executorIds identifiers of executors to kill - * @param replace whether to replace the killed executors with new ones, default false + * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down + * after these executors have been killed + * @param countFailures if there are tasks running on the executors when they are killed, whether --- End diff -- I'm still a little confused about this parameter. If `force = false`, it's a no op. And all call sites I've seen seem to set this parameter to `false`. So is there something I'm missing? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20604#discussion_r170102063 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala --- @@ -334,6 +336,11 @@ private[spark] class ExecutorAllocationManager( // If the new target has not changed, avoid sending a message to the cluster manager if (numExecutorsTarget < oldNumExecutorsTarget) { +// We lower the target number of executors but don't actively kill any yet. Killing is +// controlled separately by an idle timeout. Its still helpful to reduce the target number --- End diff -- nit: it's --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20604#discussion_r170102582 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala --- @@ -334,6 +336,11 @@ private[spark] class ExecutorAllocationManager( // If the new target has not changed, avoid sending a message to the cluster manager if (numExecutorsTarget < oldNumExecutorsTarget) { +// We lower the target number of executors but don't actively kill any yet. Killing is +// controlled separately by an idle timeout. Its still helpful to reduce the target number +// in case an executor just happens to get lost (eg., bad hardware, or the cluster manager +// preempts it) -- in that case, there is no point in trying to immediately get a new +// executor, since we couldn't even use it yet. --- End diff -- s/couldn't/wouldn't --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20604: [SPARK-23365][CORE] Do not adjust num executors w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20604#discussion_r170102363 --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala --- @@ -55,18 +55,18 @@ private[spark] trait ExecutorAllocationClient { /** * Request that the cluster manager kill the specified executors. * - * When asking the executor to be replaced, the executor loss is considered a failure, and - * killed tasks that are running on the executor will count towards the failure limits. If no - * replacement is being requested, then the tasks will not count towards the limit. - * * @param executorIds identifiers of executors to kill - * @param replace whether to replace the killed executors with new ones, default false + * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down + * after these executors have been killed + * @param countFailures if there are tasks running on the executors when they are killed, whether + * those failures be counted to task failure limits? --- End diff -- nit: "whether to count those failures toward task failure limits" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20624#discussion_r170102064 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -387,6 +390,101 @@ case class CatalogStatistics( } } +/** + * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore. + */ +case class CatalogColumnStat( + distinctCount: Option[BigInt] = None, + min: Option[String] = None, + max: Option[String] = None, + nullCount: Option[BigInt] = None, + avgLen: Option[Long] = None, + maxLen: Option[Long] = None, + histogram: Option[Histogram] = None) { + + /** + * Returns a map from string to string that can be used to serialize the column stats. + * The key is the name of the column and name of the field (e.g. "colName.distinctCount"), + * and the value is the string representation for the value. + * min/max values are stored as Strings. They can be deserialized using + * [[ColumnStat.fromExternalString]]. + * + * As part of the protocol, the returned map always contains a key called "version". + * In the case min/max values are null (None), they won't appear in the map. + */ + def toMap(colName: String): Map[String, String] = { +val map = new scala.collection.mutable.HashMap[String, String] +map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1") +distinctCount.foreach { v => + map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", v.toString) +} +nullCount.foreach { v => + map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", v.toString) +} +avgLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) } +maxLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) } +min.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) } +max.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) } +histogram.foreach { h => + map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", HistogramSerializer.serialize(h)) +} +map.toMap + } + + /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */ + def toPlanStat( + colName: String, + dataType: DataType): ColumnStat = +ColumnStat( + distinctCount = distinctCount, + min = min.map(ColumnStat.fromExternalString(_, colName, dataType)), + max = max.map(ColumnStat.fromExternalString(_, colName, dataType)), + nullCount = nullCount, + avgLen = avgLen, + maxLen = maxLen, + histogram = histogram) +} + +object CatalogColumnStat extends Logging { + + // List of string keys used to serialize CatalogColumnStat + val KEY_VERSION = "version" + private val KEY_DISTINCT_COUNT = "distinctCount" + private val KEY_MIN_VALUE = "min" + private val KEY_MAX_VALUE = "max" + private val KEY_NULL_COUNT = "nullCount" + private val KEY_AVG_LEN = "avgLen" + private val KEY_MAX_LEN = "maxLen" + private val KEY_HISTOGRAM = "histogram" + + /** + * Creates a [[CatalogColumnStat]] object from the given map. + * This is used to deserialize column stats from some external storage. + * The serialization side is defined in [[CatalogColumnStat.toMap]]. + */ + def fromMap( +table: String, +colName: String, +map: Map[String, String]): Option[CatalogColumnStat] = { + +try { + Some(CatalogColumnStat( +distinctCount = map.get(s"${colName}.${KEY_DISTINCT_COUNT}").map(v => BigInt(v.toLong)), --- End diff -- Could you add a test case? BTW, forwards compatibility is also needed since Hive metastore is being shared by different Spark versions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/20640 @susanxhuynh I agree in case it is not enabled we can log failures as usual, but not for blacklisting as it is disabled it wouldnt make sense. User should have this option not to care. > In the case where an executor fails before entering Spark code (for example, Mesos agent failed to create the sandbox), would it be detected? Good question forgot to mention this. In this scenario a task status update will be given eg. [REASON_CONTAINER_LAUNCH_FAILED](https://github.com/apache/mesos/blob/5e5a8102c3281db25a37157dac123b0ca546e030/docs/task-state-reasons.md) This is done implicitly [here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L658) in status update which then calls [removeExecutor ](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L728) which then sends a message to drivers point to remove the executor and then this [line](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L330) is called which then will calls another helper [method](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L536), which calls this [one](ht tps://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L595) and in there blacklist info is updated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170097795 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -415,12 +418,14 @@ class MicroBatchExecution( case v1: SerializedOffset => reader.deserializeOffset(v1.json) case v2: OffsetV2 => v2 } - reader.setOffsetRange( -toJava(current), -Optional.of(availableV2)) + reader.setOffsetRange(toJava(current), Optional.of(availableV2)) logDebug(s"Retrieving data from $reader: $current -> $availableV2") - Some(reader -> -new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader)) + Some(reader -> StreamingDataSourceV2Relation( --- End diff -- Sounds like this isn't something that should hold up this commit then. Is there a design doc for what you're describing that I can read to familiarize myself with the issues here? I'd like to participate more on the streaming side as it relates to the new data source API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4135 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4135/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4134 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4134/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4133 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4133/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4132 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4132/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4131 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4131/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4130 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4130/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4129 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4129/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4128 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4128/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #4127 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4127/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170095169 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -415,12 +418,14 @@ class MicroBatchExecution( case v1: SerializedOffset => reader.deserializeOffset(v1.json) case v2: OffsetV2 => v2 } - reader.setOffsetRange( -toJava(current), -Optional.of(availableV2)) + reader.setOffsetRange(toJava(current), Optional.of(availableV2)) logDebug(s"Retrieving data from $reader: $current -> $availableV2") - Some(reader -> -new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader)) + Some(reader -> StreamingDataSourceV2Relation( --- End diff -- It's an artifact of the current implementation of streaming progress reporting, which assumes at a deep and hard to untangle level that new data is represented by a map of source -> logical plan. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user susanxhuynh commented on the issue: https://github.com/apache/spark/pull/20640 @skonto We should not remove the logging. The logging [here](https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L194) is only available if blacklisting is enabled, but by default blacklisting is disabled. The BlacklistTracker object [is not created](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L748) if blacklisting is disabled. But, we might still want to see the log of executor failure in this case. In the case where an executor fails before entering Spark code (for example, Mesos agent failed to create the sandbox), would it be detected? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170092263 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala --- @@ -17,15 +17,12 @@ package org.apache.spark.sql.streaming.continuous -import java.util.UUID --- End diff -- This is another cosmetic change that could cause conflicts. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170092091 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala --- @@ -492,16 +492,16 @@ class StreamSuite extends StreamTest { val explainWithoutExtended = q.explainInternal(false) // `extended = false` only displays the physical plan. - assert("StreamingDataSourceV2Relation".r.findAllMatchIn(explainWithoutExtended).size === 0) - assert("DataSourceV2Scan".r.findAllMatchIn(explainWithoutExtended).size === 1) + assert("Streaming Relation".r.findAllMatchIn(explainWithoutExtended).size === 0) + assert("Scan FakeDataSourceV2".r.findAllMatchIn(explainWithoutExtended).size === 1) --- End diff -- Why is this using the fake? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170091954 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -415,12 +418,14 @@ class MicroBatchExecution( case v1: SerializedOffset => reader.deserializeOffset(v1.json) case v2: OffsetV2 => v2 } - reader.setOffsetRange( -toJava(current), -Optional.of(availableV2)) + reader.setOffsetRange(toJava(current), Optional.of(availableV2)) logDebug(s"Retrieving data from $reader: $current -> $availableV2") - Some(reader -> -new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader)) + Some(reader -> StreamingDataSourceV2Relation( --- End diff -- I realize that this is a pre-existing problem, but why is it necessary to create a relation from a reader here? The addition of `FakeDataSourceV2` and the `readerToDataSourceMap` aren't unresonable because the relation should have a reference to the `DataSourceV2` instance, but I doubt that the relation should be created here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #87623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87623/testReport)** for PR 20622 at commit [`ae2d853`](https://github.com/apache/spark/commit/ae2d853fceb25c09efd772d3bb8802982bc86331). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20622 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20622 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87620/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23491][SS] Remove explicit job cancellation from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20622 **[Test build #87620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87620/testReport)** for PR 20622 at commit [`0e5e52f`](https://github.com/apache/spark/commit/0e5e52f2c6b934a372d098a0d7780da18d3f99e0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170090758 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -415,12 +418,14 @@ class MicroBatchExecution( case v1: SerializedOffset => reader.deserializeOffset(v1.json) case v2: OffsetV2 => v2 } - reader.setOffsetRange( -toJava(current), -Optional.of(availableV2)) + reader.setOffsetRange(toJava(current), Optional.of(availableV2)) --- End diff -- This is another style-only change that will cause conflicts. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170090323 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -77,31 +79,32 @@ class MicroBatchExecution( sparkSession.sqlContext.conf.disabledV2StreamingMicroBatchReaders.split(",") val _logicalPlan = analyzedPlan.transform { - case streamingRelation@StreamingRelation(dataSourceV1, sourceName, output) => -toExecutionRelationMap.getOrElseUpdate(streamingRelation, { + case s @ StreamingRelation(dsV1, sourceName, output) => --- End diff -- These renames are cosmetic, and unnecessary changes like this cause commit conflicts. I think these changes should be reverted. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170089913 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala --- @@ -26,7 +26,7 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] { override def apply( plan: LogicalPlan): LogicalPlan = plan transformUp { // PhysicalOperation guarantees that filters are deterministic; no need to check -case PhysicalOperation(project, newFilters, relation : DataSourceV2Relation) => +case PhysicalOperation(project, newFilters, relation: DataSourceV2Relation) => --- End diff -- While I'd rather not have this space there either (looks like an accident), there are no other changes to this file and this "fix" is not necessary. The risk of this causing commit conflicts outweighs the benefit of conforming to style so it should be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170089465 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -23,11 +23,11 @@ import org.apache.spark.sql.execution.SparkPlan object DataSourceV2Strategy extends Strategy { override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { -case relation: DataSourceV2Relation => - DataSourceV2ScanExec(relation.output, relation.reader) :: Nil +case r: DataSourceV2Relation => --- End diff -- Why rename this variable? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170089368 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -107,17 +106,24 @@ case class DataSourceV2Relation( } /** - * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwise identical - * to the non-streaming relation. + * A specialization of [[DataSourceV2Relation]] with the streaming bit set to true. + * + * Note that, this plan has a mutable reader, so Spark won't apply operator push-down for this plan, --- End diff -- I'm -0 on including this in a PR to improve explain results. This could needlessly cause commit conflicts when maintaining a branch. But, this is small and needs to go in somewhere. I would remove it, though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20650: [SPARK-23408][SS] Synchronize successive AddData actions...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20650 All the failures above can be attributed to other flakiness unrelated to the flakiness this PR trying to address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r170088798 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -35,15 +35,14 @@ case class DataSourceV2Relation( options: Map[String, String], projection: Seq[AttributeReference], filters: Option[Seq[Expression]] = None, -userSpecifiedSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { +userSpecifiedSchema: Option[StructType] = None) + extends LeafNode with MultiInstanceRelation with DataSourceV2QueryPlan { --- End diff -- Extending `DataSourceV2QueryPlan` changes the definition of equality, which I don't think is correct. At a minimum, I would say that two relations are equal if they produce the same sequence of records. Equality as implemented in this PR would allow completely different folders of data to be considered equal as long as they produce the same schema and have the same filters. In general, I don't see the utility of `DataSourceV2QueryPlan` for the purpose of this PR, which is to improve explain results. This doesn't use the source name if it is named, which is a regression in the explain results. It also doesn't indicate that the source is v2. And finally, it doesn't show the most important part of the scan, which is where the data comes from. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org