[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315812795 ## File path: core/src/main/scala/org/apache/spark/SparkContext.scala ## @@ -2851,6 +2851,9 @@ object SparkContext extends Logging { memoryPerSlaveInt, sc.executorMemory)) } +// for local cluster mode the SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED defaults to false Review comment: The comment just repeats the code. The comment should say why this is needed / wanted. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315833901 ## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ## @@ -358,12 +382,65 @@ final class ShuffleBlockFetcherIterator( } } + + /** + * Fetch the host-local blocks while we are fetching remote blocks. This is ok because + * `ManagedBuffer`'s memory is allocated lazily when we create the input stream, so all we + * track in-memory are the ManagedBuffer references themselves. + */ + private[this] def fetchHostLocalBlocks() { +logDebug(s"Start fetching host-local blocks: ${hostLocalBlocks.mkString(", ")}") +val hostLocalExecutorIds = hostLocalBlocksByExecutor.keySet.map(_.executorId) +val readsWithoutLocalDir = LinkedHashMap[BlockManagerId, Seq[(BlockId, Long)]]() +val localDirsByExec = blockManager.getHostLocalDirs(hostLocalExecutorIds.toArray) +hostLocalBlocksByExecutor.foreach { case (bmId, blockInfos) => + val localDirs = localDirsByExec.get(bmId.executorId) + if (localDirs.isDefined) { +blockInfos.foreach { case (blockId, _) => + try { +val buf = blockManager + .getHostLocalShuffleData(blockId.asInstanceOf[ShuffleBlockId], localDirs.get) +shuffleMetrics.incLocalBlocksFetched(1) +shuffleMetrics.incLocalBytesRead(buf.size) +buf.retain() +results.put(SuccessFetchResult(blockId, blockManager.blockManagerId, + buf.size(), buf, isNetworkReqDone = false)) + } catch { +case e: Exception => + // If we see an exception, stop immediately. + logError(s"Error occurred while fetching local blocks", e) + results.put(FailureFetchResult(blockId, blockManager.blockManagerId, e)) + return + } +} + } else { +readsWithoutLocalDir += bmId -> blockInfos + } +} + +if (readsWithoutLocalDir.nonEmpty) { + val collectedRemoteRequests = new ArrayBuffer[FetchRequest] + readsWithoutLocalDir.foreach( { case (bmId, blockInfos) => Review comment: `.foreach { case (bmId, blockInfos) =>` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315826558 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ## @@ -995,6 +1003,19 @@ private[spark] class BlockManager( None } + private[spark] def getHostLocalDirs(executorIds: Array[String]) + : scala.collection.Map[String, Array[String]] = { +val cachedItems = executorIdToLocalDirsCache.filterKeys(executorIds.contains(_)) +if (cachedItems.size < executorIds.length) { + val notCachedItems = master + .getHostLocalDirs(executorIds.filterNot(executorIdToLocalDirsCache.contains)) + executorIdToLocalDirsCache ++= notCachedItems Review comment: Isn't this going to grow potentially unbounded? You added a cache to `BlockManagerMasterEndpoint` but that's something that only exists in the driver. So it doesn't look like there's anything limiting the number of entries cached in executors, at least directly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315829047 ## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ## @@ -74,16 +76,22 @@ final class ShuffleBlockFetcherIterator( maxReqSizeShuffleToMem: Long, detectCorrupt: Boolean, detectCorruptUseExtraMemory: Boolean, -shuffleMetrics: ShuffleReadMetricsReporter) +shuffleMetrics: ShuffleReadMetricsReporter, +enableHostLocalDiskReading: Boolean = true) extends Iterator[(BlockId, InputStream)] with DownloadFileManager with Logging { import ShuffleBlockFetcherIterator._ + // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them + // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5 + // nodes, rather than blocking on reading output from one node. + val targetRemoteRequestSize = math.max(maxBytesInFlight / 5, 1L) Review comment: `private`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315833073 ## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ## @@ -358,12 +382,65 @@ final class ShuffleBlockFetcherIterator( } } + + /** + * Fetch the host-local blocks while we are fetching remote blocks. This is ok because + * `ManagedBuffer`'s memory is allocated lazily when we create the input stream, so all we + * track in-memory are the ManagedBuffer references themselves. + */ + private[this] def fetchHostLocalBlocks() { Review comment: missing return type This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315831188 ## File path: project/MimaExcludes.scala ## @@ -36,6 +36,9 @@ object MimaExcludes { // Exclude rules for 3.0.x lazy val v30excludes = v24excludes ++ Seq( +// [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.status.LiveEntityHelpers.createMetrics"), Review comment: Is this still needed? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315815432 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ## @@ -995,6 +1003,19 @@ private[spark] class BlockManager( None } + private[spark] def getHostLocalDirs(executorIds: Array[String]) + : scala.collection.Map[String, Array[String]] = { Review comment: Just `Map`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
vanzin commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host URL: https://github.com/apache/spark/pull/25299#discussion_r315851620 ## File path: core/src/test/scala/org/apache/spark/ExternalShuffleServiceSuite.scala ## @@ -96,6 +99,39 @@ class ExternalShuffleServiceSuite extends ShuffleSuite with BeforeAndAfterAll wi e.getMessage should include ("Fetch failure will not retry stage due to testing config") } + test("SPARK-27651: host local disk reading avoids external shuffle service on the same node") { +val confWithHostLocalRead = + conf.clone.set(config.SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED, true) +sc = new SparkContext("local-cluster[2,1,1024]", "test", confWithHostLocalRead) +sc.getConf.get(config.SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED) should equal(true) +sc.env.blockManager.externalShuffleServiceEnabled should equal(true) +sc.env.blockManager.externalShuffleServiceEnabled should equal(true) Review comment: Duplicate This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories
AmplabJenkins removed a comment on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories URL: https://github.com/apache/spark/pull/25514#issuecomment-523150908 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dlstadther commented on issue #25513: [MINOR] [DOC] Fix Dockerfile path reference
dlstadther commented on issue #25513: [MINOR] [DOC] Fix Dockerfile path reference URL: https://github.com/apache/spark/pull/25513#issuecomment-523152020 @dongjoon-hyun what `Apache Spark 2.4.3` directory are you referring to? The source code download from Apache's website includes the same directory structure (includes `resource-managers`, but no `kubernetes` directories). BUT the `running-on-kubernetes.md` file contained in that release does not include these two lines. Thus, it still appears the version in `master` is not up-to-date. Could you please direct me to where you see this `kubernetes` dir in 2.4.3? Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API
SparkQA commented on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API URL: https://github.com/apache/spark/pull/25354#issuecomment-523151807 **[Test build #109429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109429/testReport)** for PR 25354 at commit [`17fe1fd`](https://github.com/apache/spark/commit/17fe1fd4439adf821d0ae0fa5ac8b0387fb0a32a). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories
SparkQA commented on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories URL: https://github.com/apache/spark/pull/25514#issuecomment-523151801 **[Test build #109428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109428/testReport)** for PR 25514 at commit [`97bcf5b`](https://github.com/apache/spark/commit/97bcf5b810c4c312b9d8aedd90b0d8abd492c27c). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API
AmplabJenkins removed a comment on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API URL: https://github.com/apache/spark/pull/25354#issuecomment-523151011 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API
AmplabJenkins removed a comment on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API URL: https://github.com/apache/spark/pull/25354#issuecomment-523151020 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14492/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories
AmplabJenkins removed a comment on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories URL: https://github.com/apache/spark/pull/25514#issuecomment-523150670 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API
AmplabJenkins commented on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API URL: https://github.com/apache/spark/pull/25354#issuecomment-523151020 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14492/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories
AmplabJenkins commented on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories URL: https://github.com/apache/spark/pull/25514#issuecomment-523150908 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API
AmplabJenkins commented on issue #25354: [SPARK-28612][SQL] Add DataFrameWriterV2 API URL: https://github.com/apache/spark/pull/25354#issuecomment-523151011 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories
AmplabJenkins commented on issue #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories URL: https://github.com/apache/spark/pull/25514#issuecomment-523150670 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shrutig opened a new pull request #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories
shrutig opened a new pull request #25514: [SPARK-28784]StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories URL: https://github.com/apache/spark/pull/25514 ### What changes were proposed in this pull request? After PR https://github.com/apache/spark/pull/21048, the CheckpointFileManager interface was created to handle all structured streaming checkpointing operations and helps users to choose how they wish to write checkpointing files atomically. StreamExecution and StreamingQueryManager still uses some FileSystem operations without using the CheckpointFileManager. For instance, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L137 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L392 Instead, StreamExecution and StreamingQueryManager should use CheckpointFileManager for these operations. ### Why are the changes needed? This change will allow users to use CheckpointFileManager for structured streaming checkpointing files without need for a separate FileSystem implementation for the same. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] brkyvz commented on a change in pull request #25348: [SPARK-28554][SQL] Adds a v1 fallback writer implementation for v2 data source codepaths
brkyvz commented on a change in pull request #25348: [SPARK-28554][SQL] Adds a v1 fallback writer implementation for v2 data source codepaths URL: https://github.com/apache/spark/pull/25348#discussion_r315851494 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V1FallbackWriters.scala ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import scala.collection.JavaConverters._ + +import org.apache.spark.SparkException +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{Dataset, SaveMode} +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.sources.{AlwaysTrue, CreatableRelationProvider, Filter, InsertableRelation} +import org.apache.spark.sql.sources.v2.Table +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.util.CaseInsensitiveStringMap + +/** + * Physical plan node for append into a v2 table using V1 write interfaces. + * + * Rows in the output data set are appended. + */ +case class AppendDataExecV1( +writeBuilder: V1WriteBuilder, +plan: LogicalPlan) extends V1FallbackWriters { Review comment: I think that'd be great to have when looking at explain plans This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#discussion_r315845844 ## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala ## @@ -433,4 +441,42 @@ private[kafka010] object KafkaOffsetReader { StructField("timestamp", TimestampType), StructField("timestampType", IntegerType) )) + + val schemaWithHeaders = { +new StructType(schemaWithoutHeaders.fields :+ StructField("headers", headersType)) + } + + def kafkaSchema(includeHeaders: Boolean): StructType = { +if (includeHeaders) schemaWithHeaders else schemaWithoutHeaders + } + + def toInternalRowWithoutHeaders: Record => InternalRow = +(cr: Record) => InternalRow( + cr.key, cr.value, UTF8String.fromString(cr.topic), cr.partition, cr.offset, + DateTimeUtils.fromJavaTimestamp(new java.sql.Timestamp(cr.timestamp)), cr.timestampType.id +) + + def toInternalRowWithHeaders: Record => InternalRow = +(cr: Record) => InternalRow( + cr.key, cr.value, UTF8String.fromString(cr.topic), cr.partition, cr.offset, + DateTimeUtils.fromJavaTimestamp(new java.sql.Timestamp(cr.timestamp)), cr.timestampType.id, + if (cr.headers.iterator().hasNext) { +new GenericArrayData(cr.headers.iterator().asScala + .map(header => +InternalRow(UTF8String.fromString(header.key()), header.value()) + ).toArray) + } else { +null + } +) + + def toUnsafeRowWithoutHeadersProjector: Record => UnsafeRow = +(cr: Record) => UnsafeProjection.create(schemaWithoutHeaders)(toInternalRowWithoutHeaders(cr)) Review comment: This will create a `UnsafeProjection` for each record. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#discussion_r312267713 ## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala ## @@ -432,4 +441,42 @@ private[kafka010] object KafkaOffsetReader { StructField("timestamp", TimestampType), StructField("timestampType", IntegerType) )) + + val schemaWithHeaders = { +new StructType(schemaWithoutHeaders.fields :+ StructField("headers", headersType)) + } + + def kafkaSchema(includeHeaders: Boolean): StructType = { +if (includeHeaders) schemaWithHeaders else schemaWithoutHeaders + } + + def toInternalRowWithoutHeaders: Record => InternalRow = Review comment: This should be a `val`. Otherwise, it will be called once for each record in the closure returned by `toUnsafeRowWithoutHeadersProjector`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#discussion_r312267723 ## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala ## @@ -432,4 +441,42 @@ private[kafka010] object KafkaOffsetReader { StructField("timestamp", TimestampType), StructField("timestampType", IntegerType) )) + + val schemaWithHeaders = { +new StructType(schemaWithoutHeaders.fields :+ StructField("headers", headersType)) + } + + def kafkaSchema(includeHeaders: Boolean): StructType = { +if (includeHeaders) schemaWithHeaders else schemaWithoutHeaders + } + + def toInternalRowWithoutHeaders: Record => InternalRow = +(cr: Record) => InternalRow( + cr.key, cr.value, UTF8String.fromString(cr.topic), cr.partition, cr.offset, + DateTimeUtils.fromJavaTimestamp(new java.sql.Timestamp(cr.timestamp)), cr.timestampType.id +) + + def toInternalRowWithHeaders: Record => InternalRow = Review comment: ditto This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
zsxwing commented on a change in pull request #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming URL: https://github.com/apache/spark/pull/22282#discussion_r315845897 ## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala ## @@ -433,4 +441,42 @@ private[kafka010] object KafkaOffsetReader { StructField("timestamp", TimestampType), StructField("timestampType", IntegerType) )) + + val schemaWithHeaders = { +new StructType(schemaWithoutHeaders.fields :+ StructField("headers", headersType)) + } + + def kafkaSchema(includeHeaders: Boolean): StructType = { +if (includeHeaders) schemaWithHeaders else schemaWithoutHeaders + } + + def toInternalRowWithoutHeaders: Record => InternalRow = +(cr: Record) => InternalRow( + cr.key, cr.value, UTF8String.fromString(cr.topic), cr.partition, cr.offset, + DateTimeUtils.fromJavaTimestamp(new java.sql.Timestamp(cr.timestamp)), cr.timestampType.id +) + + def toInternalRowWithHeaders: Record => InternalRow = +(cr: Record) => InternalRow( + cr.key, cr.value, UTF8String.fromString(cr.topic), cr.partition, cr.offset, + DateTimeUtils.fromJavaTimestamp(new java.sql.Timestamp(cr.timestamp)), cr.timestampType.id, + if (cr.headers.iterator().hasNext) { +new GenericArrayData(cr.headers.iterator().asScala + .map(header => +InternalRow(UTF8String.fromString(header.key()), header.value()) + ).toArray) + } else { +null + } +) + + def toUnsafeRowWithoutHeadersProjector: Record => UnsafeRow = +(cr: Record) => UnsafeProjection.create(schemaWithoutHeaders)(toInternalRowWithoutHeaders(cr)) + + def toUnsafeRowWithHeadersProjector: Record => UnsafeRow = +(cr: Record) => UnsafeProjection.create(schemaWithHeaders)(toInternalRowWithHeaders(cr)) Review comment: ditto This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #25513: [MINOR] [DOC] Fix Dockerfile path reference
srowen closed pull request #25513: [MINOR] [DOC] Fix Dockerfile path reference URL: https://github.com/apache/spark/pull/25513 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on issue #25513: [MINOR] [DOC] Fix Dockerfile path reference
srowen commented on issue #25513: [MINOR] [DOC] Fix Dockerfile path reference URL: https://github.com/apache/spark/pull/25513#issuecomment-523144518 Oh, right, sorry I made the same mistake! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
AmplabJenkins removed a comment on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery URL: https://github.com/apache/spark/pull/24785#issuecomment-523143751 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109420/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
AmplabJenkins removed a comment on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery URL: https://github.com/apache/spark/pull/24785#issuecomment-523143747 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
AmplabJenkins commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery URL: https://github.com/apache/spark/pull/24785#issuecomment-523143751 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109420/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
AmplabJenkins commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery URL: https://github.com/apache/spark/pull/24785#issuecomment-523143747 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
SparkQA removed a comment on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery URL: https://github.com/apache/spark/pull/24785#issuecomment-523077607 **[Test build #109420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109420/testReport)** for PR 24785 at commit [`25240e5`](https://github.com/apache/spark/commit/25240e5c9d9a5a6825e4bbfac53267e75df8f52f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
SparkQA commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery URL: https://github.com/apache/spark/pull/24785#issuecomment-523142938 **[Test build #109420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109420/testReport)** for PR 24785 at commit [`25240e5`](https://github.com/apache/spark/commit/25240e5c9d9a5a6825e4bbfac53267e75df8f52f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523132993 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14491/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523132989 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523132993 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14491/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523132989 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523131216 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109427/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a change in pull request #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
ueshin commented on a change in pull request #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#discussion_r315829924 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ## @@ -1917,19 +1921,33 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession { null ).toDF("i") +// transform(i, x -> x + 1) +val resA = Seq( + Row(Seq(2, 10, 9, 8)), + Row(Seq(6, 9, 10, 8, 3)), + Row(Seq.empty), + Row(null)) + +// transform(i, (x, i) -> x + i) +val resB = Seq( + Row(Seq(1, 10, 10, 10)), + Row(Seq(5, 9, 11, 10, 6)), + Row(Seq.empty), + Row(null)) + def testArrayOfPrimitiveTypeNotContainsNull(): Unit = { - checkAnswer(df.selectExpr("transform(i, x -> x + 1)"), -Seq( - Row(Seq(2, 10, 9, 8)), - Row(Seq(6, 9, 10, 8, 3)), - Row(Seq.empty), - Row(null))) - checkAnswer(df.selectExpr("transform(i, (x, i) -> x + i)"), -Seq( - Row(Seq(1, 10, 10, 10)), - Row(Seq(5, 9, 11, 10, 6)), - Row(Seq.empty), - Row(null))) + checkAnswer(df.selectExpr("transform(i, x -> x + 1)"), resA) + checkAnswer(df.selectExpr("transform(i, (x, i) -> x + i)"), resB) + + checkAnswer(df.select(transform(col("i"), x => x + 1)), resA) + checkAnswer(df.select(transform(col("i"), (x, i) => x + i)), resB) + + checkAnswer(df.select(transform(col("i"), new JFunc { +def call(x: Column) = x + 1 + })), resA) + checkAnswer(df.select(transform(col("i"), new JFunc2 { +def call(x: Column, i: Column) = x + i + })), resB) Review comment: Could you move Java tests to `JavaDataFrameSuite` or create `JavaDataFrameFunctionsSuite`? Seems like Java compiler claims `error: reference to transform is ambiguous` because there are two overloaded functions `def transform(column: Column, f: Column => Column): Column` and `def transform(column: Column, f: JavaFunction[Column, Column])`, and the same for the others. I guess Java compiler can handle `Column => Column` so we don't need the overloaded function with `JavaFunction[Column, Column]`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523131203 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
SparkQA removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523130550 **[Test build #109427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109427/testReport)** for PR 25507 at commit [`6f02ddc`](https://github.com/apache/spark/commit/6f02ddc718a4ec5513d482237e0a1f694796e379). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523131203 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523131216 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109427/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
SparkQA commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523131193 **[Test build #109427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109427/testReport)** for PR 25507 at commit [`6f02ddc`](https://github.com/apache/spark/commit/6f02ddc718a4ec5513d482237e0a1f694796e379). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class RpcAbortException(message: String) extends Exception(message)` * `class CatalogManager(conf: SQLConf) extends Logging ` * `case class ReuseAdaptiveSubquery(` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
SparkQA commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523130550 **[Test build #109427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109427/testReport)** for PR 25507 at commit [`6f02ddc`](https://github.com/apache/spark/commit/6f02ddc718a4ec5513d482237e0a1f694796e379). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523129733 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109418/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523129728 Build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523129733 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109418/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
AmplabJenkins commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523129728 Build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
SparkQA removed a comment on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523077563 **[Test build #109418 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109418/testReport)** for PR 25507 at commit [`a911ee3`](https://github.com/apache/spark/commit/a911ee3e609c0888fc4a7e96dd8aa9cb0d649ca5). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
SparkQA commented on issue #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog URL: https://github.com/apache/spark/pull/25507#issuecomment-523129172 **[Test build #109418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109418/testReport)** for PR 25507 at commit [`a911ee3`](https://github.com/apache/spark/commit/a911ee3e609c0888fc4a7e96dd8aa9cb0d649ca5). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523123057 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109415/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523123049 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523123057 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109415/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523123049 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
SparkQA removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523031483 **[Test build #109415 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109415/testReport)** for PR 24715 at commit [`cc32b48`](https://github.com/apache/spark/commit/cc32b486a02abe0f1f077cf15765006906071dae). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
SparkQA commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523122339 **[Test build #109415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109415/testReport)** for PR 24715 at commit [`cc32b48`](https://github.com/apache/spark/commit/cc32b486a02abe0f1f077cf15765006906071dae). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#discussion_r315820103 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala ## @@ -38,8 +37,7 @@ class DataSourceV2SQLSuite extends QueryTest with SharedSparkSession with Before import org.apache.spark.sql.catalog.v2.CatalogV2Implicits._ - private val orc2 = classOf[OrcDataSourceV2].getName - private val v2Source = classOf[FakeV2Provider].getName + private val v2Format = classOf[FakeV2Provider].getName Review comment: Why rename `v2Source` to `v2Format`? That introduces several unnecessary changes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#discussion_r315819485 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala ## @@ -347,9 +346,16 @@ class DataSourceV2SQLSuite extends QueryTest with SharedSparkSession with Before test("ReplaceTableAsSelect: CREATE OR REPLACE new table has same behavior as CTAS.") { Seq("testcat", "testcat_atomic").foreach { catalogName => - spark.sql(s"CREATE TABLE $catalogName.created USING $orc2 AS SELECT id, data FROM source") spark.sql( -s"CREATE OR REPLACE TABLE $catalogName.replaced USING $orc2 AS SELECT id, data FROM source") +s""" + |CREATE TABLE $catalogName.created USING $v2Format + |AS SELECT id, data FROM source + """.stripMargin) + spark.sql( +s""" + |CREATE TABLE $catalogName.replaced USING $v2Format Review comment: This changed the SQL. The original query was `CREATE OR REPLACE`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#discussion_r315816400 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ## @@ -251,64 +251,62 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { assertNotBucketed("save") -val session = df.sparkSession -val cls = DataSource.lookupDataSource(source, session.sessionState.conf) -val canUseV2 = canUseV2Source(session, cls) && partitioningColumns.isEmpty - -// In Data Source V2 project, partitioning is still under development. -// Here we fallback to V1 if partitioning columns are specified. -// TODO(SPARK-26778): use V2 implementations when partitioning feature is supported. -if (canUseV2) { - val provider = cls.getConstructor().newInstance().asInstanceOf[TableProvider] - val sessionOptions = DataSourceV2Utils.extractSessionConfigs( -provider, session.sessionState.conf) - val options = sessionOptions ++ extraOptions - val dsOptions = new CaseInsensitiveStringMap(options.asJava) - - import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._ - provider.getTable(dsOptions) match { -// TODO (SPARK-27815): To not break existing tests, here we treat file source as a special -// case, and pass the save mode to file source directly. This hack should be removed. -case table: FileTable => - val write = table.newWriteBuilder(dsOptions).asInstanceOf[FileWriteBuilder] -.mode(modeForDSV1) // should not change default mode for file source. -.withQueryId(UUID.randomUUID().toString) -.withInputDataSchema(df.logicalPlan.schema) -.buildForBatch() - // The returned `Write` can be null, which indicates that we can skip writing. - if (write != null) { +lookupV2Provider() match { + // TODO(SPARK-26778): use V2 implementations when partition columns are specified + case Some(provider) if partitioningColumns.isEmpty => +saveToV2Source(provider) + + case _ => saveToV1Source() +} + } + + private def saveToV2Source(provider: TableProvider): Unit = { +val sessionOptions = DataSourceV2Utils.extractSessionConfigs( + provider, df.sparkSession.sessionState.conf) +val options = sessionOptions ++ extraOptions +val dsOptions = new CaseInsensitiveStringMap(options.asJava) + +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._ +provider.getTable(dsOptions) match { + // TODO (SPARK-27815): To not break existing tests, here we treat file source as a special + // case, and pass the save mode to file source directly. This hack should be removed. + case table: FileTable => Review comment: Is this case still needed, now that `lookupV2Provider` will not return `FileDataSourceV2`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
rdblue commented on a change in pull request #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#discussion_r315814833 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ## @@ -251,64 +251,62 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { assertNotBucketed("save") -val session = df.sparkSession -val cls = DataSource.lookupDataSource(source, session.sessionState.conf) -val canUseV2 = canUseV2Source(session, cls) && partitioningColumns.isEmpty - -// In Data Source V2 project, partitioning is still under development. -// Here we fallback to V1 if partitioning columns are specified. -// TODO(SPARK-26778): use V2 implementations when partitioning feature is supported. -if (canUseV2) { - val provider = cls.getConstructor().newInstance().asInstanceOf[TableProvider] - val sessionOptions = DataSourceV2Utils.extractSessionConfigs( -provider, session.sessionState.conf) - val options = sessionOptions ++ extraOptions - val dsOptions = new CaseInsensitiveStringMap(options.asJava) - - import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._ - provider.getTable(dsOptions) match { -// TODO (SPARK-27815): To not break existing tests, here we treat file source as a special -// case, and pass the save mode to file source directly. This hack should be removed. -case table: FileTable => - val write = table.newWriteBuilder(dsOptions).asInstanceOf[FileWriteBuilder] -.mode(modeForDSV1) // should not change default mode for file source. -.withQueryId(UUID.randomUUID().toString) -.withInputDataSchema(df.logicalPlan.schema) -.buildForBatch() - // The returned `Write` can be null, which indicates that we can skip writing. - if (write != null) { +lookupV2Provider() match { + // TODO(SPARK-26778): use V2 implementations when partition columns are specified + case Some(provider) if partitioningColumns.isEmpty => +saveToV2Source(provider) + + case _ => saveToV1Source() +} + } + + private def saveToV2Source(provider: TableProvider): Unit = { +val sessionOptions = DataSourceV2Utils.extractSessionConfigs( + provider, df.sparkSession.sessionState.conf) +val options = sessionOptions ++ extraOptions +val dsOptions = new CaseInsensitiveStringMap(options.asJava) + +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._ +provider.getTable(dsOptions) match { + // TODO (SPARK-27815): To not break existing tests, here we treat file source as a special + // case, and pass the save mode to file source directly. This hack should be removed. + case table: FileTable => +val write = table.newWriteBuilder(dsOptions).asInstanceOf[FileWriteBuilder] + .mode(modeForDSV1) // should not change default mode for file source. + .withQueryId(UUID.randomUUID().toString) + .withInputDataSchema(df.logicalPlan.schema) + .buildForBatch() +// The returned `Write` can be null, which indicates that we can skip writing. +if (write != null) { + runCommand(df.sparkSession, "save") { +WriteToDataSourceV2(write, df.logicalPlan) + } +} + + case table: SupportsWrite if table.supports(BATCH_WRITE) => +lazy val relation = DataSourceV2Relation.create(table, dsOptions) +modeForDSV2 match { + case SaveMode.Append => runCommand(df.sparkSession, "save") { - WriteToDataSourceV2(write, df.logicalPlan) + AppendData.byName(relation, df.logicalPlan) } - } -case table: SupportsWrite if table.supports(BATCH_WRITE) => - lazy val relation = DataSourceV2Relation.create(table, dsOptions) - modeForDSV2 match { -case SaveMode.Append => - runCommand(df.sparkSession, "save") { -AppendData.byName(relation, df.logicalPlan) - } - -case SaveMode.Overwrite if table.supportsAny(TRUNCATE, OVERWRITE_BY_FILTER) => - // truncate the table - runCommand(df.sparkSession, "save") { -OverwriteByExpression.byName(relation, df.logicalPlan, Literal(true)) - } - -case other => - throw new AnalysisException(s"TableProvider implementation $source cannot be " + -s"written with $other mode, please use Append or Overwrite " + -"modes instead.") - } + case SaveMode.Overwrite if table.supportsAny(TRUNCATE, OVERWRITE_BY_FILTER) => +// truncate the table +runCommand(df.sparkSession, "save") { + OverwriteByExpression.byName(relation,
[GitHub] [spark] viirya commented on a change in pull request #25499: [SPARK-28774] [SQL] Fix exchange reuse for columnar data
viirya commented on a change in pull request #25499: [SPARK-28774] [SQL] Fix exchange reuse for columnar data URL: https://github.com/apache/spark/pull/25499#discussion_r315813283 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala ## @@ -105,6 +120,17 @@ class ExchangeSuite extends SparkPlanTest with SharedSparkSession { assert(exchange5 sameResult exchange4) } + test ("Columnar exchange works") { Review comment: Remove space after `test`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #25499: [SPARK-28774] [SQL] Fix exchange reuse for columnar data
viirya commented on a change in pull request #25499: [SPARK-28774] [SQL] Fix exchange reuse for columnar data URL: https://github.com/apache/spark/pull/25499#discussion_r315813454 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala ## @@ -105,6 +120,17 @@ class ExchangeSuite extends SparkPlanTest with SharedSparkSession { assert(exchange5 sameResult exchange4) } + test ("Columnar exchange works") { +val df = spark.range(10) +val plan = df.queryExecution.executedPlan +assert(plan sameResult plan) Review comment: looks like a redundant assert? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523110224 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523110230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109413/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523110224 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
AmplabJenkins commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523110230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109413/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
SparkQA removed a comment on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523010743 **[Test build #109413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109413/testReport)** for PR 24715 at commit [`1e1378d`](https://github.com/apache/spark/commit/1e1378d4e1542e9bbd73d419e79ba4470365b7ec). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
SparkQA commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523109554 **[Test build #109413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109413/testReport)** for PR 24715 at commit [`1e1378d`](https://github.com/apache/spark/commit/1e1378d4e1542e9bbd73d419e79ba4470365b7ec). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] ` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523108004 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523108017 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14490/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315804479 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ## @@ -682,6 +699,28 @@ abstract class BaseSubqueryExec extends SparkPlan { override def outputPartitioning: Partitioning = child.outputPartitioning override def outputOrdering: Seq[SortOrder] = child.outputOrdering + + override def generateTreeString( +depth: Int, Review comment: @cloud-fan oops.. some how need to set up my ide to catch this :-).. will change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523108004 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523108017 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14490/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315804272 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectHashAggregateExec.scala ## @@ -19,6 +19,8 @@ package org.apache.spark.sql.execution.aggregate import java.util.concurrent.TimeUnit._ +import scala.collection.mutable Review comment: @cloud-fan will remove. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315804244 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression} +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.trees.TreeNodeTag + +object ExplainUtils { + /** + * Given a input physical plan, performs the following tasks. + * 1. Generate the plan -> operator id map. + * 2. Generate the plan -> codegen id map + * 3. Generate the two part explain output for this plan. + * 1. First part explains the operator tree with each operator tagged with an unique + * identifier. + * 2. Second part explans each operator in a verbose manner. + * + * Note : This function skips over subqueries. They are handled by its caller. + */ + private def processPlanSkippingSubqueries[T <: QueryPlan[T]]( + plan: => QueryPlan[T], + append: String => Unit, + startOperatorID: Int): Int = { + +// ReusedSubqueryExecs are skipped over +if (plan.isInstanceOf[BaseSubqueryExec]) { + return startOperatorID +} + +val operationIDs = new mutable.ArrayBuffer[(Int, QueryPlan[_])]() +var currentOperatorID = startOperatorID +try { + currentOperatorID = generateOperatorIDs(plan, currentOperatorID, operationIDs) + generateWholeStageCodegenIdMap(plan) + + QueryPlan.append( +plan, +append, +verbose = false, +addSuffix = false, +printOperatorId = true) + + append("\n") + var i: Integer = 0 + for ((opId, curPlan) <- operationIDs) { +append(curPlan.verboseStringWithOperatorId()) + } +} catch { + case e: AnalysisException => append(e.toString) +} +currentOperatorID + } + + /** + * Given a input physical plan, performs the following tasks. + * 1. Generates the explain output for the input plan excluding the subquery plans. + * 2. Generates the explain output for each subquery referenced in the plan. + */ + def processPlan[T <: QueryPlan[T]]( + plan: => QueryPlan[T], + append: String => Unit): Unit = { +try { + val subqueries = ArrayBuffer.empty[(SparkPlan, Expression, SparkPlan)] + var currentOperatorID = 0 + currentOperatorID = processPlanSkippingSubqueries(plan, append, currentOperatorID) + getSubqueries(plan, subqueries) + var i = 0 + + // Process all the subqueries in the plan. + for (sub <- subqueries) { +if (i == 0) { + append("\n= Subqueries =\n\n") +} +i = i + 1 +append(s"Subquery:$i Hosting operator id = " + + s"${getOpId(sub._1)} Hosting Expression = ${sub._2}\n") + +// For each subquery expression in the parent plan, process its child plan to compute +// the explain output. +currentOperatorID = processPlanSkippingSubqueries( + sub._3, + append, + currentOperatorID) +append("\n") + } +} finally { + removeTags(plan) +} + } + + /** + * Traverses the supplied input plan in a bottem-up fashion to produce the following two maps : + *1. operator -> operator identifier + *2. operator identifier -> operator + * Note : + *1. Operator such as WholeStageCodegenExec and InputAdapter are skipped as they don't + * appear in the explain output. + *2. operator identifier starts at startIdx + 1 + */ + private def generateOperatorIDs( + plan: QueryPlan[_], + startOperatorID: Int, + operatorIDs: mutable.ArrayBuffer[(Int, QueryPlan[_])]): Int = { +var currentOperationID = startOperatorID +// Skip the subqueries as they are not printed as part of main query block. +if
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315803314 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression} +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.trees.TreeNodeTag + +object ExplainUtils { + /** + * Given a input physical plan, performs the following tasks. + * 1. Generate the plan -> operator id map. + * 2. Generate the plan -> codegen id map + * 3. Generate the two part explain output for this plan. + * 1. First part explains the operator tree with each operator tagged with an unique + * identifier. + * 2. Second part explans each operator in a verbose manner. + * + * Note : This function skips over subqueries. They are handled by its caller. + */ + private def processPlanSkippingSubqueries[T <: QueryPlan[T]]( + plan: => QueryPlan[T], + append: String => Unit, + startOperatorID: Int): Int = { Review comment: @cloud-fan OK. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315803402 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression} +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.trees.TreeNodeTag + +object ExplainUtils { + /** + * Given a input physical plan, performs the following tasks. + * 1. Generate the plan -> operator id map. + * 2. Generate the plan -> codegen id map + * 3. Generate the two part explain output for this plan. + * 1. First part explains the operator tree with each operator tagged with an unique + * identifier. + * 2. Second part explans each operator in a verbose manner. + * + * Note : This function skips over subqueries. They are handled by its caller. + */ + private def processPlanSkippingSubqueries[T <: QueryPlan[T]]( + plan: => QueryPlan[T], + append: String => Unit, + startOperatorID: Int): Int = { + +// ReusedSubqueryExecs are skipped over +if (plan.isInstanceOf[BaseSubqueryExec]) { + return startOperatorID +} + +val operationIDs = new mutable.ArrayBuffer[(Int, QueryPlan[_])]() +var currentOperatorID = startOperatorID +try { + currentOperatorID = generateOperatorIDs(plan, currentOperatorID, operationIDs) + generateWholeStageCodegenIdMap(plan) + + QueryPlan.append( +plan, +append, +verbose = false, +addSuffix = false, +printOperatorId = true) + + append("\n") + var i: Integer = 0 + for ((opId, curPlan) <- operationIDs) { +append(curPlan.verboseStringWithOperatorId()) + } +} catch { + case e: AnalysisException => append(e.toString) +} +currentOperatorID + } + + /** + * Given a input physical plan, performs the following tasks. + * 1. Generates the explain output for the input plan excluding the subquery plans. + * 2. Generates the explain output for each subquery referenced in the plan. + */ + def processPlan[T <: QueryPlan[T]]( + plan: => QueryPlan[T], + append: String => Unit): Unit = { +try { + val subqueries = ArrayBuffer.empty[(SparkPlan, Expression, SparkPlan)] + var currentOperatorID = 0 + currentOperatorID = processPlanSkippingSubqueries(plan, append, currentOperatorID) + getSubqueries(plan, subqueries) + var i = 0 + + // Process all the subqueries in the plan. + for (sub <- subqueries) { +if (i == 0) { + append("\n= Subqueries =\n\n") +} +i = i + 1 +append(s"Subquery:$i Hosting operator id = " + + s"${getOpId(sub._1)} Hosting Expression = ${sub._2}\n") + +// For each subquery expression in the parent plan, process its child plan to compute +// the explain output. +currentOperatorID = processPlanSkippingSubqueries( + sub._3, + append, + currentOperatorID) +append("\n") + } +} finally { + removeTags(plan) +} + } + + /** + * Traverses the supplied input plan in a bottem-up fashion to produce the following two maps : + *1. operator -> operator identifier + *2. operator identifier -> operator + * Note : + *1. Operator such as WholeStageCodegenExec and InputAdapter are skipped as they don't + * appear in the explain output. + *2. operator identifier starts at startIdx + 1 + */ + private def generateOperatorIDs( + plan: QueryPlan[_], + startOperatorID: Int, + operatorIDs: mutable.ArrayBuffer[(Int, QueryPlan[_])]): Int = { +var currentOperationID = startOperatorID +// Skip the subqueries as they are not printed as part of main query block. +if
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315803233 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression} +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.trees.TreeNodeTag + +object ExplainUtils { + /** + * Given a input physical plan, performs the following tasks. + * 1. Generate the plan -> operator id map. + * 2. Generate the plan -> codegen id map + * 3. Generate the two part explain output for this plan. + * 1. First part explains the operator tree with each operator tagged with an unique + * identifier. + * 2. Second part explans each operator in a verbose manner. + * + * Note : This function skips over subqueries. They are handled by its caller. + */ + private def processPlanSkippingSubqueries[T <: QueryPlan[T]]( Review comment: @cloud-fan No.. don't need it.. i probably copied from somewhere :-) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315802736 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ## @@ -273,6 +288,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT } object QueryPlan extends PredicateHelper { + val opidTag = TreeNodeTag[Int]("operatorId") Review comment: @cloud-fan Will change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315802822 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ## @@ -273,6 +288,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT } object QueryPlan extends PredicateHelper { + val opidTag = TreeNodeTag[Int]("operatorId") + val codegenTag = new TreeNodeTag[Int]("wholeStageCodegenId") Review comment: @cloud-fan Will change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
dilipbiswal commented on a change in pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#discussion_r315803003 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -1512,6 +1512,16 @@ object SQLConf { "When this conf is not set, the value from `spark.redaction.string.regex` is used.") .fallbackConf(org.apache.spark.internal.config.STRING_REDACTION_PATTERN) + val SQL_EXPLAIN_LEGACY_FORMAT = Review comment: @cloud-fan Sounds good. Will remove the property. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
SparkQA commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523105687 **[Test build #109426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109426/testReport)** for PR 25465 at commit [`f93a8eb`](https://github.com/apache/spark/commit/f93a8eb609ebe12c7bbda2539c4574b1c9702050). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide
AmplabJenkins removed a comment on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide URL: https://github.com/apache/spark/pull/25136#issuecomment-523101240 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109411/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide
SparkQA removed a comment on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide URL: https://github.com/apache/spark/pull/25136#issuecomment-523001086 **[Test build #109411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109411/testReport)** for PR 25136 at commit [`8caf4a9`](https://github.com/apache/spark/commit/8caf4a95376623bd6f1e3e999c49be69e726852e). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide
AmplabJenkins removed a comment on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide URL: https://github.com/apache/spark/pull/25136#issuecomment-523101230 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide
AmplabJenkins commented on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide URL: https://github.com/apache/spark/pull/25136#issuecomment-523101240 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109411/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
SparkQA removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523004203 **[Test build #109412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109412/testReport)** for PR 25465 at commit [`830550d`](https://github.com/apache/spark/commit/830550d944ab47d0d7b7ff02228c11bfbc39658c). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide
AmplabJenkins commented on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide URL: https://github.com/apache/spark/pull/25136#issuecomment-523101230 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523100802 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109412/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins removed a comment on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523100793 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on issue #25368: [SPARK-28635][SQL] create CatalogManager to track registered v2 catalogs
rdblue commented on issue #25368: [SPARK-28635][SQL] create CatalogManager to track registered v2 catalogs URL: https://github.com/apache/spark/pull/25368#issuecomment-523101070 @cloud-fan, sorry for the delay reviewing this. +1 from me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523100802 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109412/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
AmplabJenkins commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523100793 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide
SparkQA commented on issue #25136: [SPARK-28322][SQL] Add support to Decimal type for integral divide URL: https://github.com/apache/spark/pull/25136#issuecomment-523100439 **[Test build #109411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109411/testReport)** for PR 25136 at commit [`8caf4a9`](https://github.com/apache/spark/commit/8caf4a95376623bd6f1e3e999c49be69e726852e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs
SparkQA commented on issue #25465: [SPARK-28747][SQL] merge the two data source v2 fallback configs URL: https://github.com/apache/spark/pull/25465#issuecomment-523100075 **[Test build #109412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109412/testReport)** for PR 25465 at commit [`830550d`](https://github.com/apache/spark/commit/830550d944ab47d0d7b7ff02228c11bfbc39658c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
SparkQA commented on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs URL: https://github.com/apache/spark/pull/24981#issuecomment-523099461 **[Test build #109425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109425/testReport)** for PR 24981 at commit [`dd1ffaf`](https://github.com/apache/spark/commit/dd1ffaf86f5d1a44fd812540b405e2eb8a21e6cf). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
AmplabJenkins removed a comment on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs URL: https://github.com/apache/spark/pull/24981#issuecomment-523098659 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
AmplabJenkins removed a comment on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs URL: https://github.com/apache/spark/pull/24981#issuecomment-523098665 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14489/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
AmplabJenkins commented on issue #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs URL: https://github.com/apache/spark/pull/24981#issuecomment-523098665 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14489/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org