[GitHub] spark issue #21501: [SPARK-15064][ML] Locale support in StopWordsRemover
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21501 @dongjinleekr You are welcome and thanks for your contribution! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21539 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3958/ Test PASSed.
[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21539 Merged build finished. Test PASSed.
[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21527 Merged build finished. Test FAILed.
[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21527 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91724/ Test FAILed.
[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21527 **[Test build #91724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91724/testReport)** for PR 21527 at commit [`4c8acfa`](https://github.com/apache/spark/commit/4c8acfa5899ccbdafeb630f38ce44b23332b80f2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21542: [WIP][SPARK-24529][Build] Add spotbugs into maven build ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21542 Merged build finished. Test PASSed.
[GitHub] spark issue #21542: [WIP][SPARK-24529][Build] Add spotbugs into maven build ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21542 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91720/ Test PASSed.
[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21539 **[Test build #91735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91735/testReport)** for PR 21539 at commit [`d5832f4`](https://github.com/apache/spark/commit/d5832f4b50d4c8f84feb462291d8da37e87b192f).
[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21539 Merged build finished. Test PASSed.
[GitHub] spark issue #21542: [WIP][SPARK-24529][Build] Add spotbugs into maven build ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21542 **[Test build #91720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91720/testReport)** for PR 21542 at commit [`3a356ad`](https://github.com/apache/spark/commit/3a356ada17a5cad00cb49fea20fb473a9e8392d1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21539 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/69/ Test PASSed.
[GitHub] spark issue #21539: [SPARK-24500][SQL] Make sure streams are materialized du...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21539 retest this please
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21543 Merged build finished. Test PASSed.
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21543 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91729/ Test PASSed.
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21543 **[Test build #91729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91729/testReport)** for PR 21543 at commit [`08461b4`](https://github.com/apache/spark/commit/08461b45bb4dc11e74976e6335269907476d02ca). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21082 Merged build finished. Test PASSed.
[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91718/ Test PASSed.
[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r194904013

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +182,131 @@ private[sql] object ArrowConverters {
   }

   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
    */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
       batchBytes: Array[Byte],
       allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch() // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeMessageBatch(new ReadChannel(Channels.newChannel(in)), allocator)
+      .asInstanceOf[ArrowRecordBatch] // throws IOException
   }

+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
       schemaString: String,
       sqlContext: SQLContext): DataFrame = {
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
       val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
     }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
     sqlContext.internalCreateDataFrame(rdd, schema)
   }

+  /**
+   * Read a file as an Arrow stream and return an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(sqlContext: SQLContext, filename: String):
+      JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
+    } finally {
+      fileStream.close()
+    }
+  }

+  /**
+   * Read an Arrow stream input and return an iterator of serialized ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): Iterator[Array[Byte]] = {
+
+    // TODO: simplify in super class
+    class RecordBatchMessageReader(inputChannel: SeekableByteChannel) {
--- End diff --

made https://issues.apache.org/jira/browse/ARROW-2704 to track
[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21082 **[Test build #91718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91718/testReport)** for PR 21082 at commit [`328b2c4`](https://github.com/apache/spark/commit/328b2c4e09502a66939d47d6967ceea7ceab6c8c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21539: [SPARK-24500][SQL] Make sure streams are material...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/21539#discussion_r194902636

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -679,6 +679,13 @@ class PlannerSuite extends SharedSQLContext {
     }
     assert(rangeExecInZeroPartition.head.outputPartitioning == UnknownPartitioning(0))
   }
+
+  test("SPARK-24500: create union with stream of children") {
+    val df = Union(Stream(
+      Range(1, 1, 1, 1),
+      Range(1, 2, 1, 1)))
+    df.queryExecution.executedPlan.execute()
--- End diff --

Yeah, it would throw an `UnsupportedOperationException` before.
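The regression exercised by this test comes from Scala's `Stream.map` being lazy: side effects inside the mapped function do not run until the stream is forced, so a mutation flag checked right after the map sees stale state. A minimal Python sketch of the same trap (illustrative only, not Spark code), using Python's equally lazy `map`:

```python
def transform_children(children, f, materialize):
    """Map f over children, tracking whether any child changed."""
    changed = False

    def apply(child):
        nonlocal changed
        new_child = f(child)
        if new_child != child:
            changed = True
        return new_child

    mapped = map(apply, children)  # lazy, like Scala's Stream.map
    if materialize:
        mapped = list(mapped)      # forces `apply` to actually run
    return mapped, changed

_, changed = transform_children([1, 2], lambda x: x + 1, materialize=False)
print(changed)  # → False: nothing has run yet, the flag is stale
_, changed = transform_children([1, 2], lambda x: x + 1, materialize=True)
print(changed)  # → True: materializing forced the mapped function
```

This mirrors why the fix materializes stream children before checking the `changed` flag in `mapChildren`.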
[GitHub] spark pull request #21539: [SPARK-24500][SQL] Make sure streams are material...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/21539#discussion_r194902354

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
@@ -301,6 +290,37 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
   def mapChildren(f: BaseType => BaseType): BaseType = {
     if (children.nonEmpty) {
       var changed = false
+      def mapChild(child: Any): Any = child match {
+        case arg: TreeNode[_] if containsChild(arg) =>
--- End diff --

Yeah, I was about to do that, but then I noticed that they handle different cases: `mapChild` handles `TreeNode` and `(TreeNode, TreeNode)`, whereas L326-L349 handles `TreeNode`, `Option[TreeNode]` and `Map[_, TreeNode]`. I am not sure combining them is useful, and if it is, I'd rather do it in a different PR.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21319 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91722/ Test PASSed.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21319 Merged build finished. Test PASSed.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21319 **[Test build #91722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91722/testReport)** for PR 21319 at commit [`91fdedc`](https://github.com/apache/spark/commit/91fdedc4d91a7abde5f6b64dbfcf354b67d89a48). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r194898976

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala ---
@@ -51,11 +51,11 @@ class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll {

   test("collect to arrow record batch") {
     val indexData = (1 to 6).toDF("i")
-    val arrowPayloads = indexData.toArrowPayload.collect()
-    assert(arrowPayloads.nonEmpty)
-    assert(arrowPayloads.length == indexData.rdd.getNumPartitions)
+    val arrowBatches = indexData.getArrowBatchRdd.collect()
+    assert(arrowBatches.nonEmpty)
+    assert(arrowBatches.length == indexData.rdd.getNumPartitions)
--- End diff --

Most of these changes are just renames to be consistent
[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r194898793

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +182,131 @@ private[sql] object ArrowConverters {
   }

   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
    */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
       batchBytes: Array[Byte],
       allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch() // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeMessageBatch(new ReadChannel(Channels.newChannel(in)), allocator)
+      .asInstanceOf[ArrowRecordBatch] // throws IOException
   }

+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
       schemaString: String,
       sqlContext: SQLContext): DataFrame = {
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
       val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
     }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
     sqlContext.internalCreateDataFrame(rdd, schema)
   }

+  /**
+   * Read a file as an Arrow stream and return an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(sqlContext: SQLContext, filename: String):
+      JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
+    } finally {
+      fileStream.close()
+    }
+  }

+  /**
+   * Read an Arrow stream input and return an iterator of serialized ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): Iterator[Array[Byte]] = {
+
+    // TODO: simplify in super class
+    class RecordBatchMessageReader(inputChannel: SeekableByteChannel) {
--- End diff --

I had to modify the existing Arrow code to allow for this, but I will work on getting these changes into Arrow for 0.10.0 and then this class can be simplified a lot.
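As context for `getBatchesFromStream`, the general shape of walking a stream of framed messages and collecting each frame's bytes can be sketched as follows. This uses a simplified 4-byte little-endian length prefix for illustration only; the real Arrow IPC framing carries separate metadata and body lengths, which is exactly why the custom reader above is needed:

```python
import io
import struct

def messages(stream):
    """Yield the payload of each length-prefixed message in the stream."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # stream exhausted
        (length,) = struct.unpack("<i", header)
        if length == 0:
            return  # a zero length marks end-of-stream
        yield stream.read(length)

# A stream with one 3-byte message followed by the end-of-stream marker:
buf = io.BytesIO(struct.pack("<i", 3) + b"abc" + struct.pack("<i", 0))
print(list(messages(buf)))  # → [b'abc']
```

In the Scala code, record-batch frames collected this way become the `Array[Byte]` elements that are then parallelized into an RDD.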
[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21515 **[Test build #91734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91734/testReport)** for PR 21515 at commit [`04f6371`](https://github.com/apache/spark/commit/04f6371a3aa0f03ba1c37b5b450bea69923388e9).
[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21515 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3957/ Test PASSed.
[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21515 Merged build finished. Test PASSed.
[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21515 Merged build finished. Test PASSed.
[GitHub] spark issue #21515: [SPARK-24372][build] Add scripts to help with preparing ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21515 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/68/ Test PASSed.
[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/67/ Test PASSed.
[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21546 **[Test build #91733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91733/testReport)** for PR 21546 at commit [`a5a1fbe`](https://github.com/apache/spark/commit/a5a1fbe7121c5b0dd93876a56c29ad17dcd9b168).
[GitHub] spark pull request #21515: [SPARK-24372][build] Add scripts to help with pre...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21515#discussion_r194897476

--- Diff: dev/create-release/vote.tmpl ---
@@ -0,0 +1,64 @@
+Please vote on releasing the following candidate as Apache Spark version {version}.
+
+The vote is open until {deadline} and passes if a majority of at least 3
++1 PMC votes are cast.
--- End diff --

The rules are a little bit more complicated than that. Updated.
[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Merged build finished. Test PASSed.
[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Merged build finished. Test PASSed.
[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3956/ Test PASSed.
[GitHub] spark pull request #21515: [SPARK-24372][build] Add scripts to help with pre...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21515#discussion_r194896688

--- Diff: dev/create-release/spark-rm/Dockerfile ---
@@ -0,0 +1,89 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Image for building Spark releases. Based on Ubuntu 16.04.
+#
+# Includes:
+# * Java 8
+# * Ivy
+# * Python/PyPandoc (2.7.12/3.5.2)
+# * R-base/R-base-dev (3.3.2+)
+# * Ruby 2.3 build utilities
+
+FROM ubuntu:16.04
+
+# These arguments are just for reuse and not really meant to be customized.
+ARG APT_INSTALL="apt-get install --no-install-recommends -y"
+
+# Install extra needed repos and refresh.
+# - CRAN repo
+# - Ruby repo (for doc generation)
+RUN echo 'deb http://cran.cnr.Berkeley.edu/bin/linux/ubuntu xenial/' >> /etc/apt/sources.list && \
+  gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 && \
+  gpg -a --export E084DAB9 | apt-key add - && \
+  apt-get clean && \
+  rm -rf /var/lib/apt/lists/* && \
+  apt-get clean && \
+  apt-get update && \
+  $APT_INSTALL software-properties-common && \
+  apt-add-repository -y ppa:brightbox/ruby-ng && \
+  apt-get update
+
+# Install openjdk 8.
+RUN $APT_INSTALL openjdk-8-jdk && \
+  update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
+
+# Install build / source control tools
+RUN $APT_INSTALL curl wget git maven ivy subversion make gcc libffi-dev \
+    pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev && \
+  ln -s -T /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar && \
+  curl -sL https://deb.nodesource.com/setup_4.x | bash && \
+  $APT_INSTALL nodejs
+
+# Install needed python packages. Use pip for installing packages (for consistency).
+ARG BASE_PIP_PKGS="setuptools wheel virtualenv"
+ARG PIP_PKGS="pyopenssl pypandoc numpy pygments sphinx"
+
+RUN $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip && \
+  pip install $BASE_PIP_PKGS && \
+  pip install $PIP_PKGS && \
+  cd && \
+  virtualenv -p python3 p35 && \
+  . p35/bin/activate && \
+  pip install $BASE_PIP_PKGS && \
+  pip install $PIP_PKGS
+
+# Install R packages and dependencies used when building.
+# R depends on pandoc*, libssl (which are installed above).
+RUN $APT_INSTALL r-base r-base-dev && \
+  $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \
+  Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='http://cran.us.r-project.org/')" && \
+  Rscript -e "devtools::install_github('jimhester/lintr')"
+
+# Install tools needed to build the documentation.
+RUN $APT_INSTALL ruby2.3 ruby2.3-dev && \
+  gem install jekyll --no-rdoc --no-ri && \
+  gem install jekyll-redirect-from && \
+  gem install pygments.rb
+
+WORKDIR /opt/spark-rm/output
+
+ARG UID
+RUN useradd -m -s /bin/bash -p spark-rm -u $UID spark-rm
--- End diff --

The gpg signature is completely unrelated to the user running the process. Also, Docker being the weird thing it is, the user name in the container doesn't really matter much. It's running with the same UID as the host user who built the image, which in this case is expected to be the person preparing the release, and this will only really matter for files that are written back to the host's file system.
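The `ARG UID` pattern discussed above is typically driven from the host like this. A sketch only: the `spark-rm` tag and context path are illustrative, not taken from this thread, and the `docker build` line is shown commented out since it needs a Docker daemon; the point is that passing the host UID makes files the container writes to bind-mounted host directories come out owned by the invoking user.

```shell
# Capture the host user's UID; the Dockerfile's `useradd -u $UID` then
# creates a container user with the same numeric id.
HOST_UID="$(id -u)"
echo "would build with UID=${HOST_UID}"
# docker build --build-arg UID="${HOST_UID}" -t spark-rm dev/create-release/spark-rm
```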
[GitHub] spark issue #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format ...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546 This is a WIP because I had to hack up some of the message processing code in Arrow. This should be done in Arrow, and then it can be cleaned up here. I will make these changes for version 0.10.0 and complete this once we have upgraded.
[GitHub] spark pull request #21546: [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream ...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/21546 [WIP][SPARK-23030][SQL][PYTHON] Use Arrow stream format for creating from and collecting Pandas DataFrames ## What changes were proposed in this pull request? This changes the calls of `toPandas()` and `createDataFrame()` to use the Arrow stream format, when Arrow is enabled. Previously, Arrow data was written to byte arrays where each chunk is an output of the Arrow file format. This was mainly due to constraints at the time, and caused some overhead by writing the schema/footer on each chunk of data and then having to read multiple Arrow file inputs and concat them together. Using the Arrow stream format has improved these by increasing performance, lower memory overhead for the average case, and simplified the code. Here are the details of this change: **toPandas()** _Before:_ Spark internal rows are converted to Arrow file format, each group of records is a complete Arrow file which contains the schema and other metadata. Next a collect is done and an Array of Arrow files is the result. After that each Arrow file is sent to Python driver which then loads each file and concats them to a single Arrow DataFrame. _After:_ Spark internal rows are converted to ArrowRecordBatches directly, which is the simplest Arrow component for IPC data transfers. The driver JVM then immediately starts serving data to Python as an Arrow stream, sending the schema first. It then starts Spark jobs with a custom handler such that when a partition is received (and in the correct order) the ArrowRecordBatches can be sent to python as soon as possible. This improves performance, simplifies memory usage on executors, and improves the average memory usage on the JVM driver. Since the order of partitions must be preserved, the worst case is that the first partition will be the last to arrive and all data must be kept in memory until then. This case is no worse that before when doing a full collect. 
**createDataFrame()** _Before:_ A Pandas DataFrame is split into parts and each part is made into an Arrow file. Then each file is prefixed by the buffer size and written to a temp file. The temp file is read and each Arrow file is parallelized as a byte array. _After:_ A Pandas DataFrame is split into parts, then an Arrow stream is written to a temp file where each part is an ArrowRecordBatch. The temp file is read as a stream and the Arrow messages are examined. If the message is an ArrowRecordBatch, the data is saved as a byte array. After reading the file, each ArrowRecordBatch is parallelized as a byte array. This has slightly more processing than before because we must look each Arrow message to extract the record batches, but performance remains the same. It is cleaner in the sense that IPC from Python to JVM is done over a single Arrow stream. ## How was this patch tested? Added new unit tests for the additions to ArrowConverters in Scala, existing tests for Python. You can merge this pull request into a Git repository by running: $ git pull https://github.com/BryanCutler/spark arrow-toPandas-stream-SPARK-23030 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21546.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21546 commit 9af482170ee95831bbda139e6e931ba2631df386 Author: Bryan Cutler Date: 2018-01-10T22:02:15Z change ArrowConverters to stream format commit d617f0da8eff1509da465bb707340e391314bec4 Author: Bryan Cutler Date: 2018-01-10T22:14:07Z Change ArrowSerializer to use stream format commit f10d5d9cd3cece7f56749e1de7fe01699e4759a0 Author: Bryan Cutler Date: 2018-01-12T00:40:36Z toPandas is working with RecordBatch payloads, using custom handler to stream ordered partitions commit 03653c687473b82bbfb6653504479498a2a3c63b Author: Bryan Cutler Date: 2018-02-10T00:23:17Z cleanup and removed ArrowPayload, 
createDataFrame now working commit 1b932463bca0815e79f3a8d61d1c816e62949698 Author: Bryan Cutler Date: 2018-03-09T00:14:06Z toPandas and createDataFrame working but tests fail with date cols commit ce22d8ad18e052d150528752b727c6cfe11485f7 Author: Bryan Cutler Date: 2018-03-27T00:32:03Z removed usage of seekableByteChannel commit dede0bd96921c439747a9176f24c9ecbb9c8ce0a Author: Bryan Cutler Date: 2018-03-28T00:28:54Z for toPandas, set old collection result to null and add comments commit 9e29b092cb7d45fa486db0215c3bd4a99c5f8d98 Author: Bryan Cutler Date: 2018-03-28T18:28:18Z cleanup, not yet passing ArrowConvertersSuite commit ceb8d38a6c83c3b6dae040c9e8d860811ecad0cc Author: Bryan Cutler Date: 2018-03-29T21:14:03Z fix
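The schema/footer-per-chunk overhead that motivated the switch can be illustrated with a toy framing (plain Python; this is not the real Arrow wire format, and all sizes are made up): the file-style layout wraps schema metadata around every chunk, while the stream-style layout writes the schema once and then only the batches.

```python
# Toy framing contrasting the two layouts the PR describes. Purely
# illustrative: real Arrow file/stream formats have richer message framing.

def file_style(schema, batches):
    # file-style: schema header and footer repeated around every chunk
    return b"".join(schema + batch + schema for batch in batches)

def stream_style(schema, batches):
    # stream-style: schema written once, then each batch as-is
    return schema + b"".join(batches)
```

With many small batches, the repeated metadata dominates, which is the overhead the stream format removes.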
[GitHub] spark pull request #21515: [SPARK-24372][build] Add scripts to help with pre...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21515#discussion_r194896168 --- Diff: dev/.rat-excludes --- @@ -106,3 +106,4 @@ spark-warehouse structured-streaming/* kafka-source-initial-offset-version-2.1.0.bin kafka-source-initial-offset-future-version.bin +vote.tmpl --- End diff -- Are you saying this file should not be packaged in the source release? Not sure I see why that would be the case. There's a lot of stuff in `.rat-excludes` that is still packaged. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194895594 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.Strategy +import org.apache.spark.sql.{execution, Strategy} +import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet} +import org.apache.spark.sql.catalyst.planning.PhysicalOperation import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec} object DataSourceV2Strategy extends Strategy { override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { -case r: DataSourceV2Relation => - DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil +case PhysicalOperation(project, filters, relation: DataSourceV2Relation) => + val projectSet = AttributeSet(project.flatMap(_.references)) + val filterSet = AttributeSet(filters.flatMap(_.references)) + + val projection = if (filterSet.subsetOf(projectSet) && + AttributeSet(relation.output) == projectSet) { +// When the required projection contains all of the filter columns and column pruning alone +// can produce the required projection, push the required projection. +// A final projection may still be needed if the data source produces a different column +// order or if it cannot prune all of the nested columns. +relation.output + } else { +// When there are filter columns not already in the required projection or when the required +// projection is more complicated than column pruning, base column pruning on the set of +// all columns needed by both. 
+(projectSet ++ filterSet).toSeq + } + + val reader = relation.newReader --- End diff -- Yea, the second proposal is what happens for the v1 data sources. For file-based data sources we kind of pick the third proposal and add an optimizer rule `PruneFileSourcePartitions` to push down some of the filters to the data source at the logical phase, to get precise stats. I'd like to pick from the 2nd and 3rd proposals (the 3rd proposal is also temporary, before we move stats to the physical plan). Applying pushdown twice is hard to work around (we would need to cache), while we can keep the `PruneFileSourcePartitions` rule to work around the issue in the 2nd proposal for file-based data sources. Let's also get more input from other people. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21531: [SPARK-24521][SQL][TEST] Fix ineffective test in CachedT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21531 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21050 **[Test build #91732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91732/testReport)** for PR 21050 at commit [`ba0d60f`](https://github.com/apache/spark/commit/ba0d60f2e7fd7de4e571d78af362b7f06935b3e8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21531: [SPARK-24521][SQL][TEST] Fix ineffective test in CachedT...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21531 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91719/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21050 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21050 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3955/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21050: [SPARK-23912][SQL]add array_distinct
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21050 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/66/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21531: [SPARK-24521][SQL][TEST] Fix ineffective test in CachedT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21531 **[Test build #91719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91719/testReport)** for PR 21531 at commit [`658539a`](https://github.com/apache/spark/commit/658539a71971bdea62e89afd11775bf6f51d766d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21050: [SPARK-23912][SQL]add array_distinct
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21050#discussion_r194893615 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -1882,3 +1883,134 @@ case class ArrayRepeat(left: Expression, right: Expression) } } + +/** + * Removes duplicate values from the array. + */ +@ExpressionDescription( + usage = "_FUNC_(array) - Removes duplicate values from the array.", + examples = """ +Examples: + > SELECT _FUNC_(array(1, 2, 3, null, 3)); + [1,2,3,null] + """, since = "2.4.0") +case class ArrayDistinct(child: Expression) + extends UnaryExpression with ExpectsInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType) + + override def dataType: DataType = child.dataType + + lazy val elementType: DataType = dataType.asInstanceOf[ArrayType].elementType + + override def nullSafeEval(array: Any): Any = { +val elementType = child.dataType.asInstanceOf[ArrayType].elementType +val data = array.asInstanceOf[ArrayData].toArray[AnyRef](elementType).distinct +new GenericArrayData(data.asInstanceOf[Array[Any]]) + } + + override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +nullSafeCodeGen(ctx, ev, (array) => { + val i = ctx.freshName("i") + val j = ctx.freshName("j") + val hs = ctx.freshName("hs") + val foundNullElement = ctx.freshName("foundNullElement") + val distinctArrayLen = ctx.freshName("distinctArrayLen") + val getValue = CodeGenerator.getValue(array, elementType, i) + val openHashSet = classOf[OpenHashSet[_]].getName + val classTag = s"scala.reflect.ClassTag$$.MODULE$$.Object()" + s""" + |int $distinctArrayLen = 0; + |boolean $foundNullElement = false; + |$openHashSet $hs = new $openHashSet($classTag); + |for (int $i = 0; $i < $array.numElements(); $i++) { + | if ($array.isNullAt($i)) { + |if (!($foundNullElement)) { + | $distinctArrayLen = $distinctArrayLen + 1; + | $foundNullElement = true; + |} + | } + | else { + |if 
(!($hs.contains($getValue))) { + | $hs.add($getValue); + | $distinctArrayLen = $distinctArrayLen + 1; + |} + | } + |} --- End diff -- @ueshin Thanks for your review. I made some changes based on your comments. Could you please review one more time? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
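The semantics that the generated code above implements can be sketched in plain Python (mirroring the `nullSafeEval` path, not the generated Java): keep the first occurrence of each element and at most one null, matching the `SELECT _FUNC_(array(1, 2, 3, null, 3))` example in the expression description.

```python
# Plain-Python sketch of array_distinct semantics: first occurrence of each
# element is kept, order is preserved, and at most one null survives.
def array_distinct(values):
    seen = set()
    found_null = False
    result = []
    for v in values:
        if v is None:
            if not found_null:
                result.append(None)
                found_null = True
        elif v not in seen:
            seen.add(v)
            result.append(v)
    return result
```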
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3807/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error in Sentr...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21398 @vanzin @cloud-fan. Going back to the facts. The fix/work-around in Cloudera's distro obviously exists because the distro integrates with Sentry anyway. Removing the location did fix the Spark Job at the customer side. To re-phrase the last request from @cloud-fan, if we remove the explicit location what are the consequences, when do you need the explicit location for managed tables? Can this be a default behavior in general? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21545 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3806/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3807/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/65/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21545 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3806/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3804/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194875888 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.Strategy +import org.apache.spark.sql.{execution, Strategy} +import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet} +import org.apache.spark.sql.catalyst.planning.PhysicalOperation import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec} object DataSourceV2Strategy extends Strategy { override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { -case r: DataSourceV2Relation => - DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil +case PhysicalOperation(project, filters, relation: DataSourceV2Relation) => + val projectSet = AttributeSet(project.flatMap(_.references)) + val filterSet = AttributeSet(filters.flatMap(_.references)) + + val projection = if (filterSet.subsetOf(projectSet) && + AttributeSet(relation.output) == projectSet) { +// When the required projection contains all of the filter columns and column pruning alone +// can produce the required projection, push the required projection. +// A final projection may still be needed if the data source produces a different column +// order or if it cannot prune all of the nested columns. +relation.output + } else { +// When there are filter columns not already in the required projection or when the required +// projection is more complicated than column pruning, base column pruning on the set of +// all columns needed by both. 
+(projectSet ++ filterSet).toSeq + } + + val reader = relation.newReader --- End diff -- I don't mind either option #1 or #2. #2 is basically what happens for non-v2 data sources right now. Plus, both should be temporary. I think it is a bad idea to continue with hacky code that uses the reader in the logical plan. It is much cleaner otherwise, and we've spent too much time making sure that everything still works. The main example that comes to mind is setting the requested projection and finding out what the output is when using pushdown. I think hacks are slowing progress on the v2 sources. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3954/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21545 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21545 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/64/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194871032 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.Strategy +import org.apache.spark.sql.{execution, Strategy} +import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet} +import org.apache.spark.sql.catalyst.planning.PhysicalOperation import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec} object DataSourceV2Strategy extends Strategy { override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { -case r: DataSourceV2Relation => - DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil +case PhysicalOperation(project, filters, relation: DataSourceV2Relation) => + val projectSet = AttributeSet(project.flatMap(_.references)) + val filterSet = AttributeSet(filters.flatMap(_.references)) + + val projection = if (filterSet.subsetOf(projectSet) && + AttributeSet(relation.output) == projectSet) { +// When the required projection contains all of the filter columns and column pruning alone +// can produce the required projection, push the required projection. +// A final projection may still be needed if the data source produces a different column +// order or if it cannot prune all of the nested columns. +relation.output + } else { +// When there are filter columns not already in the required projection or when the required +// projection is more complicated than column pruning, base column pruning on the set of +// all columns needed by both. 
+(projectSet ++ filterSet).toSeq + } + + val reader = relation.newReader --- End diff -- OK, we need to make a decision here:
1. Apply operator pushdown twice (proposed by this PR). This moves the pushdown logic to the planner, which is more ideal and cleaner. The drawback is that, before moving statistics to the physical plan, we have some duplicated pushdown code in `DataSourceV2Relation`, and applying pushdown twice has a performance penalty.
2. Apply operator pushdown only once, in the planner. Same as 1, it's cleaner. The drawback is that, before moving statistics to the physical plan, data source v2 can't report statistics after filter.
3. Apply operator pushdown only once, in the optimizer (proposed by https://github.com/apache/spark/pull/21319). It has no performance penalty and we can report statistics after filter. The drawback is that, before moving statistics to the physical plan, we have a temporary `DataSourceReader` in `DataSourceV2Relation`, which is hacky.
The tradeoff is: shall we bear with hacky code and move forward with data source v2 operator pushdown support? Or shall we make the code cleaner and bear with some performance penalty (apply pushdown twice, or not report stats after filter)? Or shall we just hold back and think about how to move stats to the physical plan? cc @marmbrus @jose-torres --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3951/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21545 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3953/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21545 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21543 cc @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3804/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3803/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21366 **[Test build #91731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91731/testReport)** for PR 21366 at commit [`8b0a211`](https://github.com/apache/spark/commit/8b0a211f4c5046065b0e2abcac0ee71119ff6bd6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21545: [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21545 **[Test build #91730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91730/testReport)** for PR 21545 at commit [`00d9850`](https://github.com/apache/spark/commit/00d9850698350abba2c6ac10a1c753e332e47751). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: [SPARK-24479][SS] Added config for registering streaming...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21504 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91715/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: [SPARK-24479][SS] Added config for registering streaming...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21504 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: [SPARK-24479][SS] Added config for registering streaming...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21504 **[Test build #91715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91715/testReport)** for PR 21504 at commit [`b87c90b`](https://github.com/apache/spark/commit/b87c90b6bb356e0faaf8230cb6b20fbcdd65c858). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21545: [SPARK-23010][BUILD] Fix java checkstyle failure ...
GitHub user jiangxb1987 opened a pull request: https://github.com/apache/spark/pull/21545 [SPARK-23010][BUILD] Fix java checkstyle failure of kubernetes-integration-tests ## What changes were proposed in this pull request? Fix java checkstyle failure of kubernetes-integration-tests ## How was this patch tested? Checked manually on my local environment. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jiangxb1987/spark k8s-checkstyle Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21545.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21545 commit 00d9850698350abba2c6ac10a1c753e332e47751 Author: Xingbo Jiang Date: 2018-06-12T19:39:50Z fix java checkstyle failure broken by kubernetes integration tests --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21544: add one supported type missing from the javadoc
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21544 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21529: [SPARK-24495][SQL] EnsureRequirement returns wron...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21529#discussion_r194863357 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala --- @@ -679,6 +679,17 @@ class PlannerSuite extends SharedSQLContext { } assert(rangeExecInZeroPartition.head.outputPartitioning == UnknownPartitioning(0)) } + + test("SPARK-24495: EnsureRequirements can return wrong plan when reusing the same key in join") { +withSQLConf(("spark.sql.shuffle.partitions", "1"), + ("spark.sql.constraintPropagation.enabled", "false"), + ("spark.sql.autoBroadcastJoinThreshold", "-1")) { + val df1 = spark.range(100).repartition(2, $"id", $"id") --- End diff -- no, the issue can happen with range partition(because of the double transformation issue), the code in the ticket can reproduce the bug and it has no hash partitioning. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21544: add one supported type missing from the javadoc
GitHub user yuj opened a pull request: https://github.com/apache/spark/pull/21544 add one supported type missing from the javadoc ## What changes were proposed in this pull request? The supported java.math.BigInteger type is not mentioned in the javadoc of Encoders.bean() ## How was this patch tested? only Javadoc fix Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yuj/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21544.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21544 commit 635378aa616f535ee77a3ac9d4236d54c7ef093e Author: James Yu Date: 2018-06-12T19:32:19Z add one supported type missing from the javadoc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21258 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21529: [SPARK-24495][SQL] EnsureRequirement returns wrong plan ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21529 > I think we need to support both. testing different physical operators needs same result, testing something like type coercion mode needs different result. Anyway let's discuss it in the followup. ok @cloud-fan, I'll try and send a proposal in the next days. > @mgaido91 Could you help improve the test coverage of joins org.apache.spark.sql.JoinSuite? Due to the incomplete test case coverage, we did not discover this at the very beginning. We need to add more test cases to cover different join algorithms. Sure, @gatorsmile, I am happy to. Do you mean running the existing tests for every type of join or do you have something different in mind? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21543 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/63/ Test PASSed.
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21543 Merged build finished. Test PASSed.
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21543 **[Test build #91729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91729/testReport)** for PR 21543 at commit [`08461b4`](https://github.com/apache/spark/commit/08461b45bb4dc11e74976e6335269907476d02ca).
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Merged build finished. Test PASSed.
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/62/ Test PASSed.
[GitHub] spark pull request #21067: [SPARK-23980][K8S] Resilient Spark driver on Kube...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21067#discussion_r194862373

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala ---
@@ -67,12 +68,19 @@ private[spark] class BasicExecutorFeatureStep(
   }

   private val executorLimitCores = kubernetesConf.get(KUBERNETES_EXECUTOR_LIMIT_CORES)

-  override def configurePod(pod: SparkPod): SparkPod = {
-    val name = s"$executorPodNamePrefix-exec-${kubernetesConf.roleSpecificConf.executorId}"
+  // If the driver pod is killed, the new driver pod will try to
+  // create new executors with the same name, but it will fail
+  // and hang indefinitely because a terminating executor blocks
+  // the creation of the new ones, so to avoid that apply a salt.
+  private val executorNameSalt = Random.alphanumeric.take(4).mkString("").toLowerCase

--- End diff --

Four characters do not have enough randomness to be a reliable salt, which is why I suggest an alternative to avoid collisions.
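For intuition on the reviewer's point, a birthday-bound estimate of the collision probability can be sketched. This assumes the salt alphabet collapses to the 36 lowercase alphanumerics (which is what `Random.alphanumeric.take(4)...toLowerCase` yields); the helper below is illustrative, not Spark code.

```scala
object SaltCollision {
  // 4 positions over a 36-symbol alphabet (a-z, 0-9): 36^4 = 1,679,616 salts.
  val saltSpace: Double = math.pow(36, 4)

  // Birthday-bound approximation: probability that at least two of n
  // independently drawn salts collide.
  def collisionProb(n: Int): Double =
    1.0 - math.exp(-n.toDouble * (n - 1) / (2.0 * saltSpace))
}
```

Even a few hundred concurrently named executors make a collision plausible (roughly 3% at n = 300 and over 25% at n = 1000), which supports replacing the salt with a deterministic scheme.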
[GitHub] spark pull request #21067: [SPARK-23980][K8S] Resilient Spark driver on Kube...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21067#discussion_r194862246

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala ---
@@ -67,12 +68,19 @@ private[spark] class BasicExecutorFeatureStep(
   }

   private val executorLimitCores = kubernetesConf.get(KUBERNETES_EXECUTOR_LIMIT_CORES)

-  override def configurePod(pod: SparkPod): SparkPod = {
-    val name = s"$executorPodNamePrefix-exec-${kubernetesConf.roleSpecificConf.executorId}"
+  // If the driver pod is killed, the new driver pod will try to
+  // create new executors with the same name, but it will fail
+  // and hang indefinitely because a terminating executor blocks
+  // the creation of the new ones, so to avoid that apply a salt.
+  private val executorNameSalt = Random.alphanumeric.take(4).mkString("").toLowerCase

--- End diff --

We should just have the executor pod name be `s"$applicationId-exec-$executorId"` then. I don't think the pod name prefix has to strictly be tied to the application name. The application name should be applied as a label so the executor pod can be located with it when using the dashboard, kubectl, etc.
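The deterministic scheme suggested here can be sketched as a plain string builder (a hypothetical helper, not the actual step code). Uniqueness then comes from the executor ID being unique within an application, rather than from randomness:

```scala
object ExecutorPodNames {
  // Kubernetes object names must be DNS-1123 subdomains: lowercase
  // alphanumerics, '-' and '.', at most 253 characters.
  def executorPodName(applicationId: String, executorId: Long): String = {
    val name = s"$applicationId-exec-$executorId".toLowerCase
    require(name.length <= 253, s"pod name too long: $name")
    name
  }
}
```

A restarted driver reusing the same application ID then produces the same executor names deterministically, so stale terminating pods have to be handled explicitly instead of being dodged by a random suffix.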
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21543 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3952/ Test PASSed.
[GitHub] spark issue #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21543 Merged build finished. Test PASSed.
[GitHub] spark pull request #21529: [SPARK-24495][SQL] EnsureRequirement returns wron...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21529#discussion_r194861969

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -679,6 +679,17 @@ class PlannerSuite extends SharedSQLContext {
     }
     assert(rangeExecInZeroPartition.head.outputPartitioning == UnknownPartitioning(0))
   }
+
+  test("SPARK-24495: EnsureRequirements can return wrong plan when reusing the same key in join") {
+    withSQLConf(("spark.sql.shuffle.partitions", "1"),
+      ("spark.sql.constraintPropagation.enabled", "false"),
+      ("spark.sql.autoBroadcastJoinThreshold", "-1")) {
+      val df1 = spark.range(100).repartition(2, $"id", $"id")

--- End diff --

This way would not be OK, as we would have a `RangePartitioning`, while the issue appears only with `HashPartitioning`.
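A toy model (not Spark's actual `HashPartitioning`, which hashes expression values with Murmur3) of why partitioning on `(id, id)` is not interchangeable with partitioning on `(id)`: the partition is a hash of the whole expression sequence, so duplicating a key changes where rows land.

```scala
// Toy hash partitioning: a row's partition is the hash of its
// partitioning expression values modulo the partition count.
case class ToyHashPartitioning(exprs: Seq[String], numPartitions: Int) {
  def partitionFor(row: Map[String, Int]): Int =
    math.floorMod(exprs.map(row).hashCode, numPartitions)
}

object PartitioningDemo {
  val onId      = ToyHashPartitioning(Seq("id"), 2)
  val onIdTwice = ToyHashPartitioning(Seq("id", "id"), 2)

  // Some row lands in different partitions under the two schemes, so a
  // planner that treats them as co-partitioned would join mismatched data.
  def schemesDiverge: Boolean =
    (0 until 100).exists { i =>
      onId.partitionFor(Map("id" -> i)) != onIdTwice.partitionFor(Map("id" -> i))
    }
}
```

This is why the regression test above has to use `repartition(2, $"id", $"id")` (a `HashPartitioning` with a repeated key) rather than a range-based alternative.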
[GitHub] spark issue #21366: [SPARK-24248][K8S] Use level triggering and state reconc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21366 Merged build finished. Test PASSed.
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21258 Thanks! merging to master.
[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194861645

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -17,15 +17,56 @@
 package org.apache.spark.sql.execution.datasources.v2

-import org.apache.spark.sql.Strategy
+import org.apache.spark.sql.{execution, Strategy}
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec}

 object DataSourceV2Strategy extends Strategy {
   override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case r: DataSourceV2Relation =>
-      DataSourceV2ScanExec(r.output, r.source, r.options, r.pushedFilters, r.reader) :: Nil
+    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
+      val projectSet = AttributeSet(project.flatMap(_.references))
+      val filterSet = AttributeSet(filters.flatMap(_.references))
+
+      val projection = if (filterSet.subsetOf(projectSet) &&
+          AttributeSet(relation.output) == projectSet) {
+        // When the required projection contains all of the filter columns and column pruning alone
+        // can produce the required projection, push the required projection.
+        // A final projection may still be needed if the data source produces a different column
+        // order or if it cannot prune all of the nested columns.
+        relation.output
+      } else {
+        // When there are filter columns not already in the required projection or when the required
+        // projection is more complicated than column pruning, base column pruning on the set of
+        // all columns needed by both.
+        (projectSet ++ filterSet).toSeq
+      }
+
+      val reader = relation.newReader

--- End diff --

I didn't realize you were talking about other v2 sources. Yes, two readers would be configured for v2. If you wanted to avoid it, you could cache when pushdown is expensive in the implementation, or we could add something else that prevents that case.

We need to do *something* to fix the current behavior of doing pushdown in the optimizer. I'm perfectly happy with less accurate stats for v2 until stats use the physical plan, or with a solution like this where pushdown happens twice. I just don't think it is a good idea to continue with the design where the logical plan needs to use the v2 reader APIs. I think we agree that that should happen once, and conversion to the physical plan is where it makes the most sense.
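The branching in the diff above reduces to a small set computation. A simplified model (plain `Set[String]` standing in for `AttributeSet`; names illustrative) is:

```scala
object ColumnPruning {
  // Decide which columns to ask the data source for, mirroring the
  // projection choice in the DataSourceV2Strategy diff above.
  def requiredColumns(
      relationOutput: Seq[String],
      projectRefs: Set[String],
      filterRefs: Set[String]): Seq[String] =
    if (filterRefs.subsetOf(projectRefs) && relationOutput.toSet == projectRefs) {
      // Column pruning alone satisfies the query: push the relation's
      // output as-is, preserving its column order.
      relationOutput
    } else {
      // Otherwise prune down to everything the project and filters need.
      (projectRefs ++ filterRefs).toSeq
    }
}
```

For example, a query projecting every relation column whose filters touch only projected columns pushes the output unchanged; a query projecting `a` but filtering on `b` must still ask the source for both.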
[GitHub] spark pull request #21543: [SPARK-24531][TESTS] Replace 2.3.0 version with 2...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21543 [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1

## What changes were proposed in this pull request?

The PR updates the 2.3 version tested to the new release 2.3.1.

## How was this patch tested?

existing UTs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21543.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21543

commit 08461b45bb4dc11e74976e6335269907476d02ca
Author: Marco Gaido
Date: 2018-06-12T19:28:03Z

    [SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1
[GitHub] spark pull request #21067: [SPARK-23980][K8S] Resilient Spark driver on Kube...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21067#discussion_r194860793

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesFeatureConfigStep.scala ---
@@ -18,54 +18,87 @@
 package org.apache.spark.deploy.k8s.features

 import io.fabric8.kubernetes.api.model.HasMetadata

-import org.apache.spark.deploy.k8s.SparkPod
+import org.apache.spark.deploy.k8s.{SparkExecutorPod, SparkJob}

 /**
- * A collection of functions that together represent a "feature" in pods that are launched for
+ * A collection of functions that together represent a
+ * "feature" in jobs and pods that are launched for
  * Spark drivers and executors.
  */
 private[spark] trait KubernetesFeatureConfigStep {
   /**
-   * Apply modifications on the given pod in accordance to this feature. This can include attaching
+   * Apply modifications on the given job in accordance to this feature. This can include attaching
    * volumes, adding environment variables, and adding labels/annotations.
    *
-   * Note that we should return a SparkPod that keeps all of the properties of the passed SparkPod
+   * Note that we should return a SparkJob that keeps all of the properties of the passed SparkJob
    * object. So this is correct:
    *
-   * {@code val configuredPod = new PodBuilder(pod.pod)
+   * {@code val configuredJob = new JobBuilder(job.job)
    *   .editSpec()
    *   ...
    *   .build()
-   * val configuredContainer = new ContainerBuilder(pod.container)
+   * val configuredContainer = new ContainerBuilder(job.container)
    *   ...
    *   .build()
-   * SparkPod(configuredPod, configuredContainer)
+   * SparkJob(configuredJob, configuredContainer)
    * }
    *
    * This is incorrect:
    *
-   * {@code val configuredPod = new PodBuilder() // Loses the original state
+   * {@code val configuredJob = new JobBuilder() // Loses the original state
    *   .editSpec()
    *   ...
    *   .build()
    * val configuredContainer = new ContainerBuilder() // Loses the original state
    *   ...
    *   .build()
-   * SparkPod(configuredPod, configuredContainer)
+   * SparkJob(configuredJob, configuredContainer)
    * }
    *
    */
-  def configurePod(pod: SparkPod): SparkPod
+  def configureJob(job: SparkJob): SparkJob = SparkJob.initialJob()

+  /**
+   * Apply modifications on the given executor pod in
+   * accordance to this feature. This can include attaching
+   * volumes, adding environment variables, and adding labels/annotations.
+   *
+   * Note that we should return a SparkExecutorPod that keeps all
+   * of the properties of the passed SparkExecutor
+   * object. So this is correct:
+   *
+   * {@code val configuredExecutorPod = new PodBuilder(pod.pod)
+   *   .editSpec()
+   *   ...
+   *   .build()
+   * val configuredContainer = new ContainerBuilder(pod.container)
+   *   ...
+   *   .build()
+   * SparkExecutorPod(configuredPod, configuredContainer)
+   * }
+   *
+   * This is incorrect:
+   *
+   * {@code val configuredExecutorPod = new PodBuilder() // Loses the original state
+   *   .editSpec()
+   *   ...
+   *   .build()
+   * val configuredContainer = new ContainerBuilder() // Loses the original state
+   *   ...
+   *   .build()
+   * SparkExecutorPod(configuredPod, configuredContainer)
+   * }
+   *
+   */
+  def configureExecutorPod(pod: SparkExecutorPod): SparkExecutorPod = SparkExecutorPod.initialPod()

--- End diff --

Ok, looking a bit more closely at the API: I think what we want is for all feature steps to configure a `SparkPod`, where:

```
case class SparkPod(podTemplate: PodTemplateSpec, container: Container)
```

Then the `KubernetesDriverBuilder` adapts the pod template into a Job, but the `KubernetesExecutorBuilder` adapts the pod template into just a `Pod`. For the Job, the adapting part is trivial (`JobBuilder.withNewJobSpec().withSpec(...)`). We can translate a pod template spec into a pod as well:

* inject `PodTemplateSpec.getMetadata()` into `PodBuilder.withMetadata()`,
* inject `PodTemplateSpec.getSpec()` into `PodBuilder.withSpec()`, and
* set the name via `PodBuilder.editMetadata().withName()`.
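The adaptation described above can be modeled with plain case classes standing in for the fabric8 model types (all names here are simplified and illustrative): the driver builder wraps the shared template in a Job, while the executor builder copies the template's metadata and spec into a bare Pod and overrides the per-executor name.

```scala
// Simplified stand-ins for the io.fabric8 model classes.
case class Metadata(name: String, labels: Map[String, String])
case class PodTemplateSpec(metadata: Metadata, spec: String)
case class Pod(metadata: Metadata, spec: String)
case class Job(template: PodTemplateSpec)

object PodTemplateAdapters {
  // Driver side: wrapping the template in a Job is trivial.
  def toJob(template: PodTemplateSpec): Job = Job(template)

  // Executor side: inject the template's metadata and spec into a Pod,
  // then override the name for this executor.
  def toPod(template: PodTemplateSpec, podName: String): Pod =
    Pod(template.metadata.copy(name = podName), template.spec)
}
```

The design benefit is that every feature step stays oblivious to whether its output ends up in a Job or a Pod; only the two builders perform the final adaptation.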