[GitHub] spark issue #22288: [SPARK-22148][Scheduler] Acquire new executors to avoid ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22288 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test PASSed.
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95723/
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95723/testReport)** for PR 22318 at commit [`809b8a8`](https://github.com/apache/spark/commit/809b8a83b7ec3d62ba6d65f6aff6a7d3175bd3e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22288: [SPARK-22148][Scheduler] Acquire new executors to avoid ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22288 **[Test build #95729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95729/testReport)** for PR 22288 at commit [`87c4e57`](https://github.com/apache/spark/commit/87c4e57bb966078c8a78eabc5a5e4b6f60c78f28).
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Merged build finished. Test PASSed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95727/
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95721/
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Merged build finished. Test PASSed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22328 **[Test build #95727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95727/testReport)** for PR 22328 at commit [`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22338 **[Test build #95721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95721/testReport)** for PR 22338 at commit [`6cb94ed`](https://github.com/apache/spark/commit/6cb94ed5ed76e23e4fb775a2fd6f0e66e9c15abd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 Thank you for approval, @srowen.
[GitHub] spark pull request #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ig...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22336
[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22336 Wow, great. Thanks, @gatorsmile !
[GitHub] spark issue #22335: [SPARK-25091][SQL] reduce the storage memory in Executor...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22335 Also, this should be `[core]` not `[sql]`.
[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22336 Sorry... I just merged it.
[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22336 Finally, it's passed. Can I merge this, @cloud-fan and @HyukjinKwon ? :)
[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22336 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95722/
[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22336 Merged build finished. Test PASSed.
[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22336 **[Test build #95722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95722/testReport)** for PR 22336 at commit [`69f207f`](https://github.com/apache/spark/commit/69f207f8a4531435c4a8df790780557968a33bb1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r215376132 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -82,7 +83,7 @@ case class CreateHiveTableAsSelectCommand( query, overwrite = true, ifPartitionNotExists = false, - outputColumns = outputColumns).run(sparkSession, child) + outputColumnNames = outputColumnNames).run(sparkSession, child) --- End diff -- `outputColumnNames` themselves. Specifying `outputColumnNames` as the name of the parameter to set with the `outputColumnNames` value does nothing but introduce duplication. If you removed one `outputColumnNames`, comprehension would not suffer at all, would it?
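The duplication under discussion is the `name = name` pattern at a call site. A generic illustration in Python (the function and argument names here are hypothetical stand-ins, not the Spark API):

```python
def run_insert(query, overwrite, if_partition_not_exists, output_column_names):
    # Stand-in for the command being constructed in the reviewed diff.
    return {"query": query, "overwrite": overwrite,
            "columns": output_column_names}

output_column_names = ["id", "name"]

# Keyword form repeats the same identifier on both sides of '=':
a = run_insert("q", True, False, output_column_names=output_column_names)

# Positional form says the same thing once; whether the keyword adds
# clarity is a judgment call when adjacent arguments share a type.
b = run_insert("q", True, False, output_column_names)
```

Both calls produce identical results; the keyword version only buys readability when the parameter name is not already obvious from the argument.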
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22112 The current solution looks good to me for unblocking the Apache Spark 2.4 release. We should definitely continue improving the fix, as the other reviewers suggested above.
[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r215371723 --- Diff: core/src/main/java/org/apache/spark/ExecutorPlugin.java --- @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark; + +import org.apache.spark.annotation.DeveloperApi; + +/** + * A plugin which can be automaticaly instantiated within each Spark executor. Users can specify + * plugins which should be created with the "spark.executor.plugins" configuration. An instance + * of each plugin will be created for every executor, including those created by dynamic allocation, + * before the executor starts running any tasks. + * + * The specific api exposed to the end users still considered to be very unstable. We will + * hopefully be able to keep compatability by providing default implementations for any methods + * added, but make no guarantees this will always be possible across all Spark releases. + * + * Spark does nothing to verify the plugin is doing legitimate things, or to manage the resources + * it uses. A plugin acquires the same privileges as the user running the task. 
A bad plugin + * could also intefere with task execution and make the executor fail in unexpected ways. */ +@DeveloperApi +public interface ExecutorPlugin { + + /** + * Initialization method that will be called during executor startup, in the same thread as + * the executor. Plugins should override this method to add in their initialization logic. + */ + default void init() {} --- End diff -- I think it should be either `init` & `shutdown`, or `start` & `stop`.
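The compatibility argument in the quoted javadoc, and vanzin's preference for a matched lifecycle pair, can be sketched generically. This is a Python stand-in with hypothetical names, not Spark's actual Java `ExecutorPlugin` interface:

```python
class ExecutorPlugin:
    """Base plugin whose hooks are no-op defaults: a hook added in a later
    release does not break plugins written against the older interface,
    because they simply inherit the default."""

    def init(self):
        # Called once at executor startup, before any task runs.
        pass

    def shutdown(self):
        # Called once when the executor is shutting down.
        pass

class MetricsPlugin(ExecutorPlugin):
    def __init__(self):
        self.events = []

    def init(self):
        self.events.append("init")
    # shutdown() deliberately not overridden: inherits the no-op default.

def run_executor(plugins):
    # Symmetric lifecycle: every plugin is initialized before tasks start
    # and shut down at executor exit.
    for p in plugins:
        p.init()
    # ... tasks would run here ...
    for p in plugins:
        p.shutdown()
```

The paired names (`init`/`shutdown` or `start`/`stop`) make the symmetry visible at the call sites above.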
[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r215372559 --- Diff: core/src/main/java/org/apache/spark/ExecutorPlugin.java --- @@ -0,0 +1,50 @@ ... + /** + * Initialization method that will be called during executor startup, in the same thread as --- End diff -- Since this is a new public API and thus this is public documentation, follow javadoc rules: - first sentence should be short - break things into paragraphs as needed (separated with `<p>`). e.g. ``` Callback for initializing an executor plugin. This is called synchronously from the executor initialization thread. ```
[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r215373277 --- Diff: core/src/main/java/org/apache/spark/ExecutorPlugin.java --- @@ -0,0 +1,50 @@ ... + /** + * Stop method, to be called when the executor is shutting down. Plugins should clean up --- End diff -- Prefer: ``` Callback for cleaning up executor plugin resources. Called when the executor is shutting down. ``` "Prepare to terminate" is also a little vague; what does that mean? Shouldn't the plugin be terminated already when this method returns? Otherwise there are no guarantees of what happens, right?
[GitHub] spark pull request #22325: [SPARK-25318]. Add exception handling when wrappi...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22325#discussion_r215368719 --- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala --- @@ -444,10 +444,23 @@ final class ShuffleBlockFetcherIterator( throwFetchFailedException(blockId, address, e) } - input = streamWrapper(blockId, in) + var wrapCorruption: Boolean = false --- End diff -- Wouldn't it be better to merge this block into the existing code below? Something along the lines of:
```
try {
  val input = streamWrapper(blockId, in)
  if (detectCorrupt && !wrapCorruption && !input.eq(in) && size < maxBytesInFlight / 3) {
    // do the corruption detection
  }
} catch {
  // existing catch cases from around L474
}
```
It seems you're trying to achieve the same thing as the existing code (e.g. retry on IOException), so it would be nice to re-use the same code for that.
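The shape vanzin is suggesting — put both the stream wrapping and the corruption check inside one `try` so they share a single retry/error path — can be sketched in miniature. This is a hedged Python stand-in with hypothetical helpers, not the actual `ShuffleBlockFetcherIterator` logic:

```python
def fetch_block(raw_stream, wrap, detect_corrupt, check, max_attempts=2):
    """Wrap a fetched stream and optionally verify it.

    Both the wrapping and the corruption check live in one try block, so
    an IOError from either step hits the same handler: retry until the
    attempt budget is exhausted, then surface the error as a fetch failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            stream = wrap(raw_stream)
            if detect_corrupt and stream is not raw_stream:
                check(stream)  # may raise IOError on corrupt data
            return stream
        except IOError:
            if attempt == max_attempts:
                raise  # out of retries: caller treats this as fetch failed
```

The point of the restructuring is not new behavior but one shared handler: without it, the wrapping path and the corruption-detection path each need their own duplicated retry logic.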
[GitHub] spark issue #21404: [SPARK-24360][SQL] Support Hive 3.0 metastore
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/21404 Thank you for the review, @tooptoop4 and @wangyum . I'm going to update this to the latest Hive 3.1.0.
[GitHub] spark issue #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it's insta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22333 **[Test build #95728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95728/testReport)** for PR 22333 at commit [`14d1017`](https://github.com/apache/spark/commit/14d10172f9d62ba9a8eabec5aaa11759757278bf).
[GitHub] spark issue #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it's insta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22333 Merged build finished. Test PASSed.
[GitHub] spark issue #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it's insta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22333 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2882/
[GitHub] spark pull request #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it'...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22333#discussion_r215365678 --- Diff: build/mvn --- @@ -91,15 +92,23 @@ install_mvn() { # Install zinc under the build/ folder install_zinc() { - local zinc_path="zinc-0.3.15/bin/zinc" - [ ! -f "${_DIR}/${zinc_path}" ] && ZINC_INSTALL_FLAG=1 - local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com} + ZINC_VERSION=0.3.15 --- End diff -- Oh, thanks! Sure.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs thanks for testing it out! I've created https://issues.apache.org/jira/browse/SPARK-25341 and https://issues.apache.org/jira/browse/SPARK-25342 to track the follow-up. I think these two together are the long-term solution. Users can sort/checkpoint to eliminate the indeterminacy, or use a reliable shuffle storage to avoid fetch failures (someone is proposing this on the dev list). If users can't avoid it and hit the issue, this PR provides a final guard to rerun some stages and get a correct result. For Spark 2.4 we just fail the job, and we will finish the above 2 tickets in Spark 3.0.
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 - Since `runSparkSubmit` is a part of `SparkSubmitTestUtils`, some test cases will fail if we always remove env variables like `SPARK_DIST_CLASSPATH`, which gives the child process the correct test-time classpath. - For executing an old Spark distribution, yes: `SPARK_DIST_CLASSPATH` was the direct root cause, and `SPARK_PREPEND_CLASSES` also doesn't look safe to me. I added the others to be proactively preventive.
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22209 (Feel free to do it if you think it should be there, btw.)
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22209 I guess it should be ok. Just trying to be conservative and not inadvertently make the branch less stable in the middle of an rc cycle...
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95720/
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test PASSed.
[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22335#discussion_r215359589 --- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala --- @@ -509,6 +507,10 @@ private class LiveRDD(val info: RDDInfo) extends LiveEntity { private val distributions = new HashMap[String, LiveRDDDistribution]() + def isEmpty(): Boolean = { +memoryUsed == 0L && diskUsed == 0L && partitions.isEmpty && distributions.isEmpty --- End diff -- Shouldn't this be the equivalent of `storageLevel == None` instead? That's what 2.2 does: https://github.com/apache/spark/blob/branch-2.2/core/src/main/scala/org/apache/spark/storage/StorageStatusListener.scala#L52
[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22335#discussion_r215360659 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -710,9 +719,21 @@ private[spark] class AppStatusListener( val executorId = event.blockUpdatedInfo.blockManagerId.executorId // Whether values are being added to or removed from the existing accounting. +// BlockManager always send empty block status message when user try to remove rdd block, +// so we try to get this removed block size from rdd partition to get accurate memory/disk storage size. val storageLevel = event.blockUpdatedInfo.storageLevel -val diskDelta = event.blockUpdatedInfo.diskSize * (if (storageLevel.useDisk) 1 else -1) -val memoryDelta = event.blockUpdatedInfo.memSize * (if (storageLevel.useMemory) 1 else -1) +val diskDelta: Long = storageLevel != StorageLevel.NONE match { --- End diff -- A simple `if` is much clearer (and probably faster) here. But see the later comment: I believe it's better for this code to be in the "unpersist" callback, leaving the block manager code unchanged. This code is also wrong as it is: if you're changing the storage level of an RDD (e.g. it was cached on disk, but after the update it's not), then this does the wrong thing. (So, another argument for treating the "unpersist" event differently, in the appropriate callback.)
[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22335#discussion_r215360036 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1549,7 +1549,7 @@ private[spark] class BlockManager( // TODO: Avoid a linear scan by creating another mapping of RDD.id to blocks. logInfo(s"Removing RDD $rddId") val blocksToRemove = blockInfoManager.entries.flatMap(_._1.asRDDId).filter(_.rddId == rddId) -blocksToRemove.foreach { blockId => removeBlock(blockId, tellMaster = false) } --- End diff -- I don't think this should be changing. Instead, the `onUnpersistRDD` callback in the listener should be updating the executor data it's tracking when the RDD is being unpersisted. That's similar to what I understand was being done in 2.2.
[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22335#discussion_r215359111 --- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala --- @@ -120,7 +118,7 @@ private class LiveTask( stageAttemptId: Int, lastUpdateTime: Option[Long]) extends LiveEntity { - import LiveEntityHelpers._ --- End diff -- None of these import changes are necessary...
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95720/testReport)** for PR 22318 at commit [`809b8a8`](https://github.com/apache/spark/commit/809b8a83b7ec3d62ba6d65f6aff6a7d3175bd3e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user squito commented on the issue: https://github.com/apache/spark/pull/22209 Any reason not to merge to 2.3? It's a bug in 2.3.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22328 **[Test build #95727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95727/testReport)** for PR 22328 at commit [`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365).
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Merged build finished. Test PASSed.
[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22335#discussion_r215358096 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -646,8 +646,17 @@ private[spark] class AppStatusListener( } override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit = { -liveRDDs.remove(event.rddId) -kvstore.delete(classOf[RDDStorageInfoWrapper], event.rddId) +while (true) { --- End diff -- There's a single thread that calls this method (or any other listener method in this class). But I'm kinda wondering what's the point of the loop in the first place... It seems you'll get into an infinite loop if for some reason `rdd.isEmpty()` returns false.
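The hazard vanzin points out, a `while (true)` that only exits once the RDD "becomes" empty, can be contrasted with a single-pass removal. A minimal Python sketch with hypothetical names standing in for the listener's `liveRDDs` and `kvstore` state:

```python
# Hypothetical stand-ins for the listener's tracked state.
live_rdds = {1: {"partitions": ["p0"], "distributions": []}}
kvstore = {1: "rdd-storage-info"}

def on_unpersist_rdd(rdd_id):
    """Single-pass removal: pop the RDD entry and drop its stored info.

    A loop that re-checks an isEmpty()-style condition each iteration
    never terminates if the RDD still holds partitions, because nothing
    inside the loop removes them. Unconditional removal has no such trap.
    """
    rdd = live_rdds.pop(rdd_id, None)
    if rdd is not None:
        kvstore.pop(rdd_id, None)
    return rdd is not None
```

Calling it twice shows the idempotent behavior: the first call removes the entry, the second is a harmless no-op rather than a spin.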
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2881/ Test PASSed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22328 The image data source tests passed but the JVM crashed at the end. Triggered another test.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22328 test this please
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22140#discussion_r215356501 --- Diff: python/pyspark/sql/types.py --- @@ -1397,6 +1397,8 @@ def _create_row_inbound_converter(dataType): def _create_row(fields, values): +if len(values) > len(fields): +raise ValueError("Can not create %s by %s" % (fields, values)) --- End diff -- I'd like to improve this message a little, maybe "Can not create Row with fields %s, expected %d values but got %s" % (fields, len(fields), values)
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22140#discussion_r215355538 --- Diff: python/pyspark/sql/tests.py --- @@ -269,6 +269,10 @@ def test_struct_field_type_name(self): struct_field = StructField("a", IntegerType()) self.assertRaises(TypeError, struct_field.typeName) +def test_invalid_create_row(slef): --- End diff -- typo: `slef` -> `self`
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attibutes fr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Merged build finished. Test PASSed.
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attibutes fr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95718/ Test PASSed.
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attibutes fr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22326 **[Test build #95718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95718/testReport)** for PR 22326 at commit [`777b881`](https://github.com/apache/spark/commit/777b881df5f577d2bb4100b96748692d494c54c9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95713/ Test PASSed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95713/testReport)** for PR 22112 at commit [`8952d08`](https://github.com/apache/spark/commit/8952d082b7b9082d38f5b332ccded2d2d7c96b08). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22325 **[Test build #95726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95726/testReport)** for PR 22325 at commit [`ec069b3`](https://github.com/apache/spark/commit/ec069b3d269b6ec2e6ca9edcc0f6981ccecdad3d).
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22209
[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22325 ok to test
[GitHub] spark pull request #22329: [SPARK-25328][PYTHON] Add an example for having t...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22329#discussion_r215345817 --- Diff: python/pyspark/sql/functions.py --- @@ -2804,6 +2804,22 @@ def pandas_udf(f=None, returnType=None, functionType=None): | 1|1.5| | 2|6.0| +---+---+ + >>> @pandas_udf( + ..."id long, additional_key double, v double", --- End diff -- Sorry, I know you just changed it, but I think just naming the column "ceil(v1 / 2)" with a type `long` would be a little more clear. Although "additional_key" is ok too, if you guys want to keep that.
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...
Github user ankuriitg commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r215345106 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/UISeleniumSuite.scala --- @@ -77,7 +77,14 @@ class UISeleniumSuite inputStream.foreachRDD { rdd => rdd.foreach(_ => {}) try { -rdd.foreach(_ => throw new RuntimeException("Oops")) +rdd.foreach { _ => + import org.apache.spark.TaskContext --- End diff -- Thanks!
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22209 Merging to master.
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r215344056 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/UISeleniumSuite.scala --- @@ -77,7 +77,14 @@ class UISeleniumSuite inputStream.foreachRDD { rdd => rdd.foreach(_ => {}) try { -rdd.foreach(_ => throw new RuntimeException("Oops")) +rdd.foreach { _ => + import org.apache.spark.TaskContext --- End diff -- This should be together with the other imports. Will fix during merge.
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r215343938 --- Diff: core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala --- @@ -1190,6 +1190,61 @@ class AppStatusListenerSuite extends SparkFunSuite with BeforeAndAfter { assert(appStore.asOption(appStore.lastStageAttempt(3)) === None) } + test("SPARK-24415: update metrics for tasks that finish late") { +val listener = new AppStatusListener(store, conf, true) + +val stage1 = new StageInfo(1, 0, "stage1", 4, Nil, Nil, "details1") +val stage2 = new StageInfo(2, 0, "stage2", 4, Nil, Nil, "details2") + +// Start job +listener.onJobStart(SparkListenerJobStart(1, time, Seq(stage1, stage2), null)) + +// Start 2 stages +listener.onStageSubmitted(SparkListenerStageSubmitted(stage1, new Properties())) +listener.onStageSubmitted(SparkListenerStageSubmitted(stage2, new Properties())) + +// Start 2 Tasks +val tasks = createTasks(2, Array("1")) +tasks.foreach { task => + listener.onTaskStart(SparkListenerTaskStart(stage1.stageId, stage1.attemptNumber, task)) +} + +// Task 1 Finished +time += 1 +tasks(0).markFinished(TaskState.FINISHED, time) +listener.onTaskEnd( + SparkListenerTaskEnd(stage1.stageId, stage1.attemptId, "taskType", Success, tasks(0), null)) + +// Stage 1 Completed +stage1.failureReason = Some("Failed") +listener.onStageCompleted(SparkListenerStageCompleted(stage1)) + +// Stop job 1 +time += 1 +listener.onJobEnd(SparkListenerJobEnd(1, time, JobSucceeded)) + +// Task 2 Killed +time += 1 +tasks(1).markFinished(TaskState.FINISHED, time) +listener.onTaskEnd( + SparkListenerTaskEnd(stage1.stageId, stage1.attemptId, "taskType", +TaskKilled(reason = "Killed"), tasks(1), null)) + +// Ensure killed task metrics are updated +val allStages = store.view(classOf[StageDataWrapper]).reverse().asScala.map(_.info) +val failedStages = allStages.filter(_.status == v1.StageStatus.FAILED) +assert(failedStages.size == 1) 
+assert(failedStages.head.numKilledTasks == 1) +assert(failedStages.head.numCompleteTasks == 1) + +val allJobs = store.view(classOf[JobDataWrapper]).reverse().asScala.map(_.info) +assert(allJobs.size == 1) --- End diff -- Ah, missed that. Took another look and it should be fine.
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/22327 The goal for the 2.7.x line should be "nothing breaks", which is precisely why it's only getting critical patches. Reverting might make the problem go away, but you can assume that everyone running HDFS clusters on something based on the 2.7.x line will pull this patch in before long: it's best to identify what's up and address it.
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22340 I see. It does seem like we don't want to run with test env variables in this context. I was going to ask whether we ever do: should this function always strip the env variables for testing? I can see being conservative and restricting it to this case. It seems like just stripping `SPARK_DIST_CLASSPATH` and maybe `SPARK_PREPEND_CLASSES` is the issue?
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Merged build finished. Test FAILed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95724/ Test FAILed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22328 **[Test build #95724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95724/testReport)** for PR 22328 at commit [`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22327 If it's an unintentional regression in Hadoop, and is impacting users, I'd agree with downgrading the version here (at least to the version that doesn't exhibit the issue). If it were intentional behavior, we'd really need to work around it. @steveloughran seems like you think it's a regression insofar as behavior changes like this aren't expected in maintenance releases?
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 @dongjoon-hyun I have updated the PR description to explain in more detail. As you mentioned, this PR is specific to the case of reading from a data source table persisted in the metastore.
[GitHub] spark pull request #22323: [SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be t...
Github user rvesse commented on a diff in the pull request: https://github.com/apache/spark/pull/22323#discussion_r215338145 --- Diff: docs/running-on-kubernetes.md --- @@ -215,6 +215,19 @@ spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.options.clai The configuration properties for mounting volumes into the executor pods use prefix `spark.kubernetes.executor.` instead of `spark.kubernetes.driver.`. For a complete list of available options for each supported type of volumes, please refer to the [Spark Properties](#spark-properties) section below. +## Local Storage + +Spark uses temporary scratch space to spill data to disk during shuffles and other operations. When using Kubernetes as the resource manager, the pods will be created with an [emptyDir](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir) volume mounted for each directory listed in `SPARK_LOCAL_DIRS`. If no directories are explicitly specified, a default directory is created and configured appropriately. + +`emptyDir` volumes use the ephemeral storage feature of Kubernetes and do not persist beyond the life of the pod. + +### Using RAM for local storage + +As `emptyDir` volumes use the node's backing storage for ephemeral storage, this default behaviour may not be appropriate for some compute environments. For example, if you have diskless nodes with remote storage mounted over a network, having lots of executors doing IO to this remote storage may actually degrade performance. + +In this case it may be desirable to set `spark.kubernetes.local.dirs.tmpfs=true` in your configuration, which will cause the `emptyDir` volumes to be configured as `tmpfs`, i.e. RAM-backed volumes. When configured like this, Spark's local storage usage will count towards your pod's memory usage, so you may wish to increase your memory requests via the normal `spark.driver.memory` and `spark.executor.memory` configuration properties.
--- End diff -- Ok, I'll update the proposed doc changes to mention both
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user suryag10 commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215338063 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/hadoopsteps/HadoopBootstrapUtil.scala --- @@ -0,0 +1,186 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features.hadoopsteps + +import java.io.File + +import scala.collection.JavaConverters._ + +import com.google.common.base.Charsets +import com.google.common.io.Files +import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder, ContainerBuilder, KeyToPathBuilder, PodBuilder} + +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.SparkPod +import org.apache.spark.deploy.k8s.security.KubernetesHadoopDelegationTokenManager + +private[spark] object HadoopBootstrapUtil { + + /** +* Mounting the DT secret for both the Driver and the executors +* +* @param dtSecretName Name of the secret that stores the Delegation Token +* @param dtSecretItemKey Name of the Item Key storing the Delegation Token +* @param userName Name of the SparkUser to set SPARK_USER +* @param fileLocation Location of the krb5 file +* @param krb5ConfName Name of the ConfigMap for Krb5 +* @param pod Input pod to be appended to +* @return a modified SparkPod +*/ + def bootstrapKerberosPod( + dtSecretName: String, + dtSecretItemKey: String, + userName: String, + fileLocation: String, + krb5ConfName: String, + pod: SparkPod) : SparkPod = { + val krb5File = new File(fileLocation) + val fileStringPath = krb5File.toPath.getFileName.toString + val kerberizedPod = new PodBuilder(pod.pod) +.editOrNewSpec() + .addNewVolume() +.withName(SPARK_APP_HADOOP_SECRET_VOLUME_NAME) +.withNewSecret() + .withSecretName(dtSecretName) + .endSecret() +.endVolume() + .addNewVolume() +.withName(KRB_FILE_VOLUME) + .withNewConfigMap() +.withName(krb5ConfName) +.withItems(new KeyToPathBuilder() + .withKey(fileStringPath) + .withPath(fileStringPath) + .build()) +.endConfigMap() + .endVolume() +// TODO: (ifilonenko) make configurable PU(G)ID + .editOrNewSecurityContext() +.withRunAsUser(1000L) +.withFsGroup(2000L) +.endSecurityContext() + .endSpec() +.build() + val kerberizedContainer = new ContainerBuilder(pod.container) +.addNewVolumeMount() + 
.withName(SPARK_APP_HADOOP_SECRET_VOLUME_NAME) + .withMountPath(SPARK_APP_HADOOP_CREDENTIALS_BASE_DIR) + .endVolumeMount() +.addNewVolumeMount() + .withName(KRB_FILE_VOLUME) + .withMountPath(KRB_FILE_DIR_PATH) --- End diff -- Hi Stavros, The main aim of the following step was to mount the krb5.conf file into the /etc/ directory: .addNewVolumeMount() .withName(KRB_FILE_VOLUME) .withMountPath(KRB_FILE_DIR_PATH) When the above is done, all the contents actually present inside the container's "/etc/" directory are lost, and the directory gets mounted with only the krb5.conf file, with read permissions. As the other contents are lost from the "/etc" directory, nothing works. As I had commented earlier, this also makes the driver pod fail to spawn. To mount only the krb5.conf file without clobbering the rest of the directory, the following should be done instead: .addNewVolumeMount() .withMountPath(KRB_FILE_DIR_PATH + "/krb5.conf") .withSubPath("krb5.conf") With this, everything works without any deletions from the /etc/ directory.
[GitHub] spark issue #20999: [WIP][SPARK-14922][SPARK-23866][SQL] Support partition f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20999 Merged build finished. Test PASSed.
[GitHub] spark issue #20999: [WIP][SPARK-14922][SPARK-23866][SQL] Support partition f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20999 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2880/ Test PASSed.
[GitHub] spark issue #20999: [WIP][SPARK-23866][SQL] Support partition filters in ALT...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20999 I am reopening this according to the discussion in https://github.com/apache/spark/pull/19691. cc @maropu @DazhuangSu
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/22327 The 2.7.x branch updates are generally all security fixes plus some fixes for JVM/JDK regressions. Without looking at the details, you can assume that the regression is related to one of these, and for that reason I wouldn't recommend rolling back. Better to find the problem and come up with a fix or workaround. Created [HADOOP-15722](https://issues.apache.org/jira/browse/HADOOP-15722) to cover this.
[GitHub] spark issue #20999: [WIP][SPARK-23866][SQL] Support partition filters in ALT...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20999 **[Test build #95725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95725/testReport)** for PR 20999 at commit [`148f477`](https://github.com/apache/spark/commit/148f47742cae892260c46f9ffa97bb2d0422701d).
[GitHub] spark pull request #20999: [WIP][SPARK-23866][SQL] Support partition filters...
GitHub user mgaido91 reopened a pull request: https://github.com/apache/spark/pull/20999 [WIP][SPARK-23866][SQL] Support partition filters in ALTER TABLE DROP PARTITION ## What changes were proposed in this pull request? Hive has for a while supported dropping partitions using any kind of comparison operator on them. Spark so far supports only dropping partitions by exact values. For instance, Spark doesn't support: ``` ALTER TABLE mytable DROP PARTITION(mydate < '2018-04-06') ``` The PR adds support for this syntax too. ## How was this patch tested? UTs to be added You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-23866 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20999.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20999 commit b57a5d1797dbe206aeb0a4d2a24ccd0c73845dc8 Author: Marco Gaido Date: 2018-04-05T15:35:19Z [WIP][SPARK-23866][SQL] Support partition filters in ALTER TABLE DROP PARTITION commit 148f47742cae892260c46f9ffa97bb2d0422701d Author: Marco Gaido Date: 2018-09-05T16:11:38Z adding UT from SPARK-14922 by DazhuangSu
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 Also, cc @gatorsmile.
[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22141#discussion_r215333831 --- Diff: sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/nested-not-in.sql --- @@ -0,0 +1,198 @@ +-- Tests NOT-IN subqueries nested inside OR expression(s). --- End diff -- @maropu Oh.. thank you very much for trying it out.
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 `HiveExternalCatalogVersionsSuite` downloads and executes old Spark releases in their own directories. However, it passed the following to the child, none of which is expected here; `SPARK_DIST_CLASSPATH` in particular is related to this bug. - SPARK_TESTING - SPARK_SQL_TESTING - SPARK_PREPEND_CLASSES - SPARK_DIST_CLASSPATH Previously, the new Spark classes sat behind the old Spark classes in the class path, so the new ones were unseen. However, Spark 2.4.0 reveals this bug due to the recent data source class changes.
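The fix amounts to launching the child with those variables removed. A hedged Python sketch of the idea (the variable names come from the comment above; the helper itself is hypothetical, since the real `runSparkSubmit` is Scala):

```python
import os

# Test-only variables that must not leak into the spawned old-Spark process.
TEST_ONLY_VARS = frozenset([
    "SPARK_TESTING", "SPARK_SQL_TESTING",
    "SPARK_PREPEND_CLASSES", "SPARK_DIST_CLASSPATH",
])

def child_env(env=None):
    """Return a copy of the environment with test-only variables stripped."""
    env = dict(os.environ if env is None else env)
    return {k: v for k, v in env.items() if k not in TEST_ONLY_VARS}
```

The cleaned dictionary would then be handed to the process launcher (for example `subprocess.Popen(..., env=child_env())`), so the child sees only the variables a real deployment would.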
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95711/ Test PASSed.
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22320 Merged build finished. Test PASSed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22328 LGTM pending tests.
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22320 **[Test build #95711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95711/testReport)** for PR 22320 at commit [`4590c98`](https://github.com/apache/spark/commit/4590c9837026e820d7d91300a7ab3f87a668755c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22328 **[Test build #95724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95724/testReport)** for PR 22328 at commit [`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365).
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2879/ Test PASSed.
[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22328 Merged build finished. Test PASSed.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22141 @maropu I think this will be considered post-2.4, so we have some time. I am now exploring options to plan better at the physical level.
[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22165 I didn't make a full pass over the tests. @jiangxb1987 let me know if you need me to take a pass.
[GitHub] spark pull request #22165: [SPARK-25017][Core] Add test suite for BarrierCoo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/22165#discussion_r215326727

--- Diff: core/src/test/scala/org/apache/spark/scheduler/BarrierCoordinatorSuite.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.TimeoutException
+
+import scala.concurrent.duration._
+import scala.language.postfixOps
+
+import org.apache.spark._
+import org.apache.spark.rpc.RpcTimeout
+
+class BarrierCoordinatorSuite extends SparkFunSuite with LocalSparkContext {
+
+  /**
+   * Get the current barrierEpoch from barrierCoordinator.states by ContextBarrierId
+   */
+  def getCurrentBarrierEpoch(
+      stageId: Int, stageAttemptId: Int, barrierCoordinator: BarrierCoordinator): Int = {
+    val barrierId = ContextBarrierId(stageId, stageAttemptId)
+    barrierCoordinator.states.get(barrierId).barrierEpoch
+  }
+
+  test("normal test for single task") {
+    sc = new SparkContext("local", "test")
+    val barrierCoordinator = new BarrierCoordinator(5, sc.listenerBus, sc.env.rpcEnv)
+    val rpcEndpointRef = sc.env.rpcEnv.setupEndpoint("barrierCoordinator", barrierCoordinator)
+    val stageId = 0
+    val stageAttemptNumber = 0
+    rpcEndpointRef.askSync[Unit](
+      message = RequestToSync(numTasks = 1, stageId, stageAttemptNumber, taskAttemptId = 0,
+        barrierEpoch = 0),
+      timeout = new RpcTimeout(5 seconds, "rpcTimeOut"))
+    // sleep for waiting barrierEpoch value change
+    Thread.sleep(500)
+    assert(getCurrentBarrierEpoch(stageId, stageAttemptNumber, barrierCoordinator) == 1)
+  }
+
+  test("normal test for multi tasks") {
+    sc = new SparkContext("local", "test")
+    val barrierCoordinator = new BarrierCoordinator(5, sc.listenerBus, sc.env.rpcEnv)
+    val rpcEndpointRef = sc.env.rpcEnv.setupEndpoint("barrierCoordinator", barrierCoordinator)
+    val numTasks = 3
+    val stageId = 0
+    val stageAttemptNumber = 0
+    val rpcTimeOut = new RpcTimeout(5 seconds, "rpcTimeOut")
+    // sync request from 3 tasks
+    (0 until numTasks).foreach { taskId =>
+      new Thread(s"task-$taskId-thread") {
+        setDaemon(true)
+        override def run(): Unit = {
+          rpcEndpointRef.askSync[Unit](
+            message = RequestToSync(numTasks, stageId, stageAttemptNumber, taskAttemptId = taskId,
+              barrierEpoch = 0),
+            timeout = rpcTimeOut)
+        }
+      }.start()
+    }
+    // sleep for waiting barrierEpoch value change
+    Thread.sleep(500)
--- End diff --

ditto
[GitHub] spark pull request #22165: [SPARK-25017][Core] Add test suite for BarrierCoo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/22165#discussion_r215326394

--- Diff: core/src/test/scala/org/apache/spark/scheduler/BarrierCoordinatorSuite.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.TimeoutException
+
+import scala.concurrent.duration._
+import scala.language.postfixOps
+
+import org.apache.spark._
+import org.apache.spark.rpc.RpcTimeout
+
+class BarrierCoordinatorSuite extends SparkFunSuite with LocalSparkContext {
+
+  /**
+   * Get the current barrierEpoch from barrierCoordinator.states by ContextBarrierId
+   */
+  def getCurrentBarrierEpoch(
+      stageId: Int, stageAttemptId: Int, barrierCoordinator: BarrierCoordinator): Int = {
+    val barrierId = ContextBarrierId(stageId, stageAttemptId)
+    barrierCoordinator.states.get(barrierId).barrierEpoch
+  }
+
+  test("normal test for single task") {
+    sc = new SparkContext("local", "test")
+    val barrierCoordinator = new BarrierCoordinator(5, sc.listenerBus, sc.env.rpcEnv)
+    val rpcEndpointRef = sc.env.rpcEnv.setupEndpoint("barrierCoordinator", barrierCoordinator)
+    val stageId = 0
+    val stageAttemptNumber = 0
+    rpcEndpointRef.askSync[Unit](
+      message = RequestToSync(numTasks = 1, stageId, stageAttemptNumber, taskAttemptId = 0,
+        barrierEpoch = 0),
+      timeout = new RpcTimeout(5 seconds, "rpcTimeOut"))
+    // sleep for waiting barrierEpoch value change
+    Thread.sleep(500)
--- End diff --

Do not use explicit sleep. It basically means adding 0.5 seconds to total test time and flakiness. Use conditional wait, for example: https://github.com/apache/spark/commit/bfb74394a5513134ea1da9fcf4a1783b77dd64e4#diff-a90010f459c27926238d7a4ce5a0aca1R107
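The conditional wait suggested above can be sketched as a small polling helper (a generic illustration with hypothetical names, not the helper from the linked commit; ScalaTest's `eventually` offers the same pattern out of the box):

```scala
// A minimal conditional-wait helper: poll the condition at a short interval and
// return as soon as it holds, instead of always sleeping for a fixed 500 ms.
object ConditionalWait {
  def waitUntil(timeoutMs: Long, pollMs: Long = 10)(condition: => Boolean): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      if (condition) return true
      Thread.sleep(pollMs)
    }
    condition // one final check at the deadline
  }
}
```

With such a helper, the `Thread.sleep(500)` plus assertion in the test above could become something like `assert(ConditionalWait.waitUntil(5000)(getCurrentBarrierEpoch(stageId, stageAttemptNumber, barrierCoordinator) == 1))`, which stops waiting as soon as the epoch advances and only pays the full timeout on failure.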
[GitHub] spark pull request #22165: [SPARK-25017][Core] Add test suite for BarrierCoo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/22165#discussion_r215324595

--- Diff: core/src/main/scala/org/apache/spark/BarrierCoordinator.scala ---
@@ -65,7 +65,7 @@ private[spark] class BarrierCoordinator(
   // Record all active stage attempts that make barrier() call(s), and the corresponding internal
   // state.
-  private val states = new ConcurrentHashMap[ContextBarrierId, ContextBarrierState]
+  private[spark] val states = new ConcurrentHashMap[ContextBarrierId, ContextBarrierState]
--- End diff --

Could you turn the `// ...` comment into ScalaDoc `/** ... */` and mention `Visible for unit testing.` in the doc?
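Concretely, the requested change might read as follows (a hypothetical sketch of the comment style only, not the actual patch):

```scala
/**
 * Record all active stage attempts that make barrier() call(s), and the corresponding
 * internal state. Visible for unit testing.
 */
private[spark] val states = new ConcurrentHashMap[ContextBarrierId, ContextBarrierState]
```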
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95712/ Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Merged build finished. Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #95712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95712/testReport)** for PR 18906 at commit [`97305f5`](https://github.com/apache/spark/commit/97305f53218d7d8ab6588ab2c60d1cb91c14e9dc). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215322021

--- Diff: mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *   - origin: String (represents the origin of the image. If loaded from files, then it is
+ *     the file path)
+ *   - height: Int (height of the image)
+ *   - width: Int (width of the image)
+ *   - nChannels: Int (number of the image channels)
+ *   - mode: Int (OpenCV-compatible type)
+ *   - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in `DataFrameReader` and
+ * optionally specify the data source options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ *     .option("dropImageFailures", true)
+ *     .load("data/mllib/images/partitioned")
+ *
+ *   // Java
+ *   Dataset<Row> df = spark.read().format("image")
+ *     .option("dropImageFailures", true)
+ *     .load("data/mllib/images/partitioned");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *   - "dropImageFailures": Whether to drop the files that are not valid images from the result.
--- End diff --

How about changing `dropImageFailures` to `dropInvalid`?
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215322673

--- Diff: mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+private[image] class ImageOptions(
+    @transient private val parameters: CaseInsensitiveMap[String]) extends Serializable {
+
+  def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", "false").toBoolean
--- End diff --

Should add ScalaDoc.
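One possible shape for that ScalaDoc (a hypothetical sketch; the final wording is the author's call):

```scala
/** Whether to drop files that are not valid images from the result (defaults to `false`). */
val dropImageFailures: Boolean = parameters.getOrElse("dropImageFailures", "false").toBoolean
```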