[GitHub] spark pull request: [HOTFIX] Some clean-up in shuffle code.
GitHub user pwendell opened a pull request: https://github.com/apache/spark/pull/5286 [HOTFIX] Some clean-up in shuffle code. Before diving into review #4450 I took a look through the existing shuffle code to learn how it works. Unfortunately, there are some very confusing things in this code. This patch makes a few small changes to simplify things. It is not easy to describe the changes concisely because of how convoluted the issues were, but they are logically fairly small: 1. There is a trait named `ShuffleBlockManager` that deals with only one logical function: retrieving shuffle block data given shuffle block coordinates. This trait has two implementors, `FileShuffleBlockManager` and `IndexShuffleBlockManager`. Confusingly, the vast majority of those implementations have nothing to do with this particular functionality, so I've renamed the trait to `ShuffleBlockResolver` and documented it. 2. The aforementioned trait had two almost identical methods, for no good reason. I removed one method (`getBytes`) and modified callers to use the other one. I believe the behavior is preserved in all cases. 3. The sort shuffle code uses the identifier 0 in the reduce slot of a BlockID as a placeholder. I made it a constant, since it needs to be consistent across multiple places. For (3) there is actually a better solution that would avoid the need for this kind of workaround in the first place, but it's more complex, so I'm punting on it for now. 
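The three changes above can be sketched in miniature. This is an illustrative toy, not the actual Spark code: the `ShuffleBlockId` fields and the `getBlockData` signature are assumptions made for the sake of the example.

```scala
// Toy sketch of the cleaned-up abstraction described above. The names follow
// the PR text, but the signatures are illustrative assumptions.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)

object IndexShuffleBlockResolver {
  // Change 3: the placeholder reduce ID becomes a named constant instead of
  // a bare 0 scattered across call sites.
  val NOOP_REDUCE_ID = 0
}

// Change 1: the trait is narrowed to its single logical responsibility,
// resolving block coordinates to block data.
trait ShuffleBlockResolver {
  def getBlockData(blockId: ShuffleBlockId): Array[Byte]
}

class IndexShuffleBlockResolver extends ShuffleBlockResolver {
  import IndexShuffleBlockResolver.NOOP_REDUCE_ID

  // Change 2: a single data-retrieval method; the near-duplicate getBytes
  // variant is removed and callers use this one.
  override def getBlockData(blockId: ShuffleBlockId): Array[Byte] = {
    // Stub body for illustration only.
    Array.emptyByteArray
  }

  // All places that previously wrote a literal 0 into the reduce slot now
  // share the constant.
  def dataBlockId(shuffleId: Int, mapId: Int): ShuffleBlockId =
    ShuffleBlockId(shuffleId, mapId, NOOP_REDUCE_ID)
}
```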
You can merge this pull request into a Git repository by running: $ git pull https://github.com/pwendell/spark cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5286.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5286 commit a4060797a83c591be288e8fca48affecb17412c6 Author: Patrick Wendell patr...@databricks.com Date: 2015-03-31T01:12:43Z [HOTFIX] Some clean-up in shuffle code. Before diving into review #4450 I took a look through the existing shuffle code. Unfortunately, there are some very confusing things in this code. This patch makes a few small changes to simplify things. It is not easy to describe the changes concisely because of how convoluted the issues were: 1. There was a trait named ShuffleBlockManager that deals with only one logical function: retrieving shuffle block data given shuffle block coordinates. This trait has two implementors, FileShuffleBlockManager and IndexShuffleBlockManager. Confusingly, the vast majority of those implementations have nothing to do with this particular functionality, so I've renamed the trait to ShuffleBlockResolver and documented it. 2. The aforementioned trait had two almost identical methods, for no good reason. I removed one method (getBytes) and modified callers to use the other one. I believe the behavior is preserved in all cases. 3. The sort shuffle code uses the identifier 0 in the reduce slot of a BlockID as a placeholder. I made it a constant, since it needs to be consistent across multiple places. For (3) there is actually a better solution that would avoid the need for this kind of workaround in the first place, but it's more complex, so I'm punting on it for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-6433 hive tests to import spark-sql test...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5119#discussion_r27537871 --- Diff: pom.xml --- @@ -1472,6 +1473,25 @@ <groupId>org.scalatest</groupId> <artifactId>scalatest-maven-plugin</artifactId> </plugin> + <!-- Build the JARs --> --- End diff -- Can this say something more useful? Maybe like: ``` <!-- Build test-jars for all projects, since some projects depend on tests from others --> ``` Otherwise it might be hard for someone to understand what is going on here.
[GitHub] spark pull request: SPARK-6433 hive tests to import spark-sql test...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5119#issuecomment-88300495 The overall approach LGTM, but I would suggest adding a better comment since it's non-obvious what is going on.
[GitHub] spark pull request: [SPARK-6627] Some clean-up in shuffle code.
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5286#discussion_r27496946 --- Diff: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockManager.scala --- @@ -39,25 +41,18 @@ import org.apache.spark.storage._ // Note: Changes to the format in this file should be kept in sync with // org.apache.spark.network.shuffle.StandaloneShuffleBlockManager#getSortBasedShuffleBlockData(). private[spark] -class IndexShuffleBlockManager(conf: SparkConf) extends ShuffleBlockManager { +class IndexShuffleBlockManager(conf: SparkConf) extends ShuffleBlockResolver { private lazy val blockManager = SparkEnv.get.blockManager private val transportConf = SparkTransportConf.fromSparkConf(conf) - /** - * Mapping to a single shuffleBlockId with reduce ID 0. - * */ - def consolidateId(shuffleId: Int, mapId: Int): ShuffleBlockId = { -ShuffleBlockId(shuffleId, mapId, 0) - } - def getDataFile(shuffleId: Int, mapId: Int): File = { -blockManager.diskBlockManager.getFile(ShuffleDataBlockId(shuffleId, mapId, 0)) +blockManager.diskBlockManager.getFile(ShuffleDataBlockId(shuffleId, mapId, NOOP_REDUCE_ID)) --- End diff -- Yeah - I think there is actually a third option which is just to make it so the BlockObjectWriter doesn't have to be passed a Block ID (it almost doesn't even need one). This way we totally avoid using the BlockID for the purpose of naming this file. I'm punting it to future work though.
[GitHub] spark pull request: [SPARK-6627] Some clean-up in shuffle code.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5286#issuecomment-88186520 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4450#discussion_r27423831 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ChainedBuffer.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.OutputStream + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.storage.BlockObjectWriter + +/** + * A logical byte buffer that wraps a list of byte arrays. All the byte arrays have equal size. The + * advantage of this over a standard ArrayBuffer is that it can grow without claiming large amounts + * of memory and needing to copy the full contents. + */ +private[spark] class ChainedBuffer(chunkSize: Int) { --- End diff -- Do we ever use the fact that we can do arbitrary offset reads?
[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4450#discussion_r27424659 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ChainedBuffer.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.OutputStream + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.storage.BlockObjectWriter + +/** + * A logical byte buffer that wraps a list of byte arrays. All the byte arrays have equal size. The + * advantage of this over a standard ArrayBuffer is that it can grow without claiming large amounts + * of memory and needing to copy the full contents. + */ +private[spark] class ChainedBuffer(chunkSize: Int) { + private val chunkSizeLog2 = (math.log(chunkSize) / math.log(2)).toInt + assert(math.pow(2, chunkSizeLog2).toInt == chunkSize) + private val chunks: ArrayBuffer[Array[Byte]] = new ArrayBuffer[Array[Byte]]() + private var _size: Int = _ + + /** + * Feed bytes from this buffer into a BlockObjectWriter. + * + * @param pos Offset in the buffer to read from. + * @param writer BlockObjectWriter to read into. + * @param len Number of bytes to read. 
+ */ + def read(pos: Int, writer: BlockObjectWriter, len: Int): Unit = { --- End diff -- It seems weird to me that this ChainedBuffer abstraction should be aware of a BlockObjectWriter - no concrete suggestion at this point (just getting started with this patch), but perhaps there is a clear way to separate the concerns here.
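Since the review comment above questions coupling the buffer to a BlockObjectWriter, here is a toy chunked buffer whose `read` targets a plain byte array instead. Everything in this sketch is an illustrative assumption, not the actual Spark class; it only demonstrates the power-of-two chunk-index arithmetic (`pos >> chunkSizeLog2`, `pos & (chunkSize - 1)`) that a ChainedBuffer-style structure relies on.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy chunked byte buffer: grows one fixed-size chunk at a time, so it never
// needs to copy the full contents when it expands.
class ChunkedBytes(chunkSize: Int) {
  require((chunkSize & (chunkSize - 1)) == 0, "chunkSize must be a power of 2")
  private val chunkSizeLog2 = java.lang.Integer.numberOfTrailingZeros(chunkSize)
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var _size = 0

  // Append bytes, allocating new chunks on demand.
  def write(bytes: Array[Byte]): Unit = {
    var written = 0
    while (written < bytes.length) {
      val chunkIdx = _size >> chunkSizeLog2          // which chunk
      val posInChunk = _size & (chunkSize - 1)       // offset within it
      if (chunkIdx == chunks.length) chunks += new Array[Byte](chunkSize)
      val toCopy = math.min(chunkSize - posInChunk, bytes.length - written)
      System.arraycopy(bytes, written, chunks(chunkIdx), posInChunk, toCopy)
      written += toCopy
      _size += toCopy
    }
  }

  // Arbitrary-offset read into a caller-supplied array, decoupled from any
  // particular writer class (the separation of concerns suggested above).
  def read(pos: Int, dest: Array[Byte], len: Int): Unit = {
    var copied = 0
    while (copied < len) {
      val chunkIdx = (pos + copied) >> chunkSizeLog2
      val posInChunk = (pos + copied) & (chunkSize - 1)
      val toCopy = math.min(chunkSize - posInChunk, len - copied)
      System.arraycopy(chunks(chunkIdx), posInChunk, dest, copied, toCopy)
      copied += toCopy
    }
  }

  def size: Int = _size
}
```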
[GitHub] spark pull request: [HOTFIX][SPARK-4123]: Updated to fix bug where...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5269#issuecomment-87804589 Thanks - pulling this in.
[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4450#discussion_r27424016 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ChainedBuffer.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.OutputStream + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.storage.BlockObjectWriter + +/** + * A logical byte buffer that wraps a list of byte arrays. All the byte arrays have equal size. The + * advantage of this over a standard ArrayBuffer is that it can grow without claiming large amounts --- End diff -- I would maybe say explicitly that the disadvantage is we can't view/return it as a single large array. Otherwise it's not clear why ArrayBuffer isn't just implemented this way.
[GitHub] spark pull request: [SPARK-6544][build] Increment Avro version fro...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5193#issuecomment-87030538 It's a fair point - in some cases we have made minor (patch) release updates to dependencies in our own patch versions. @srowen, what do you think? It can cause some friction for packagers if dependencies change in these patch releases, but this upgrade may be alright.
[GitHub] spark pull request: [SPARK-6544][build] Increment Avro version fro...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5193#issuecomment-87062417 Okay, after reviewing the changes to Avro I've decided to backport this. I looked a bit more, and in other cases we have done fairly small dependency updates like this, provided the changes in the dependency have been minimal between versions.
[GitHub] spark pull request: [SPARK-6119][SQL] DataFrame.dropna
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5225#discussion_r27328262 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -730,6 +731,23 @@ class DataFrame private[sql]( Generate(generator, join = true, outer = false, None, logicalPlan) } + /** + * Returns a new [[DataFrame]] that omits rows with `NULL` values in the specified columns. + * If no columns are specified, a row is omitted if any column contains `NULL`. + * + * @group dfops + */ + @scala.annotation.varargs + def dropna(cols: String*): DataFrame = { --- End diff -- Is it possible to do what Cheng said? I think dropNull is a much better name, but if we have an alias dropna, I think it would be good for pandas compatibility. In general, I think aliasing things may be necessary in the way we approach data frames, since we're trying to meet a few different requirements.
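The semantics being discussed can be illustrated with plain Scala collections rather than the DataFrame API. The function name and row representation below are assumptions made for the example, not the eventual Spark method:

```scala
// Toy model of drop-null semantics over rows represented as column -> value
// maps, with null marking a missing value.
def dropNull(rows: Seq[Map[String, Any]], cols: Seq[String] = Nil): Seq[Map[String, Any]] =
  rows.filter { row =>
    // If no columns are given, a row is dropped when any column is null;
    // otherwise only the named columns are checked.
    val checked = if (cols.isEmpty) row.keys.toSeq else cols
    checked.forall(c => row.get(c).exists(_ != null))
  }
```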
[GitHub] spark pull request: [spark] [SPARK-6168] Expose some of the collec...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5084#issuecomment-86835246 Yeah, I agree with Sean and what pretty much everyone else said. My feeling is that we don't want typical Spark users relying on unstable APIs, else it creates a lot of upgrade friction for users and/or pressure on maintainers to stabilize this stuff. We annotate things as unstable and expose them for two reasons: (a) to get feedback on new APIs we intend to stabilize shortly, and (b) to expose internal hooks that are meant for third-party integrations (similar to kernel modules), such as data sources, but not expected to be needed by end users. Exposing internal utilities to end users is, IMO, just not a good idea. I don't think slapping an annotation on something means there is no cost to doing this. Users will rely on this stuff and it will create upgrade friction, and this is why we are careful about what we expose.
[GitHub] spark pull request: [SPARK-6405] Limiting the maximum Kryo buffer ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5218#issuecomment-86834844 LGTM - merged into master.
[GitHub] spark pull request: SPARK-6549 - Spark console logger logs to stde...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5202#issuecomment-86810509 Changing the default logging behavior like this would break compatibility for people who rely on the out-of-the-box behavior. IIRC this change was proposed previously and rejected on the same grounds. Also, I'm not even convinced that stdout is the better default. The reason it's stderr is to isolate user output from Spark's own output. In any case, I think on compatibility grounds we can't merge this.
[GitHub] spark pull request: [SPARK-6544][build] Increment Avro version fro...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5193#issuecomment-86815673 Yeah, LGTM - I think bumping maintenance releases like this has usually been fine for us with Avro. It may be best not to merge this into 1.3, though, since we typically keep dependencies fixed in patch releases.
[GitHub] spark pull request: [SPARK-6554] [SQL] Don't push down predicates ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5210#discussion_r27252832 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala --- @@ -435,11 +435,18 @@ private[sql] case class ParquetRelation2( // Push down filters when possible. Notice that not all filters can be converted to Parquet // filter predicate. Here we try to convert each individual predicate and only collect those // convertible ones. -predicates - .flatMap(ParquetFilters.createFilter) - .reduceOption(FilterApi.and) - .filter(_ => sqlContext.conf.parquetFilterPushDown) - .foreach(ParquetInputFormat.setFilterPredicate(jobConf, _)) +if (sqlContext.conf.parquetFilterPushDown) { + predicates +// Don't push down predicates which reference partition columns +.filter { pred => + val partitionColNames = partitionColumns.map(_.name).toSet + val referencedColNames = pred.references.map(_.name).toSet + referencedColNames.intersect(partitionColNames).isEmpty +} +.flatMap(ParquetFilters.createFilter) --- End diff -- Yeah, Reynold's seems more clear to me.
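The guard in the diff above reduces to a set intersection: a predicate qualifies for pushdown only when it references no partition columns. A minimal standalone sketch (the example column names are made up):

```scala
// A predicate can be pushed down to Parquet only if none of the columns it
// references is a partition column; partition pruning handles those instead.
def canPushDown(referencedCols: Set[String], partitionCols: Set[String]): Boolean =
  referencedCols.intersect(partitionCols).isEmpty
```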
[GitHub] spark pull request: [SPARK-6477][Build]: Run MIMA tests before the...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5145#issuecomment-85371214 LGTM too
[GitHub] spark pull request: [SPARK-4123][Project Infra][WIP]: Show new dep...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5093#discussion_r26996263 --- Diff: dev/tests/pr_new_dependencies.sh --- @@ -0,0 +1,85 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the License); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# +# This script follows the base format for testing pull requests against +# another branch and returning results to be published. More details can be +# found at dev/run-tests-jenkins. +# +# Arg1: The Github Pull Request Actual Commit +#+ known as `ghprbActualCommit` in `run-tests-jenkins` +# Arg2: The SHA1 hash +#+ known as `sha1` in `run-tests-jenkins` +# + +ghprbActualCommit=$1 +sha1=$2 + +MVN_BIN=`pwd`/build/mvn +CURR_CP_FILE=my-classpath.txt +MASTER_CP_FILE=master-classpath.txt + +${MVN_BIN} clean compile dependency:build-classpath 2>/dev/null | \ --- End diff -- If that's the case, we should probably gate this entire thing in a check as to whether any pom.xml files are modified. Then for most builds this will not add any time, since most builds do not modify dependencies.
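The gating suggested in the review comment above might look something like the following sketch. The `master...` diff range and the echoed messages are assumptions about the CI setup, not code from the actual script:

```shell
#!/usr/bin/env bash
# Gate the (slow) dependency-classpath comparison on whether the PR actually
# touches a Maven build file; most builds don't, so they skip the check.

pom_changed() {
  # Reads a list of changed file paths on stdin; succeeds if any is a pom.xml.
  grep -q 'pom\.xml$'
}

if git diff --name-only master... 2>/dev/null | pom_changed; then
  echo "pom.xml modified; running dependency check."
  # ... expensive mvn dependency:build-classpath comparison would go here ...
else
  echo "No pom.xml changes; skipping dependency check."
fi
```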
[GitHub] spark pull request: [spark] [SPARK-6168] Expose some of the collec...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5084#issuecomment-84447923 I agree with Sean and Sandy here. I don't think we should just expose internal utilities like this. The collections classes are simple enough that someone can just copy our code if they want to use it. Exposing `Utilities` seems like a huge widening of visibility; those are designed as internal utilities. The reason we have @Experimental and @DeveloperApi is to provide public APIs we expect to eventually stabilize or that we want feedback on from the community. They still come with a reasonable expectation of stability and an implicit endorsement that we expect people to use them.
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84443666 Okay - if there is no big performance hit in launching from 1.3-1.4, that's fine with me. I guess then the remaining question is whether to make a smaller change (simply make the wildcard match more liberal) vs. doing a larger change.
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84216181 @nishkamravi2 - It's about the startup time of the interactive shell. That's the usability trade-off. If the shell takes 300 extra ms to start up, it is not acceptable from a usability perspective. This is not about performance for a long-running batch processing job.
[GitHub] spark pull request: SPARK-5134 [BUILD] Bump default Hadoop version...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5027#issuecomment-84201074 Looks good - thanks for committing this, Sean.
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84205836 Hey @vanzin - can you explain a bit why the performance impact isn't worrisome? I think a first-order goal was to avoid invoking multiple JVMs that have the assembly jar on the classpath, because it is slow.
[GitHub] spark pull request: [SPARK-6122][Core] Upgrade Tachyon client vers...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4867#issuecomment-84200925 LGTM
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84214306 What is the intuitive explanation for why it is faster than before? It seems like launching the assembly twice would be slower. Separately, is it worth doing a big refactoring for this issue? If we want to support this, maybe we should just widen the regex. Or do you now feel the refactoring is an important design improvement, and it was simply instigated by this issue being reported? For instance, if a change makes starting spark-shell 100ms slower for people, IMO that's bad usability. Responsiveness for things like this can matter when people are forming an impression of the overall performance of a system, and we have components specifically designed for a quick interactive experience.
[GitHub] spark pull request: [SPARK-5847] Allow for namespacing metrics by ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4632#issuecomment-84209577 Hey, @JoshRosen can probably take a look at this. One thing though: IIRC the reason why we have a unique ID here is because some sets of users requested the exact opposite behavior from what you want: https://issues.apache.org/jira/browse/SPARK-3377 If we do decide to allow this, I think it's best to have a specific set of naming schemes (don't give them a latch to choose an arbitrary conf), in order to limit the surface area of possible configurations. I'm also not so sure about the right way to do this in terms of the current MetricsConfig infrastructure. It seems a bit weird to hard-code handling of this particular configuration in the MetricsConfig class. Only had time to take a quick look though, and I'm not familiar with that code at all.
[GitHub] spark pull request: [SPARK-4123][Project Infra][WIP]: Show new dep...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5093#discussion_r26882829 --- Diff: dev/tests/pr_new_dependencies.sh ---
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#
+# This script follows the base format for testing pull requests against
+# another branch and returning results to be published. More details can be
+# found at dev/run-tests-jenkins.
+#
+# Arg1: The Github Pull Request Actual Commit
+#+     known as `ghprbActualCommit` in `run-tests-jenkins`
+# Arg2: The SHA1 hash
+#+     known as `sha1` in `run-tests-jenkins`
+#
+
+ghprbActualCommit=$1
+sha1=$2
+
+CURR_CP_FILE=my-classpath.txt
+MASTER_CP_FILE=master-classpath.txt
+
+./build/mvn clean compile dependency:build-classpath | \
+  sed -n -e '/Building Spark Project Assembly/,$p' | \
+  grep --context=1 -m 2 "Dependencies classpath:" | \
+  head -n 3 | \
+  tail -n 1 | \
+  tr ":" "\n" | \
+  rev | \
+  cut -d "/" -f 1 | \
+  rev | \
+  sort > ${CURR_CP_FILE}
+
+# Checkout the master branch to compare against
+git checkout apache/master
+
+./build/mvn clean compile dependency:build-classpath | \
+  sed -n -e '/Building Spark Project Assembly/,$p' | \
+  grep --context=1 -m 2 "Dependencies classpath:" | \
+  head -n 3 | \
+  tail -n 1 | \
+  tr ":" "\n" | \
+  rev | \
+  cut -d "/" -f 1 | \
+  rev | \
+  sort > ${MASTER_CP_FILE}
+
+DIFF_RESULTS=`diff my-classpath.txt master-classpath.txt`
+
+if [ -z "${DIFF_RESULTS}" ]; then
+  echo " * This patch adds no new dependencies."
+else
+  # Pretty print the new dependencies
+  new_deps=$(echo "${DIFF_RESULTS}" | grep ">" | cut -d " " -f2 | awk '{print " * " $1}')
+  echo " * This patch **adds the following new dependencies:**\n${new_deps}"
--- End diff --
It would actually be good to show additions and removals of dependencies (in some cases a library will get a new version, which ends up here as an addition and removal). I updated the JIRA title to reflect that.
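The suggestion in the review comment above (report removals as well as additions) can be sketched with `comm` rather than `diff`. This is only an illustrative sketch, not the actual Jenkins test: the file names mirror the script, but the sample jar names are made up.

```shell
# Illustrative sketch: given two sorted classpath listings like the ones the
# script produces, report both added and removed dependencies. A version bump
# then shows up as one addition plus one removal. Sample contents are made up.
printf 'guava-14.0.jar\nspark-core.jar\n' > master-classpath.txt
printf 'guava-15.0.jar\nspark-core.jar\n' > my-classpath.txt

# comm expects sorted input; -13 prints lines unique to the second file
# (dependencies added by the PR), -23 lines unique to the first (removed).
added=$(comm -13 master-classpath.txt my-classpath.txt)
removed=$(comm -23 master-classpath.txt my-classpath.txt)

[ -n "$added" ]   && echo " * Added: $added"
[ -n "$removed" ] && echo " * Removed: $removed"
```

With the sample files above, `guava-15.0.jar` is reported as added and `guava-14.0.jar` as removed, which is exactly the "library got a new version" case the comment mentions.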
[GitHub] spark pull request: [SPARK-6371] [build] Update version to 1.4.0-S...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5056#issuecomment-84085364 LGTM - I did some searching around and I think this upgrades all the necessary places.
[GitHub] spark pull request: [SPARK-6275][Documentation]Miss toDF() functio...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4977#issuecomment-81428099 I've pushed this change to the website. http://spark.apache.org/docs/latest/sql-programming-guide.html
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-78219718 Okay cool - LGTM, I will pull this in. I just did some local sanity tests, built with Maven and ran (a few) Maven tests. We'll need to keep an eye on the Maven build tomorrow to see if this is succeeding there, since it's not caught by the PRB. Thanks @vanzin for all the work on this.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-78003405 @andrewor14 are you good with this one? I'd like to merge it soon!
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-77929101 Hey Marcelo, I think this is really close to being ready to merge. I noticed some extra output (a '0') when running usage on `spark-shell`:
```
./bin/spark-shell --help
0
Usage: ./bin/spark-shell [options]
Spark assembly has been built with Hive, including Datanucleus jars in classpath.
```
I think whether this should be an exposed API is still a question. However, I'm fine to merge v1 of this as an exposed API and get feedback. It would be helpful to hear from potential users whether this is useful to them... so far I haven't heard much in that direction.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26065985 --- Diff: launcher/src/main/java/org/apache/spark/launcher/Main.java ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Command line interface for the Spark launcher. Used internally by Spark scripts.
+ */
+class Main {
+
+  /**
+   * Usage: Main [class] [class args]
+   * <p/>
+   * This CLI works in two different modes:
+   * <ul>
+   *   <li>spark-submit: if <i>class</i> is org.apache.spark.deploy.SparkSubmit, the
+   *   {@link SparkLauncher} class is used to launch a Spark application.</li>
+   *   <li>spark-class: if another class is provided, an internal Spark class is run.</li>
+   * </ul>
+   *
+   * This class works in tandem with the bin/spark-class script on Unix-like systems, and
+   * bin/spark-class2.cmd batch script on Windows to execute the final command.
+   * <p/>
+   * On Unix-like systems, the output is a list of command arguments, separated by the NULL
+   * character.
On Windows, the output is single command line suitable for direct execution
+   * form the script.
--- End diff --
from
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26067056 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java ---
@@ -0,0 +1,327 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Special launcher for handling a CLI invocation of SparkSubmit.
+ * <p/>
+ * This launcher extends SparkLauncher to add command line parsing compatible with
--- End diff --
I'm not sure this comment is correct. I think a `SparkLauncher` encapsulates a `SparkSubmitCommandBuilder` rather than the latter extending the former.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26067102 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java ---
@@ -0,0 +1,327 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Special launcher for handling a CLI invocation of SparkSubmit.
--- End diff --
I don't think this is still a subtype of `Launcher`
[GitHub] spark pull request: [SPARK-5134] Bump default hadoop.version to 2....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3917#issuecomment-5142 Hey, I commented on the JIRA, but some recent changes in the way we publish artifacts actually make this a more tenable change. https://issues.apache.org/jira/browse/SPARK-5134
[GitHub] spark pull request: [SPARK-6200] [SQL] Add a manager for dialects
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4939#issuecomment-77736075 It might be simpler to just accept a dialect via a configuration option. Is the idea here that someone might switch dialects multiple times in a single session?
[GitHub] spark pull request: SPARK-6182 [BUILD] spark-parent pom needs to b...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4912#issuecomment-77429640 I tested this locally and it worked well. I'd like to pull this in to do further testing.
[GitHub] spark pull request: SPARK-6182 [BUILD] spark-parent pom needs to b...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4913#issuecomment-77429517 I don't think this works in its current form because the artifact names are laid out in the parent pom and those can still only have a single version. The only way I think we could preserve having a single parent pom is to simply not declare any scala dependencies in the parent pom and only declare them inline in each child pom, which is generally bad form. So given that #4912 is simpler, I'm inclined to just pull that one in.
[GitHub] spark pull request: [SPARK-6149] [SQL] [Build] Excludes Guava 15 r...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4890#issuecomment-77203459 @liancheng can you put this exclusion in the pom.xml file instead of in sbt? If you look there we already have other exclusions.
[GitHub] spark pull request: SPARK-5143 [BUILD] [WIP] spark-network-yarn 2....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4876#issuecomment-77246969 I commented on the JIRA. This LGTM as an immediate fix - clearly the property is not correct in the published 2.11 poms.
[GitHub] spark pull request: [SPARK-6149] [SQL] [Build] Excludes Guava 15 r...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4890#issuecomment-77306302 I'm going to optimistically merge this to cut another RC.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25647202 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -184,7 +184,7 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
     val entry = entries.remove(blockId)
     if (entry != null) {
       currentMemory -= entry.size
-      logInfo(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
+      logDebug(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
--- End diff --
I believe we do have INFO level logging for this up the call chain when drops are blocked due to cache contention: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1004 Might be nice to augment that logging to have information on the size and limit (like this does).
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25646752 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -184,7 +184,7 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
     val entry = entries.remove(blockId)
     if (entry != null) {
       currentMemory -= entry.size
-      logInfo(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
+      logDebug(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
--- End diff --
On this one - do you know if this already gets logged somewhere else if a block is dropped from memory due to contention? It would be good to make sure there is some INFO level logging when a block is dropped due to memory being exceeded.
[GitHub] spark pull request: SPARK-5390 [DOCS] Encourage users to post on S...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4843#discussion_r25646936 --- Diff: docs/index.md ---
@@ -115,6 +115,8 @@ options for deployment:
 * [Spark Homepage](http://spark.apache.org)
 * [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
+* [Spark Community](http://spark.apache.org/community.html) resources, including local meetups
--- End diff --
Ah I see - sounds good!
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25648384 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -574,6 +583,11 @@ private[spark] object JsonProtocol {
     SparkListenerExecutorRemoved(time, executorId, reason)
   }
 
+  def logStartFromJson(json: JValue): SparkListenerLogStart = {
+    val version = (json \ "Spark Version").extract[String]
--- End diff --
minor - but maybe call this `sparkVersion` so it's clear that this isn't a version for the logging format (since that doesn't have its own versioning).
[GitHub] spark pull request: SPARK-5390 [DOCS] Encourage users to post on S...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4843#discussion_r25646565 --- Diff: docs/index.md ---
@@ -115,6 +115,8 @@ options for deployment:
 * [Spark Homepage](http://spark.apache.org)
 * [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
+* [Spark Community](http://spark.apache.org/community.html) resources, including local meetups
--- End diff --
minor - but could this say "including mailing lists and local meetups"? I think it would be good to still mention the mailing list here, but moving this dropped that text.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4821#issuecomment-76855580 Okay I took a close look through this and it LGTM
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4799#issuecomment-76862763 Okay - thanks everyone for helping with this. I'll pull it in.
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4799#discussion_r25582753 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -92,6 +92,16 @@ private[spark] class Executor(
   private val executorActor = env.actorSystem.actorOf(
     Props(new ExecutorActor(executorId)), "ExecutorActor")
 
+  // Whether to load classes in user jars before those in Spark jars
+  private val userClassPathFirst: Boolean = {
+    val oldKey = "spark.files.userClassPathFirst"
+    val newKey = "spark.executor.userClassPathFirst"
+    if (conf.contains(oldKey)) {
+      logWarning(s"$oldKey is deprecated. Please use $newKey instead.")
--- End diff --
Log it on the driver side during validation, or at some earlier point. I just think having it here isn't super useful.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4821#issuecomment-76782633 @vanzin so just to be clear - you are anticipating a future case where we need to read the version to correctly parse the logs? Is that the argument for it? I am only making a simple observation that here we write something and never read it.
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4799#discussion_r25573606 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -92,6 +92,16 @@ private[spark] class Executor( private val executorActor = env.actorSystem.actorOf( Props(new ExecutorActor(executorId)), ExecutorActor) + // Whether to load classes in user jars before those in Spark jars + private val userClassPathFirst: Boolean = { +val oldKey = "spark.files.userClassPathFirst" +val newKey = "spark.executor.userClassPathFirst" +if (conf.contains(oldKey)) { + logWarning(s"$oldKey is deprecated. Please use $newKey instead.") --- End diff -- This will log it on the executors, which seems weird since all other configuration warnings are logged at the driver or submission client. Can we just log this warning on the driver side? Or just not log a deprecation warning at all?
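The driver-side alternative suggested in this review could be sketched roughly as below. The helper object, its name, and the key map are illustrative assumptions, not Spark's actual validation code:

```scala
// Hypothetical driver-side validation helper: warn once about deprecated
// keys when the configuration is validated on the driver, instead of
// logging the same warning on every executor.
object ConfValidation {
  // Illustrative mapping of deprecated keys to their replacements.
  private val deprecatedKeys = Map(
    "spark.files.userClassPathFirst" -> "spark.executor.userClassPathFirst")

  def warnDeprecated(conf: Map[String, String], logWarning: String => Unit): Unit = {
    for ((oldKey, newKey) <- deprecatedKeys if conf.contains(oldKey)) {
      logWarning(s"$oldKey is deprecated. Please use $newKey instead.")
    }
  }
}
```

Centralizing the warning this way also keeps executor logs free of configuration noise, which is the concern raised in the comment.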
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25573801 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends Logging { /** * Write metadata about the event log to the given stream. * - * The header is a serialized version of a map, except it does not use Java serialization to - * avoid incompatibilities between different JDKs. It writes one map entry per line, in - * key=value format. - * - * The very last entry in the header is the `HEADER_END_MARKER` marker, so that the parsing code - * can know when to stop. - * - * The format needs to be kept in sync with the openEventLog() method below. Also, it cannot - * change in new Spark versions without some other way of detecting the change (like some - * metadata encoded in the file name). + * The header is a single line of JSON in the beginning of the file. Note that this + * assumes all metadata necessary to parse the log is also included in the file name. + * The format needs to be kept in sync with the `openEventLog()` method below. Also, it + * cannot change in new Spark versions without some other way of detecting the change. * - * @param logStream Raw output stream to the even log file. + * @param logStream Raw output stream to the event log file. * @param compressionCodec Optional compression codec to use. - * @return A stream where to write event log data. This may be a wrapper around the original + * @return A stream to which event log data is written. This may be a wrapper around the original * stream (for example, when compression is enabled). 
*/ def initEventLog( logStream: OutputStream, compressionCodec: Option[CompressionCodec]): OutputStream = { -val meta = mutable.HashMap(("version" -> SPARK_VERSION)) +val metadata = new mutable.HashMap[String, String] +// Some of these metadata are already encoded in the file name +// Here we include them again within the file itself for completeness +metadata += ("Event" -> Utils.getFormattedClassName(SparkListenerMetadataIdentifier)) +metadata += (SPARK_VERSION_KEY -> SPARK_VERSION) compressionCodec.foreach { codec => - meta += ("compressionCodec" -> codec.getClass().getName()) + metadata += (COMPRESSION_CODEC_KEY -> codec.getClass.getCanonicalName) } - -def write(entry: String) = { - val bytes = entry.getBytes(Charsets.UTF_8) - if (bytes.length > MAX_HEADER_LINE_LENGTH) { -throw new IOException(s"Header entry too long: ${entry}") - } - logStream.write(bytes, 0, bytes.length) +val metadataJson = compact(render(JsonProtocol.mapToJson(metadata))) +val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8) +if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) { + throw new IOException(s"Event log metadata too long: $metadataJson") } - -meta.foreach { case (k, v) => write(s"$k=$v\n") } -write(s"$HEADER_END_MARKER\n") - compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream) +logStream.write(metadataBytes, 0, metadataBytes.length) +logStream } /** * Return a file-system-safe path to the log file for the given application. * + * Note that because we currently only create a single log file for each application, + * we must encode all the information needed to parse this event log in the file name + * instead of within the file itself. Otherwise, if the file is compressed, for instance, + * we won't know which codec to use to decompress the metadata. + * * @param logBaseDir Directory where the log file will be written. * @param appId A unique app ID. 
+ * @param compressionCodecName Name of the compression codec used to compress the contents + * of the log, or None if compression is not enabled. * @return A path which consists of file-system-safe characters. */ - def getLogPath(logBaseDir: String, appId: String): String = { -val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase -Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/") + def getLogPath( + logBaseDir: String, + appId: String, + compressionCodecName: Option[String] = None): String = { +val sanitizedAppId = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase +// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1 +// e.g. EVENT_LOG_
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25573931 --- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala --- @@ -23,7 +23,9 @@ private[spark] class ApplicationDescription( val memoryPerSlave: Int, val command: Command, var appUiUrl: String, -val eventLogDir: Option[String] = None) +val sparkVersion: String, --- End diff -- What about just passing a single option `eventLogFile` here (i.e. the full path to the log file)? The master only uses the eventLogDir, sparkVersion, and eventLogCodec to reconstruct the filename. Reconstructing it in two places seems a little brittle.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25573776 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends Logging { /** * Write metadata about the event log to the given stream. * - * The header is a serialized version of a map, except it does not use Java serialization to - * avoid incompatibilities between different JDKs. It writes one map entry per line, in - * key=value format. - * - * The very last entry in the header is the `HEADER_END_MARKER` marker, so that the parsing code - * can know when to stop. - * - * The format needs to be kept in sync with the openEventLog() method below. Also, it cannot - * change in new Spark versions without some other way of detecting the change (like some - * metadata encoded in the file name). + * The header is a single line of JSON in the beginning of the file. Note that this + * assumes all metadata necessary to parse the log is also included in the file name. + * The format needs to be kept in sync with the `openEventLog()` method below. Also, it + * cannot change in new Spark versions without some other way of detecting the change. * - * @param logStream Raw output stream to the even log file. + * @param logStream Raw output stream to the event log file. * @param compressionCodec Optional compression codec to use. - * @return A stream where to write event log data. This may be a wrapper around the original + * @return A stream to which event log data is written. This may be a wrapper around the original * stream (for example, when compression is enabled). 
*/ def initEventLog( logStream: OutputStream, compressionCodec: Option[CompressionCodec]): OutputStream = { -val meta = mutable.HashMap(("version" -> SPARK_VERSION)) +val metadata = new mutable.HashMap[String, String] +// Some of these metadata are already encoded in the file name +// Here we include them again within the file itself for completeness +metadata += ("Event" -> Utils.getFormattedClassName(SparkListenerMetadataIdentifier)) +metadata += (SPARK_VERSION_KEY -> SPARK_VERSION) compressionCodec.foreach { codec => - meta += ("compressionCodec" -> codec.getClass().getName()) + metadata += (COMPRESSION_CODEC_KEY -> codec.getClass.getCanonicalName) } - -def write(entry: String) = { - val bytes = entry.getBytes(Charsets.UTF_8) - if (bytes.length > MAX_HEADER_LINE_LENGTH) { -throw new IOException(s"Header entry too long: ${entry}") - } - logStream.write(bytes, 0, bytes.length) +val metadataJson = compact(render(JsonProtocol.mapToJson(metadata))) +val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8) +if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) { + throw new IOException(s"Event log metadata too long: $metadataJson") } - -meta.foreach { case (k, v) => write(s"$k=$v\n") } -write(s"$HEADER_END_MARKER\n") - compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream) +logStream.write(metadataBytes, 0, metadataBytes.length) +logStream } /** * Return a file-system-safe path to the log file for the given application. * + * Note that because we currently only create a single log file for each application, + * we must encode all the information needed to parse this event log in the file name + * instead of within the file itself. Otherwise, if the file is compressed, for instance, + * we won't know which codec to use to decompress the metadata. + * * @param logBaseDir Directory where the log file will be written. * @param appId A unique app ID. 
+ * @param compressionCodecName Name of the compression codec used to compress the contents + * of the log, or None if compression is not enabled. * @return A path which consists of file-system-safe characters. */ - def getLogPath(logBaseDir: String, appId: String): String = { -val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase -Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/") + def getLogPath( + logBaseDir: String, + appId: String, + compressionCodecName: Option[String] = None): String = { +val sanitizedAppId = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase +// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1 --- End diff -- I don't see where the `EVENT_LOG` prefix is actually
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25574579 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1074,7 +1074,7 @@ private[spark] class BlockManager( * Remove all blocks belonging to the given broadcast. */ def removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int = { -logInfo(s"Removing broadcast $broadcastId") --- End diff -- Yeah that one is sufficient then. I looked earlier, but only at the `doDestroy` implementations, which did not have info-level logging.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25574126 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -99,6 +99,12 @@ case class SparkListenerExecutorRemoved(time: Long, executorId: String, reason: extends SparkListenerEvent /** + * A special dummy event used to identify the metadata header in event logs. + * This is not actually posted anywhere. + */ +private[spark] case object SparkListenerMetadataIdentifier extends SparkListenerEvent --- End diff -- Since this is never used, why don't we instead just ignore this event type when reading from JSON (i.e. allow sparkEventFromJson to return None). It seems a bit strange to add this event but it contains nothing that's ever used.
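The suggested alternative can be sketched as below. Note this is a loose sketch: the event types, the JSON shape, and the `Option`-valued signature are the proposed change being discussed, not Spark's actual `sparkEventFromJson` API:

```scala
// Sketch: an Option-valued parser that skips event types it does not
// recognize (such as a metadata header line), so no dummy
// SparkListenerEvent subtype is needed just to mark the header.
sealed trait SparkListenerEvent
case class SparkListenerAppStart(appName: String) extends SparkListenerEvent

def sparkEventFromJson(json: Map[String, String]): Option[SparkListenerEvent] =
  json.get("Event") match {
    case Some("SparkListenerApplicationStart") =>
      Some(SparkListenerAppStart(json.getOrElse("App Name", "")))
    case _ =>
      None  // unknown event (e.g. the metadata header): ignore it
  }
```

The replay loop would then simply drop `None` results instead of matching on a placeholder case object.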
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4821#issuecomment-76641955 I took a pass on this with some feedback. Overall, it would be good to really minimize the scope of the changes since this is so late in the game. There is some clean-up and renaming etc that would be best just left out of the patch. The main thing I'm wondering is why we need this header at all. It doesn't even ever get used by our own replay - we just ignore it. It seems like it was added for the purpose of conveying the compression codec to bootstrap replaying the file, however just having an extension seems like a better, much more standard way of doing that. The only argument I see for it is that the header could be used in the future to encode things that are necessary for proper replay of the logs. However, in that case I don't see why we can't just add it later if and when those things occur. I guess I don't see a good argument against a straw man of just not having the header. Curious to hear thoughts from @andrewor14 and @vanzin.
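The extension-based straw man can be sketched as follows. The codec class names and the extension mapping here are illustrative assumptions, not the mapping Spark actually ships:

```scala
// Sketch: derive the compression codec from the log file's extension
// rather than from an in-file header, so the file can be opened without
// first reading (possibly compressed) metadata.
val codecByExtension: Map[String, String] = Map(
  "lz4"    -> "org.apache.spark.io.LZ4CompressionCodec",
  "snappy" -> "org.apache.spark.io.SnappyCompressionCodec")

def codecFor(logPath: String): Option[String] = {
  val ext = logPath.split('.').last
  codecByExtension.get(ext)  // None means the log is treated as uncompressed
}
```

This is the "standard way" the comment alludes to: like `.gz` or `.snappy` suffixes on Hadoop output files, the name alone tells the reader how to open the stream.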
[GitHub] spark pull request: [SPARK-6050] [yarn] Add config option to do la...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4818#issuecomment-76659139 @vanzin okay so maybe set this to true then? I don't have any opinion, but would love to get this in as it's one of the only release blockers.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25579888 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis * --- End diff -- This doc might be better if it said "in Spark 1.2.X and earlier" instead of "in previous releases".
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25579899 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis * * @return 2-tuple of (input stream of the events, version of Spark which wrote the log) --- End diff -- This is outdated now
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25579955 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis * --- End diff -- Actually I guess this handles multiple versions. My bad, fine to leave as is
[GitHub] spark pull request: [SPARK-6074] [sql] Package pyspark sql binding...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4822#issuecomment-76563416 Good catch @vanzin. This LGTM. I did some testing to verify that the assembly includes all relevant python files now: ``` $ jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar | grep \\.py$ | grep pyspark | wc 62 62 1480 $ find ./python/pyspark/ | grep \\.py$ | wc 62 62 2100 ```
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4799#issuecomment-76564115 I agree @andrewor14 can you add documentation about deprecated configs? I would extend what's there now: ``` Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key. ```
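The precedence rule described in that documentation snippet amounts to something like the following sketch (a standalone illustration, not SparkConf's implementation; the key names are the ones under discussion in this thread):

```scala
// Sketch: when both the renamed key and its deprecated predecessor are
// set, the newer key wins; the deprecated key is only a fallback.
def resolve(conf: Map[String, String], newKey: String, oldKey: String): Option[String] =
  conf.get(newKey).orElse(conf.get(oldKey))

val conf = Map(
  "spark.executor.userClassPathFirst" -> "true",   // new key
  "spark.files.userClassPathFirst"    -> "false")  // deprecated key

// The new key's value takes precedence over the deprecated one.
resolve(conf, "spark.executor.userClassPathFirst", "spark.files.userClassPathFirst")
```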
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563232 --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala --- @@ -188,10 +188,10 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging { /** Perform broadcast cleanup. */ def doCleanupBroadcast(broadcastId: Long, blocking: Boolean) { try { - logDebug("Cleaning broadcast " + broadcastId) + logDebug(s"Cleaning broadcast $broadcastId") --- End diff -- I'm +1 on removing these cleaning ones. It's really not useful or actionable.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4838#issuecomment-76563715 Chimed in in a few places. I think overall, a good goal is that when we are doing our normal GC of RDDs and broadcasts, we don't want to be so verbose. This cleaning occurs asynchronously and it's confusing for users to see these messages. When dropping blocks due to cache contention, though, I'm not so sure we want to silence these messages.
[GitHub] spark pull request: [SPARK-6055] [PySpark] fix incorrect DataType....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4809#issuecomment-76564137 @davies can you close this? (auto close doesn't work for the backport commits).
[GitHub] spark pull request: [SPARK-6055] [PySpark] fix incorrect DataType....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4810#issuecomment-76564149 @davies can you close this? (auto close doesn't work for the backport commits).
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563229 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -371,7 +371,7 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long) private def ensureFreeSpace( blockIdToAdd: BlockId, space: Long): ResultWithDroppedBlocks = { -logInfo(s"ensureFreeSpace($space) called with curMem=$currentMemory, maxMem=$maxMemory") +logDebug(s"ensureFreeSpace($space) called with curMem=$currentMemory, maxMem=$maxMemory") --- End diff -- This one I'd say is fairly important. I've used it somewhat often in production clusters to understand when things are dropping to disk and why.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563255 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1074,7 +1074,7 @@ private[spark] class BlockManager( * Remove all blocks belonging to the given broadcast. */ def removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int = { -logInfo(s"Removing broadcast $broadcastId") --- End diff -- It would be good to make sure that when the user explicitly `destroy()`s a broadcast, we do have some INFO level logging that we are doing it. I think that can be achieved by dropping this one to debug and logging things higher up in the call stack in the destroy method of either broadcast implementation.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563263 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala --- @@ -476,16 +476,16 @@ private[spark] class BlockManagerInfo( val blockStatus: BlockStatus = _blocks.get(blockId) _blocks.remove(blockId) if (blockStatus.storageLevel.useMemory) { -logInfo("Removed %s on %s in memory (size: %s, free: %s)".format( +logDebug("Removed %s on %s in memory (size: %s, free: %s)".format( --- End diff -- These ones also seem useful in some cases for reasoning about dropping blocks due to cache contention.
[GitHub] spark pull request: [SPARK-6070] [yarn] Remove unneeded classes fr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4820#issuecomment-76514114 Thanks Marcelo, pulling this in!
[GitHub] spark pull request: [SPARK-5979][SPARK-6032] Smaller safer --packa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4802#issuecomment-76514538 Pulling this in - thanks Burak!
[GitHub] spark pull request: [SPARK-5979][SPARK-6031][SPARK-6032][SPARK-604...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4754#issuecomment-76514573 @brkyvz let's close this issue for now and keep it in our back pocket. We can use it if we decide to put this in the 1.3 branch down the line.
[GitHub] spark pull request: SPARK-6063
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4815#issuecomment-76450569 LGTM
[GitHub] spark pull request: [SPARK-6070] [yarn] Remove unneeded classes fr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4820#issuecomment-76492395 Thanks for finding this! Either approach seems fine to me. This LGTM... it might be nice to add a comment explaining that this is needed because the root pom is declaring them at compile scope.
[GitHub] spark pull request: [SPARK-6070] [yarn] Remove unneeded classes fr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4820#issuecomment-76498092 Cool LGTM
[GitHub] spark pull request: [SPARK-6050] [yarn] Add config option to do la...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4818#issuecomment-76502508 It seems reasonable to me to have the default of false and make a comment in the release notes. No strong feelings here though.
[GitHub] spark pull request: [SPARK-5979][SPARK-6032] Smaller safer fix
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4802#issuecomment-76352327 LGTM
[GitHub] spark pull request: SPARK-4579 [WEBUI] Scheduling Delay appears ne...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4796#issuecomment-76301218 Jenkins, test this please.
[GitHub] spark pull request: SPARK-4579 [WEBUI] Scheduling Delay appears ne...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4796#issuecomment-76301244 Yeah this would be great to pull into an RC... a very confusing presentation in the UI right now!
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4799#issuecomment-76320547 @vanzin does this look okay to you? I commented on the JIRA, but my main goal is to find a surgical patch here that can unblock the release, so just rewinding the change seems the safest.
[GitHub] spark pull request: [SPARK-5979][SPARK-6031][SPARK-6032][SPARK-604...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4754#discussion_r25482866

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala ---
@@ -82,13 +85,25 @@ private[spark] object SparkSubmitDriverBootstrapper {
       .orElse(confDriverMemory)
       .getOrElse(defaultDriverMemory)

-    val newClasspath =
+    var newClasspath =
       if (submitClasspath.isDefined) {
         classpath
       } else {
         classpath + confClasspath.map(sys.props("path.separator") + _).getOrElse("")
       }

+    // Resolve maven dependencies if there are any and add them to classpath. Also send them
+    // to SparkSubmit so that they can be shipped to executors.
+    val resolvedMavenCoordinates =
+      SparkSubmitUtils.resolveMavenCoordinates(
+        submitPackages, submitRepositories, confIvyRepo)
+    if (resolvedMavenCoordinates.nonEmpty) {
+      newClasspath += sys.props("path.separator") +
+        resolvedMavenCoordinates.mkString(sys.props("path.separator"))
+      submitArgs =
+        Array("--packages-resolved", resolvedMavenCoordinates.mkString(",")) ++ submitArgs
--- End diff --

Can we thread this through using an environment variable `_PACKAGES_RESOLVED`? Having this as an extra flag forces you to make `args` here mutable, which is sort of strange.
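A hypothetical sketch (not the actual patch) of the alternative suggested above: thread the resolved maven coordinates to the child process through an environment variable rather than prepending an extra `--packages-resolved` flag, so the argument array can stay immutable. The variable name `_PACKAGES_RESOLVED` and both helper methods are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EnvThreadingSketch {
    static final String PACKAGES_RESOLVED_ENV = "_PACKAGES_RESOLVED"; // assumed name

    // Parent side: put the resolved jar paths into the child's environment.
    static Map<String, String> childEnv(List<String> resolvedJars) {
        Map<String, String> env = new HashMap<>();
        env.put(PACKAGES_RESOLVED_ENV, String.join(",", resolvedJars));
        return env;
    }

    // Child side: read the list back without ever touching argv.
    static List<String> readResolved(Map<String, String> env) {
        String value = env.getOrDefault(PACKAGES_RESOLVED_ENV, "");
        return value.isEmpty() ? List.of() : Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        Map<String, String> env = childEnv(List.of("a.jar", "b.jar"));
        System.out.println(readResolved(env)); // prints [a.jar, b.jar]
    }
}
```

The child-side reader takes a map argument purely so it can be exercised without forking a real process; production code would read `System.getenv()`.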
[GitHub] spark pull request: [SPARK-5982] Remove incorrect Local Read Time ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4749#issuecomment-76054412 LGTM - thanks Kay!
[GitHub] spark pull request: [SPARK-5914] to run spark-submit requiring onl...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4742#discussion_r25237061

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -439,7 +439,14 @@ private[spark] object Utils extends Logging {
       executeAndGetOutput(Seq("tar", "-xf", fileName), targetDir)
     }
     // Make the file executable - That's necessary for scripts
-    FileUtil.chmod(targetFile.getAbsolutePath, "a+x")
+    if(isWindows){
--- End diff --

mind adding a space after `if`?
[GitHub] spark pull request: [SPARK-5914] to run spark-submit requiring onl...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4742#issuecomment-75720961 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-5979] Made --package exclusions more re...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4754#issuecomment-75911222 LGTM
[GitHub] spark pull request: [SPARK-5158] [core] [security] Spark standalon...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4106#issuecomment-75861282 The model I had in mind for this patch was to support dedicated clusters/appliances based on Spark where the Spark cluster itself is fully trusted and not multi-tenant. @harishreedharan - trying to have secured key distribution in standalone mode is a very large undertaking; this is intentionally designed to avoid that in order to support a more limited security model. Trying to support arbitrary user code that you don't trust (within the cluster) is also not the intention - there aren't arbitrary users, it's a single embedded application. We've had several requests from users for some simpler security mechanism in standalone mode. The main concern I see here is that the KDC can't handle frequent authentication requests of the same principal from multiple hosts. That might render this approach untenable. It would be worth testing and understanding the limitations in that regard.
[GitHub] spark pull request: [SPARK-5993][Streaming][Build] Fix assembly ja...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4753#issuecomment-75895962 LGTM - I tested this approach locally and it worked.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25148046

--- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
--- End diff --

IMO - it might be simpler to just not have this interface. It's confusing because a lot of the heavy lifting for building commands is in SparkLauncher, but that class doesn't actually implement this. Also, you never interact with `CommandBuilder`s in a generic way; you only use it in a case where you already have a branch distinguishing the only two possible implementations. In cases such as this, I'm not sure the code is simpler to understand when there is an interface.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25145503

--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -0,0 +1,684 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileFilter;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.jar.JarFile;
+import java.util.regex.Pattern;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Launcher for Spark applications.
+ * <p/>
+ * Use this class to start Spark applications programmatically. The class uses a builder pattern
+ * to allow clients to configure the Spark application and launch it as a child process.
+ * <p/>
+ * Note that launching Spark applications using this class will not automatically load environment
+ * variables from the spark-env.sh or spark-env.cmd scripts in the configuration directory.
+ */
+public class SparkLauncher {
+
+  /** The Spark master. */
+  public static final String SPARK_MASTER = "spark.master";
+
+  /** Configuration key for the driver memory. */
+  public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver class path. */
+  public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
+  /** Configuration key for the driver VM options. */
+  public static final String DRIVER_EXTRA_JAVA_OPTIONS = "spark.driver.extraJavaOptions";
+  /** Configuration key for the driver native library path. */
+  public static final String DRIVER_EXTRA_LIBRARY_PATH = "spark.driver.extraLibraryPath";
+
+  /** Configuration key for the executor memory. */
+  public static final String EXECUTOR_MEMORY = "spark.executor.memory";
--- End diff --

I'm not sure any of these `EXECUTOR_XX` constants are ever used. Are they intended as convenience constants for users?
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-75498418 Hey @vanzin okay took another quite long look (sorry, was delayed this week due to Strata) and I have embarrassingly few useful comments given how long I looked at it. Overall, the changes you made all seem great. Unfortunately, since spark-submit as it stands before your patch is a fairly wide interface, it's difficult for me to conceive of every possible combination of settings/environments/etc. that could be broken by this. I think we'll just have to merge it and hope it gets a substantial amount of user testing. The main thing left for me is the proposal to expose this as a public interface to Spark for application developers. The main issue I see is that it just won't be backwards compatible as it stands now, and this is something where I think compatibility is pretty important. One main use case I see here is that some third-party application wants a nice way of submitting Spark applications to user clusters that already have Spark installed, so they bundle this up with their app. At least, that's the main reason I've heard users directly ask for this feature. If they have to grab a newer version of this for every different Spark version users might have installed, that's a big pain for them. And it's the exact opposite of the API guarantees in the rest of Spark. And there is an alternative that _is_ backwards compatible, which is for them to just script around `spark-submit`. So users are in a weird place where there are two options and neither is strictly optimal. To me that seems like a blocker to exposing it in the current form without exploring more options. If we want to make this backwards compatible, we'd need to minimize the interface between the publicly exposed library and the supplied Spark distribution. If the interface from the user app is that we fork a subprocess when they call launch(), could that subprocess just invoke the spark-submit script in the distribution, which will then itself compute the classpath, etc.? We'd just have to detect whether to invoke the Windows or Unix version, but that seems pretty doable. Could that work? As a point of process, I'd be happy to explore various options for making this work by just merging a version where this is private and then building on top of that, to avoid staleness and allow for ample testing. I really do think it's a good idea to have a programmatic version of spark-submit.
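A rough sketch of the backwards-compatible option floated above: `launch()` forks the spark-submit script shipped with whatever Spark distribution the user has installed (which then computes the classpath itself), picking the Windows or Unix variant of the script. The `sparkHome` handling and argument passing here are illustrative assumptions, not the launcher library's real API.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ScriptLauncherSketch {
    static boolean isWindows() {
        return System.getProperty("os.name").toLowerCase().contains("win");
    }

    // Build the argv that would be handed to a ProcessBuilder; the installed
    // script, not this library, resolves the classpath for its own version.
    static List<String> submitCommand(String sparkHome, List<String> appArgs) {
        String script = isWindows() ? "spark-submit.cmd" : "spark-submit";
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkHome + File.separator + "bin" + File.separator + script);
        cmd.addAll(appArgs);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = submitCommand("/opt/spark", List.of("--master", "local", "app.jar"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Because the only contract between the library and the distribution is the script's command-line interface, a single library version can drive many installed Spark versions.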
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25147958

--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Command builder for internal Spark classes.
+ * <p/>
+ * This class handles building the command to launch all internal Spark classes except for
+ * SparkSubmit (which is handled by the {@link SparkSubmitCommandBuilder} class).
+ */
+class SparkClassCommandBuilder extends SparkLauncher implements CommandBuilder {
--- End diff --

It might be worth noting here that this never actually uses `launch` and instead it just uses the shared plumbing for building commands. It's a bit non-obvious to have this thing extend SparkLauncher but have no intention of ever being used to launch something.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25145313

--- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
+
+  List<String> buildCommand(Map<String, String> env) throws IOException;
--- End diff --

If it's going to stay as is, it would be good to document what it is expected to contain when invoked, and the fact that the callee will mutate it in place.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25145264

--- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
+
+  List<String> buildCommand(Map<String, String> env) throws IOException;
--- End diff --

Should this return a structure containing an environment and a list (maybe a small Command class or something)? I couldn't find any instances where env is expected to be non-empty when this function is invoked - that is, it seems like it's just used as a return value.
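A minimal sketch of the value-object alternative raised in this comment: `buildCommand()` would return a small `Command` bundling the argv list with the environment it needs, instead of mutating a caller-supplied map. The class name and accessors are hypothetical, not part of the launcher library.

```java
import java.util.List;
import java.util.Map;

final class Command {
    private final List<String> argv;
    private final Map<String, String> env;

    Command(List<String> argv, Map<String, String> env) {
        this.argv = List.copyOf(argv); // defensive, immutable copies
        this.env = Map.copyOf(env);
    }

    List<String> argv() { return argv; }
    Map<String, String> env() { return env; }
}

public class CommandSketch {
    public static void main(String[] args) {
        // A builder would construct and return this instead of filling in
        // a map passed by the caller.
        Command c = new Command(List.of("java", "-cp", "x.jar", "Main"),
                                Map.of("SPARK_HOME", "/opt/spark"));
        System.out.println(c.argv().size() + " " + c.env().get("SPARK_HOME"));
        // prints 4 /opt/spark
    }
}
```

Returning an immutable pair makes the data flow one-way and removes the in-place-mutation contract the review comment found non-obvious.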
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-74950927 We spoke a bit offline about this, but my feeling was that the best thing here might be to add a way to launch the shuffle service as a standalone application (initially, not one managed by Mesos) so that it can be shared across Spark applications. That would involve writing some simple launching scripts for it in a similar way to existing daemons we launch, and you'd ask users to launch the shuffle service similar to other storage systems like HDFS. That's very simple and would avoid diverging a lot between Mesos and the other modes. And longer term we could actually have a single shared shuffle service that is scheduled by mesos.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4221#issuecomment-74952993 Yeah our auto-close doesn't work on PRs into release branches like this.
[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4420#issuecomment-74946838 @mccheah @mingyukim yeah, there isn't an OOM-proof solution at all because these are all heuristics. Even checking every element is not OOM-proof, since memory estimation is itself a heuristic that involves sampling. My only concern with exposing knobs here is that users will expect us to support these going forward, even though we may want to refactor this in the future in a way where those knobs don't make sense anymore. It's reasonable that users would consider it a regression if their tuning of those knobs stopped working. So if possible, it would be good to adjust our heuristics to meet a wider range of use cases, and then if we keep hearing more issues we can expose knobs. We can't have them meet every possible use case, since they are heuristics, but in this case I was wondering if we could have a strict improvement to the heuristics. @andrewor14 can you comment on whether this is indeed a strict improvement? One of the main benefits of the new data frames API is that we will be able to have precise control over memory usage in a way that can avoid OOMs entirely. But for the current Spark API we are using this more ad-hoc memory estimation along with some heuristics. I'm not 100% against exposing knobs either, but I'd be interested to see if some simple improvements fix your use case.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-75008630 @vanzin about half way through reviewing... will pick up tomorrow