[GitHub] spark pull request: [HOTFIX] Some clean-up in shuffle code.
GitHub user pwendell opened a pull request: https://github.com/apache/spark/pull/5286 [HOTFIX] Some clean-up in shuffle code. Before diving into review #4450 I took a look through the existing shuffle code to learn how it works. Unfortunately, there are some very confusing things in this code. This patch makes a few small changes to simplify things. It is not easy to describe the changes concisely because of how convoluted the issues were, but they are logically fairly small: 1. There is a trait named `ShuffleBlockManager` that deals with only one logical function: retrieving shuffle block data given shuffle block coordinates. This trait has two implementors, `FileShuffleBlockManager` and `IndexShuffleBlockManager`. Confusingly, the vast majority of those implementations have nothing to do with this particular functionality, so I've renamed the trait to `ShuffleBlockResolver` and documented it. 2. The aforementioned trait had two almost identical methods, for no good reason. I removed one method (`getBytes`) and modified callers to use the other one. I believe the behavior is preserved in all cases. 3. The sort shuffle code uses the identifier 0 in the reduce slot of a BlockID as a placeholder. I made it a constant, since it needs to be consistent across multiple places. For (3) there is actually a better solution that would avoid the need for this kind of workaround in the first place, but it's more complex, so I'm punting on it for now. 
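The three changes above can be sketched in miniature. This is an illustrative toy, not the actual Spark code: the `ShuffleBlockId` fields and the `getBlockData` signature are assumptions made for the sake of the example.

```scala
// Toy sketch of the cleaned-up abstraction described above. The names follow
// the PR text, but the signatures are illustrative assumptions.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)

object IndexShuffleBlockResolver {
  // Change 3: the placeholder reduce ID becomes a named constant instead of
  // a bare 0 scattered across call sites.
  val NOOP_REDUCE_ID = 0
}

// Change 1: the trait is narrowed to its single logical responsibility,
// resolving block coordinates to block data.
trait ShuffleBlockResolver {
  def getBlockData(blockId: ShuffleBlockId): Array[Byte]
}

class IndexShuffleBlockResolver extends ShuffleBlockResolver {
  import IndexShuffleBlockResolver.NOOP_REDUCE_ID

  // Change 2: a single data-retrieval method; the near-duplicate getBytes
  // variant is removed and callers use this one.
  override def getBlockData(blockId: ShuffleBlockId): Array[Byte] = {
    // Stub body for illustration only.
    Array.emptyByteArray
  }

  // All places that previously wrote a literal 0 into the reduce slot now
  // share the constant.
  def dataBlockId(shuffleId: Int, mapId: Int): ShuffleBlockId =
    ShuffleBlockId(shuffleId, mapId, NOOP_REDUCE_ID)
}
```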
You can merge this pull request into a Git repository by running: $ git pull https://github.com/pwendell/spark cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5286.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5286 commit a4060797a83c591be288e8fca48affecb17412c6 Author: Patrick Wendell patr...@databricks.com Date: 2015-03-31T01:12:43Z [HOTFIX] Some clean-up in shuffle code. Before diving into review #4450 I took a look through the existing shuffle code. Unfortunately, there are some very confusing things in this code. This patch makes a few small changes to simplify things. It is not easy to describe the changes concisely because of how convoluted the issues were: 1. There was a trait named ShuffleBlockManager that deals with only one logical function: retrieving shuffle block data given shuffle block coordinates. This trait has two implementors, FileShuffleBlockManager and IndexShuffleBlockManager. Confusingly, the vast majority of those implementations have nothing to do with this particular functionality, so I've renamed the trait to ShuffleBlockResolver and documented it. 2. The aforementioned trait had two almost identical methods, for no good reason. I removed one method (getBytes) and modified callers to use the other one. I believe the behavior is preserved in all cases. 3. The sort shuffle code uses the identifier 0 in the reduce slot of a BlockID as a placeholder. I made it a constant, since it needs to be consistent across multiple places. For (3) there is actually a better solution that would avoid the need for this kind of workaround in the first place, but it's more complex, so I'm punting on it for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-6433 hive tests to import spark-sql test...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5119#discussion_r27537871 --- Diff: pom.xml --- @@ -1472,6 +1473,25 @@ <groupId>org.scalatest</groupId> <artifactId>scalatest-maven-plugin</artifactId> </plugin> + <!-- Build the JARs --> --- End diff -- Can this say something more useful? Maybe like: ``` <!-- Build test-jars for all projects, since some projects depend on tests from others --> ``` Otherwise it might be hard for someone to understand what is going on here.
[GitHub] spark pull request: SPARK-6433 hive tests to import spark-sql test...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5119#issuecomment-88300495 The overall approach LGTM, but I would suggest adding a better comment since it's non-obvious what is going on.
[GitHub] spark pull request: [SPARK-6627] Some clean-up in shuffle code.
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5286#discussion_r27496946 --- Diff: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockManager.scala --- @@ -39,25 +41,18 @@ import org.apache.spark.storage._ // Note: Changes to the format in this file should be kept in sync with // org.apache.spark.network.shuffle.StandaloneShuffleBlockManager#getSortBasedShuffleBlockData(). private[spark] -class IndexShuffleBlockManager(conf: SparkConf) extends ShuffleBlockManager { +class IndexShuffleBlockManager(conf: SparkConf) extends ShuffleBlockResolver { private lazy val blockManager = SparkEnv.get.blockManager private val transportConf = SparkTransportConf.fromSparkConf(conf) - /** - * Mapping to a single shuffleBlockId with reduce ID 0. - * */ - def consolidateId(shuffleId: Int, mapId: Int): ShuffleBlockId = { -ShuffleBlockId(shuffleId, mapId, 0) - } - def getDataFile(shuffleId: Int, mapId: Int): File = { -blockManager.diskBlockManager.getFile(ShuffleDataBlockId(shuffleId, mapId, 0)) +blockManager.diskBlockManager.getFile(ShuffleDataBlockId(shuffleId, mapId, NOOP_REDUCE_ID)) --- End diff -- Yeah - I think there is actually a third option which is just to make it so the BlockObjectWriter doesn't have to be passed a Block ID (it almost doesn't even need one). This way we totally avoid using the BlockID for the purpose of naming this file. I'm punting it to future work though.
[GitHub] spark pull request: [SPARK-6627] Some clean-up in shuffle code.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5286#issuecomment-88186520 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4450#discussion_r27423831 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ChainedBuffer.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.OutputStream + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.storage.BlockObjectWriter + +/** + * A logical byte buffer that wraps a list of byte arrays. All the byte arrays have equal size. The + * advantage of this over a standard ArrayBuffer is that it can grow without claiming large amounts + * of memory and needing to copy the full contents. + */ +private[spark] class ChainedBuffer(chunkSize: Int) { --- End diff -- Do we ever use the fact that we can do arbitrary offset reads?
[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4450#discussion_r27424659 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ChainedBuffer.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.OutputStream + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.storage.BlockObjectWriter + +/** + * A logical byte buffer that wraps a list of byte arrays. All the byte arrays have equal size. The + * advantage of this over a standard ArrayBuffer is that it can grow without claiming large amounts + * of memory and needing to copy the full contents. + */ +private[spark] class ChainedBuffer(chunkSize: Int) { + private val chunkSizeLog2 = (math.log(chunkSize) / math.log(2)).toInt + assert(math.pow(2, chunkSizeLog2).toInt == chunkSize) + private val chunks: ArrayBuffer[Array[Byte]] = new ArrayBuffer[Array[Byte]]() + private var _size: Int = _ + + /** + * Feed bytes from this buffer into a BlockObjectWriter. + * + * @param pos Offset in the buffer to read from. + * @param writer BlockObjectWriter to read into. + * @param len Number of bytes to read. 
+ */ + def read(pos: Int, writer: BlockObjectWriter, len: Int): Unit = { --- End diff -- It seems weird to me that this ChainedBuffer abstraction should be aware of a BlockObjectWriter - no concrete suggestion at this point (just getting started with this patch), but perhaps there is a clear way to separate the concerns here.
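Since the review comment above questions coupling the buffer to a BlockObjectWriter, here is a toy chunked buffer whose `read` targets a plain byte array instead. Everything in this sketch is an illustrative assumption, not the actual Spark class; it only demonstrates the power-of-two chunk-index arithmetic (`pos >> chunkSizeLog2`, `pos & (chunkSize - 1)`) that a ChainedBuffer-style structure relies on.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy chunked byte buffer: grows one fixed-size chunk at a time, so it never
// needs to copy the full contents when it expands.
class ChunkedBytes(chunkSize: Int) {
  require((chunkSize & (chunkSize - 1)) == 0, "chunkSize must be a power of 2")
  private val chunkSizeLog2 = java.lang.Integer.numberOfTrailingZeros(chunkSize)
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var _size = 0

  // Append bytes, allocating new chunks on demand.
  def write(bytes: Array[Byte]): Unit = {
    var written = 0
    while (written < bytes.length) {
      val chunkIdx = _size >> chunkSizeLog2          // which chunk
      val posInChunk = _size & (chunkSize - 1)       // offset within it
      if (chunkIdx == chunks.length) chunks += new Array[Byte](chunkSize)
      val toCopy = math.min(chunkSize - posInChunk, bytes.length - written)
      System.arraycopy(bytes, written, chunks(chunkIdx), posInChunk, toCopy)
      written += toCopy
      _size += toCopy
    }
  }

  // Arbitrary-offset read into a caller-supplied array, decoupled from any
  // particular writer class (the separation of concerns suggested above).
  def read(pos: Int, dest: Array[Byte], len: Int): Unit = {
    var copied = 0
    while (copied < len) {
      val chunkIdx = (pos + copied) >> chunkSizeLog2
      val posInChunk = (pos + copied) & (chunkSize - 1)
      val toCopy = math.min(chunkSize - posInChunk, len - copied)
      System.arraycopy(chunks(chunkIdx), posInChunk, dest, copied, toCopy)
      copied += toCopy
    }
  }

  def size: Int = _size
}
```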
[GitHub] spark pull request: [HOTFIX][SPARK-4123]: Updated to fix bug where...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5269#issuecomment-87804589 Thanks - pulling this in.
[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4450#discussion_r27424016 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ChainedBuffer.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import java.io.OutputStream + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.storage.BlockObjectWriter + +/** + * A logical byte buffer that wraps a list of byte arrays. All the byte arrays have equal size. The + * advantage of this over a standard ArrayBuffer is that it can grow without claiming large amounts --- End diff -- I would maybe say explicitly that the disadvantage is we can't view/return it as a single large array. Otherwise it's not clear why ArrayBuffer isn't just implemented this way.
[GitHub] spark pull request: [SPARK-6544][build] Increment Avro version fro...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5193#issuecomment-87030538 It's a fair point - in some cases we have made minor (patch) release updates to dependencies in our own patch versions. @srowen, what do you think? It can cause some friction for packagers if dependencies change in these patch releases, but this upgrade may be alright.
[GitHub] spark pull request: [SPARK-6544][build] Increment Avro version fro...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5193#issuecomment-87062417 Okay, after reviewing the changes to Avro I've decided to backport this. I looked a bit more, and in other cases we have done fairly small dependency updates like this, provided the changes in the dependency have been minimal between versions.
[GitHub] spark pull request: [SPARK-6119][SQL] DataFrame.dropna
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5225#discussion_r27328262 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -730,6 +731,23 @@ class DataFrame private[sql]( Generate(generator, join = true, outer = false, None, logicalPlan) } + /** + * Returns a new [[DataFrame]] that omits rows with `NULL` values in the specified columns. + * If no columns are specified, a row is omitted if any column contains `NULL`. + * + * @group dfops + */ + @scala.annotation.varargs + def dropna(cols: String*): DataFrame = { --- End diff -- Is it possible to do what Cheng said? I think dropNull is a much better name, but if we have an alias dropna, I think it would be good for pandas compatibility. In general, I think aliasing things may be necessary in the way we approach data frames, since we're trying to meet a few different requirements.
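The semantics being discussed can be illustrated with plain Scala collections rather than the DataFrame API. The function name and row representation below are assumptions made for the example, not the eventual Spark method:

```scala
// Toy model of drop-null semantics over rows represented as column -> value
// maps, with null marking a missing value.
def dropNull(rows: Seq[Map[String, Any]], cols: Seq[String] = Nil): Seq[Map[String, Any]] =
  rows.filter { row =>
    // If no columns are given, a row is dropped when any column is null;
    // otherwise only the named columns are checked.
    val checked = if (cols.isEmpty) row.keys.toSeq else cols
    checked.forall(c => row.get(c).exists(_ != null))
  }
```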
[GitHub] spark pull request: [spark] [SPARK-6168] Expose some of the collec...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5084#issuecomment-86835246 Yeah, I agree with Sean and what pretty much everyone else said. My feeling is that we don't want typical Spark users relying on unstable APIs, else it creates a lot of upgrade friction for users and/or pressure on maintainers to stabilize this stuff. We annotate things as unstable and expose them for two reasons: (a) to get feedback on new APIs we intend to stabilize shortly, and (b) to expose internal hooks that are meant for third-party integrations (similar to kernel modules), such as data sources, but not expected to be needed by end users. Exposing internal utilities to end users is, IMO, just not a good idea. I don't think slapping an annotation on something means there is no cost to doing this. Users will rely on this stuff and it will create upgrade friction, and this is why we are careful about what we expose.
[GitHub] spark pull request: [SPARK-6405] Limiting the maximum Kryo buffer ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5218#issuecomment-86834844 LGTM - merged into master.
[GitHub] spark pull request: SPARK-6549 - Spark console logger logs to stde...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5202#issuecomment-86810509 Changing the default logging behavior like this would break compatibility for people who rely on the out-of-the-box behavior. IIRC this change was proposed previously and rejected on the same grounds. Also, I'm not even convinced that stdout is the better default. The reason it's stderr is to isolate user output from Spark's own output. In any case, I think on compatibility grounds we can't merge this.
[GitHub] spark pull request: [SPARK-6544][build] Increment Avro version fro...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5193#issuecomment-86815673 Yeah, LGTM - I think bumping maintenance releases like this has usually been fine for us with Avro. It may be best not to merge this into 1.3, though, since we typically keep dependencies fixed in patch releases.
[GitHub] spark pull request: [SPARK-6554] [SQL] Don't push down predicates ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5210#discussion_r27252832 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala --- @@ -435,11 +435,18 @@ private[sql] case class ParquetRelation2( // Push down filters when possible. Notice that not all filters can be converted to Parquet // filter predicate. Here we try to convert each individual predicate and only collect those // convertible ones. -predicates - .flatMap(ParquetFilters.createFilter) - .reduceOption(FilterApi.and) - .filter(_ => sqlContext.conf.parquetFilterPushDown) - .foreach(ParquetInputFormat.setFilterPredicate(jobConf, _)) +if (sqlContext.conf.parquetFilterPushDown) { + predicates +// Don't push down predicates which reference partition columns +.filter { pred => + val partitionColNames = partitionColumns.map(_.name).toSet + val referencedColNames = pred.references.map(_.name).toSet + referencedColNames.intersect(partitionColNames).isEmpty +} +.flatMap(ParquetFilters.createFilter) --- End diff -- Yeah, Reynold's seems more clear to me.
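The guard in the diff above reduces to a set intersection: a predicate qualifies for pushdown only when it references no partition columns. A minimal standalone sketch (the example column names are made up):

```scala
// A predicate can be pushed down to Parquet only if none of the columns it
// references is a partition column; partition pruning handles those instead.
def canPushDown(referencedCols: Set[String], partitionCols: Set[String]): Boolean =
  referencedCols.intersect(partitionCols).isEmpty
```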
[GitHub] spark pull request: [SPARK-6477][Build]: Run MIMA tests before the...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5145#issuecomment-85371214 LGTM too
[GitHub] spark pull request: [SPARK-4123][Project Infra][WIP]: Show new dep...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5093#discussion_r26996263 --- Diff: dev/tests/pr_new_dependencies.sh --- @@ -0,0 +1,85 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the License); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# +# This script follows the base format for testing pull requests against +# another branch and returning results to be published. More details can be +# found at dev/run-tests-jenkins. +# +# Arg1: The Github Pull Request Actual Commit +#+ known as `ghprbActualCommit` in `run-tests-jenkins` +# Arg2: The SHA1 hash +#+ known as `sha1` in `run-tests-jenkins` +# + +ghprbActualCommit=$1 +sha1=$2 + +MVN_BIN=`pwd`/build/mvn +CURR_CP_FILE=my-classpath.txt +MASTER_CP_FILE=master-classpath.txt + +${MVN_BIN} clean compile dependency:build-classpath 2>/dev/null | \ --- End diff -- If that's the case, we should probably gate this entire thing in a check as to whether any pom.xml files are modified. Then for most builds this will not add any time, since most builds do not modify dependencies.
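The gating suggested in the review comment above might look something like the following sketch. The `master...` diff range and the echoed messages are assumptions about the CI setup, not code from the actual script:

```shell
#!/usr/bin/env bash
# Gate the (slow) dependency-classpath comparison on whether the PR actually
# touches a Maven build file; most builds don't, so they skip the check.

pom_changed() {
  # Reads a list of changed file paths on stdin; succeeds if any is a pom.xml.
  grep -q 'pom\.xml$'
}

if git diff --name-only master... 2>/dev/null | pom_changed; then
  echo "pom.xml modified; running dependency check."
  # ... expensive mvn dependency:build-classpath comparison would go here ...
else
  echo "No pom.xml changes; skipping dependency check."
fi
```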
[GitHub] spark pull request: [spark] [SPARK-6168] Expose some of the collec...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5084#issuecomment-84447923 I agree with Sean and Sandy here. I don't think we should just expose internal utilities like this. The collections classes are simple enough that someone can just copy our code if they want to use it. Exposing `Utilities` seems like a huge widening of visibility; those are designed as internal utilities. The reason we have @Experimental and @DeveloperApi is to provide public APIs we expect to eventually stabilize or that we want feedback on from the community. They still come with a reasonable expectation of stability and an implicit endorsement that we expect people to use them.
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84443666 Okay - if there is no big performance hit in launching from 1.3-1.4, that's fine with me. I guess then the remaining question is whether to make a smaller change (simply make the wildcard match more liberal) vs. doing a larger change.
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84216181 @nishkamravi2 - It's about the startup time of the interactive shell. That's the usability trade-off. If the shell takes 300 extra ms to start up, it is not acceptable from a usability perspective. This is not about performance for a long-running batch processing job.
[GitHub] spark pull request: SPARK-5134 [BUILD] Bump default Hadoop version...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5027#issuecomment-84201074 Looks good - thanks for committing this, Sean.
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84205836 Hey @vanzin - can you explain a bit why the performance impact isn't worrisome? I think a first-order goal was to avoid invoking multiple JVMs that have the assembly jar on the classpath, because it is slow.
[GitHub] spark pull request: [SPARK-6122][Core] Upgrade Tachyon client vers...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4867#issuecomment-84200925 LGTM
[GitHub] spark pull request: [SPARK-6406] Launcher backward compatibility i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5085#issuecomment-84214306 What is the intuitive explanation for why it is faster than before? It seems like launching the assembly twice would be slower. Separately, is it worth doing a big refactoring for this issue? If we want to support this, maybe we should just widen the regex. Or do you now feel the refactoring is an important design improvement, and it was simply instigated by this issue being reported? For instance, if a change makes starting spark-shell 100ms slower for people, IMO that's bad usability. Responsiveness for things like this can matter when people are forming an impression of the overall performance of a system, and we have components specifically designed for a quick interactive experience.
[GitHub] spark pull request: [SPARK-5847] Allow for namespacing metrics by ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4632#issuecomment-84209577 Hey, @JoshRosen can probably take a look at this. One thing though: IIRC the reason why we have a unique ID here is because some sets of users requested the exact opposite behavior from what you want: https://issues.apache.org/jira/browse/SPARK-3377 If we do decide to allow this, I think it's best to have a specific set of naming schemes (don't give them a latch to choose an arbitrary conf), in order to limit the surface area of possible configurations. I'm also not so sure about the right way to do this in terms of the current MetricsConfig infrastructure. It seems a bit weird to hard-code handling of this particular configuration in the MetricsConfig class. Only had time to take a quick look though, and I'm not familiar with that code at all.
[GitHub] spark pull request: [SPARK-4123][Project Infra][WIP]: Show new dep...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5093#discussion_r26882829 --- Diff: dev/tests/pr_new_dependencies.sh ---
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#
+# This script follows the base format for testing pull requests against
+# another branch and returning results to be published. More details can be
+# found at dev/run-tests-jenkins.
+#
+# Arg1: The Github Pull Request Actual Commit
+#+     known as `ghprbActualCommit` in `run-tests-jenkins`
+# Arg2: The SHA1 hash
+#+     known as `sha1` in `run-tests-jenkins`
+#
+
+ghprbActualCommit=$1
+sha1=$2
+
+CURR_CP_FILE=my-classpath.txt
+MASTER_CP_FILE=master-classpath.txt
+
+./build/mvn clean compile dependency:build-classpath | \
+  sed -n -e '/Building Spark Project Assembly/,$p' | \
+  grep --context=1 -m 2 "Dependencies classpath:" | \
+  head -n 3 | \
+  tail -n 1 | \
+  tr ":" "\n" | \
+  rev | \
+  cut -d "/" -f 1 | \
+  rev | \
+  sort > ${CURR_CP_FILE}
+
+# Checkout the master branch to compare against
+git checkout apache/master
+
+./build/mvn clean compile dependency:build-classpath | \
+  sed -n -e '/Building Spark Project Assembly/,$p' | \
+  grep --context=1 -m 2 "Dependencies classpath:" | \
+  head -n 3 | \
+  tail -n 1 | \
+  tr ":" "\n" | \
+  rev | \
+  cut -d "/" -f 1 | \
+  rev | \
+  sort > ${MASTER_CP_FILE}
+
+DIFF_RESULTS=`diff my-classpath.txt master-classpath.txt`
+
+if [ -z "${DIFF_RESULTS}" ]; then
+  echo " * This patch adds no new dependencies."
+else
+  # Pretty print the new dependencies
+  new_deps=$(echo "${DIFF_RESULTS}" | grep ">" | cut -d " " -f2 | awk '{print " * " $1}')
+  echo " * This patch **adds the following new dependencies:**\n${new_deps}"
--- End diff --
It would actually be good to show additions and removals of dependencies (in some cases a library will get a new version, which ends up here as an addition and removal). I updated the JIRA title to reflect that.
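The suggestion in the review comment above (report removals as well as additions) can be sketched with `comm` rather than `diff`. This is only an illustrative sketch, not the actual Jenkins test: the file names mirror the script, but the sample jar names are made up.

```shell
# Illustrative sketch: given two sorted classpath listings like the ones the
# script produces, report both added and removed dependencies. A version bump
# then shows up as one addition plus one removal. Sample contents are made up.
printf 'guava-14.0.jar\nspark-core.jar\n' > master-classpath.txt
printf 'guava-15.0.jar\nspark-core.jar\n' > my-classpath.txt

# comm expects sorted input; -13 prints lines unique to the second file
# (dependencies added by the PR), -23 lines unique to the first (removed).
added=$(comm -13 master-classpath.txt my-classpath.txt)
removed=$(comm -23 master-classpath.txt my-classpath.txt)

[ -n "$added" ]   && echo " * Added: $added"
[ -n "$removed" ] && echo " * Removed: $removed"
```

With the sample files above, `guava-15.0.jar` is reported as added and `guava-14.0.jar` as removed, which is exactly the "library got a new version" case the comment mentions.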
[GitHub] spark pull request: [SPARK-6371] [build] Update version to 1.4.0-S...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5056#issuecomment-84085364 LGTM - I did some searching around and I think this upgrades all the necessary places.
[GitHub] spark pull request: [SPARK-6275][Documentation]Miss toDF() functio...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4977#issuecomment-81428099 I've pushed this change to the website. http://spark.apache.org/docs/latest/sql-programming-guide.html
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-78219718 Okay cool - LGTM, I will pull this in. I just did some local sanity tests, built with Maven and ran (a few) Maven tests. We'll need to keep an eye on the Maven build tomorrow to see if this is succeeding there, since it's not caught by the PRB. Thanks @vanzin for all the work on this.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-78003405 @andrewor14 are you good with this one? I'd like to merge it soon!
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-77929101 Hey Marcelo, I think this is really close to being ready to merge. I noticed some extra output (a '0') when running usage on `spark-shell`:
```
./bin/spark-shell --help
0
Usage: ./bin/spark-shell [options]
Spark assembly has been built with Hive, including Datanucleus jars in classpath.
```
I think whether this should be an exposed API is still a question. However, I'm fine to merge v1 of this as an exposed API and get feedback. It would be helpful to hear from potential users whether this is useful to them... so far I haven't heard much in that direction.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26065985 --- Diff: launcher/src/main/java/org/apache/spark/launcher/Main.java ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Command line interface for the Spark launcher. Used internally by Spark scripts.
+ */
+class Main {
+
+  /**
+   * Usage: Main [class] [class args]
+   * <p/>
+   * This CLI works in two different modes:
+   * <ul>
+   *   <li>spark-submit: if <i>class</i> is org.apache.spark.deploy.SparkSubmit, the
+   *   {@link SparkLauncher} class is used to launch a Spark application.</li>
+   *   <li>spark-class: if another class is provided, an internal Spark class is run.</li>
+   * </ul>
+   *
+   * This class works in tandem with the bin/spark-class script on Unix-like systems, and
+   * bin/spark-class2.cmd batch script on Windows to execute the final command.
+   * <p/>
+   * On Unix-like systems, the output is a list of command arguments, separated by the NULL
+   * character.
On Windows, the output is single command line suitable for direct execution
+   * form the script.
--- End diff --
from
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26067056 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java ---
@@ -0,0 +1,327 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Special launcher for handling a CLI invocation of SparkSubmit.
+ * <p/>
+ * This launcher extends SparkLauncher to add command line parsing compatible with
--- End diff --
I'm not sure this comment is correct. I think a `SparkLauncher` encapsulates a `SparkSubmitCommandBuilder` rather than the latter extending the former.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26067102 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java ---
@@ -0,0 +1,327 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Special launcher for handling a CLI invocation of SparkSubmit.
--- End diff --
I don't think this is still a subtype of `Launcher`
[GitHub] spark pull request: [SPARK-5134] Bump default hadoop.version to 2....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3917#issuecomment-5142 Hey, I commented on the JIRA, but some recent changes in the way we publish artifacts actually make this a more tenable change. https://issues.apache.org/jira/browse/SPARK-5134
[GitHub] spark pull request: [SPARK-6200] [SQL] Add a manager for dialects
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4939#issuecomment-77736075 It might be simpler to just accept a dialect via a configuration option. Is the idea here that someone might switch dialects multiple times in a single session?
[GitHub] spark pull request: SPARK-6182 [BUILD] spark-parent pom needs to b...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4912#issuecomment-77429640 I tested this locally and it worked well. I'd like to pull this in to do further testing.
[GitHub] spark pull request: SPARK-6182 [BUILD] spark-parent pom needs to b...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4913#issuecomment-77429517 I don't think this works in its current form because the artifact names are laid out in the parent pom and those can still only have a single version. The only way I think we could preserve having a single parent pom is to simply not declare any scala dependencies in the parent pom and only declare them inline in each child pom, which is generally bad form. So given that #4912 is simpler, I'm inclined to just pull that one in.
[GitHub] spark pull request: [SPARK-6149] [SQL] [Build] Excludes Guava 15 r...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4890#issuecomment-77203459 @liancheng can you put this exclusion in the pom.xml file instead of in sbt? If you look there we already have other exclusions.
[GitHub] spark pull request: SPARK-5143 [BUILD] [WIP] spark-network-yarn 2....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4876#issuecomment-77246969 I commented on the JIRA. This LGTM as an immediate fix - clearly the property is not correct in the published 2.11 poms.
[GitHub] spark pull request: [SPARK-6149] [SQL] [Build] Excludes Guava 15 r...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4890#issuecomment-77306302 I'm going to optimistically merge this to cut another RC.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25647202 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -184,7 +184,7 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
     val entry = entries.remove(blockId)
     if (entry != null) {
       currentMemory -= entry.size
-      logInfo(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
+      logDebug(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
--- End diff --
I believe we do have INFO level logging for this up the call chain when drops are blocked due to cache contention: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1004 Might be nice to augment that logging to have information on the size and limit (like this does).
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25646752 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -184,7 +184,7 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
     val entry = entries.remove(blockId)
     if (entry != null) {
       currentMemory -= entry.size
-      logInfo(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
+      logDebug(s"Block $blockId of size ${entry.size} dropped from memory (free $freeMemory)")
--- End diff --
On this one - do you know if this already gets logged somewhere else if a block is dropped from memory due to contention? It would be good to make sure there is some INFO level logging when a block is dropped due to memory being exceeded.
[GitHub] spark pull request: SPARK-5390 [DOCS] Encourage users to post on S...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4843#discussion_r25646936 --- Diff: docs/index.md ---
@@ -115,6 +115,8 @@ options for deployment:
 * [Spark Homepage](http://spark.apache.org)
 * [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
+* [Spark Community](http://spark.apache.org/community.html) resources, including local meetups
--- End diff --
Ah I see - sounds good!
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25648384 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -574,6 +583,11 @@ private[spark] object JsonProtocol {
     SparkListenerExecutorRemoved(time, executorId, reason)
   }
 
+  def logStartFromJson(json: JValue): SparkListenerLogStart = {
+    val version = (json \ "Spark Version").extract[String]
--- End diff --
minor - but maybe call this `sparkVersion` so it's clear that this isn't a version for the logging format (since that doesn't have its own versioning).
[GitHub] spark pull request: SPARK-5390 [DOCS] Encourage users to post on S...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4843#discussion_r25646565 --- Diff: docs/index.md ---
@@ -115,6 +115,8 @@ options for deployment:
 * [Spark Homepage](http://spark.apache.org)
 * [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
+* [Spark Community](http://spark.apache.org/community.html) resources, including local meetups
--- End diff --
minor - but could this say "including mailing lists and local meetups"? I think it would be good to still mention the mailing list here, but moving this dropped that text.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4821#issuecomment-76855580 Okay I took a close look through this and it LGTM
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4799#issuecomment-76862763 Okay - thanks everyone for helping with this. I'll pull it in.
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4799#discussion_r25582753 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -92,6 +92,16 @@ private[spark] class Executor(
   private val executorActor = env.actorSystem.actorOf(
     Props(new ExecutorActor(executorId)), "ExecutorActor")
 
+  // Whether to load classes in user jars before those in Spark jars
+  private val userClassPathFirst: Boolean = {
+    val oldKey = "spark.files.userClassPathFirst"
+    val newKey = "spark.executor.userClassPathFirst"
+    if (conf.contains(oldKey)) {
+      logWarning(s"$oldKey is deprecated. Please use $newKey instead.")
--- End diff --
Log it on the driver side during validation, or at some earlier point. I just think having it here isn't super useful.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4821#issuecomment-76782633 @vanzin so just to be clear - you are anticipating a future case where we need to read the version to correctly parse the logs? Is that the argument for it? I am only making a simple observation that here we write something and never read it.
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4799#discussion_r25573606 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -92,6 +92,16 @@ private[spark] class Executor( private val executorActor = env.actorSystem.actorOf( Props(new ExecutorActor(executorId)), ExecutorActor) + // Whether to load classes in user jars before those in Spark jars + private val userClassPathFirst: Boolean = { +val oldKey = "spark.files.userClassPathFirst" +val newKey = "spark.executor.userClassPathFirst" +if (conf.contains(oldKey)) { + logWarning(s"$oldKey is deprecated. Please use $newKey instead.") --- End diff -- This will log it on the executors, which seems weird since all other configuration warnings are logged at the driver or submission client. Can we just log this warning on the driver side? Or just not log a deprecation warning at all?
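The driver-side alternative suggested in this review could be sketched roughly as below. The helper object, its name, and the key map are illustrative assumptions, not Spark's actual validation code:

```scala
// Hypothetical driver-side validation helper: warn once about deprecated
// keys when the configuration is validated on the driver, instead of
// logging the same warning on every executor.
object ConfValidation {
  // Illustrative mapping of deprecated keys to their replacements.
  private val deprecatedKeys = Map(
    "spark.files.userClassPathFirst" -> "spark.executor.userClassPathFirst")

  def warnDeprecated(conf: Map[String, String], logWarning: String => Unit): Unit = {
    for ((oldKey, newKey) <- deprecatedKeys if conf.contains(oldKey)) {
      logWarning(s"$oldKey is deprecated. Please use $newKey instead.")
    }
  }
}
```

Centralizing the warning this way also keeps executor logs free of configuration noise, which is the concern raised in the comment.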
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25573801 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends Logging { /** * Write metadata about the event log to the given stream. * - * The header is a serialized version of a map, except it does not use Java serialization to - * avoid incompatibilities between different JDKs. It writes one map entry per line, in - * key=value format. - * - * The very last entry in the header is the `HEADER_END_MARKER` marker, so that the parsing code - * can know when to stop. - * - * The format needs to be kept in sync with the openEventLog() method below. Also, it cannot - * change in new Spark versions without some other way of detecting the change (like some - * metadata encoded in the file name). + * The header is a single line of JSON in the beginning of the file. Note that this + * assumes all metadata necessary to parse the log is also included in the file name. + * The format needs to be kept in sync with the `openEventLog()` method below. Also, it + * cannot change in new Spark versions without some other way of detecting the change. * - * @param logStream Raw output stream to the even log file. + * @param logStream Raw output stream to the event log file. * @param compressionCodec Optional compression codec to use. - * @return A stream where to write event log data. This may be a wrapper around the original + * @return A stream to which event log data is written. This may be a wrapper around the original * stream (for example, when compression is enabled). 
*/ def initEventLog( logStream: OutputStream, compressionCodec: Option[CompressionCodec]): OutputStream = { -val meta = mutable.HashMap(("version" -> SPARK_VERSION)) +val metadata = new mutable.HashMap[String, String] +// Some of these metadata are already encoded in the file name +// Here we include them again within the file itself for completeness +metadata += ("Event" -> Utils.getFormattedClassName(SparkListenerMetadataIdentifier)) +metadata += (SPARK_VERSION_KEY -> SPARK_VERSION) compressionCodec.foreach { codec => - meta += ("compressionCodec" -> codec.getClass().getName()) + metadata += (COMPRESSION_CODEC_KEY -> codec.getClass.getCanonicalName) } - -def write(entry: String) = { - val bytes = entry.getBytes(Charsets.UTF_8) - if (bytes.length > MAX_HEADER_LINE_LENGTH) { -throw new IOException(s"Header entry too long: ${entry}") - } - logStream.write(bytes, 0, bytes.length) +val metadataJson = compact(render(JsonProtocol.mapToJson(metadata))) +val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8) +if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) { + throw new IOException(s"Event log metadata too long: $metadataJson") } - -meta.foreach { case (k, v) => write(s"$k=$v\n") } -write(s"$HEADER_END_MARKER\n") - compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream) +logStream.write(metadataBytes, 0, metadataBytes.length) +logStream } /** * Return a file-system-safe path to the log file for the given application. * + * Note that because we currently only create a single log file for each application, + * we must encode all the information needed to parse this event log in the file name + * instead of within the file itself. Otherwise, if the file is compressed, for instance, + * we won't know which codec to use to decompress the metadata. + * * @param logBaseDir Directory where the log file will be written. * @param appId A unique app ID. 
+ * @param compressionCodecName Name of the compression codec used to compress the contents + * of the log, or None if compression is not enabled. * @return A path which consists of file-system-safe characters. */ - def getLogPath(logBaseDir: String, appId: String): String = { -val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase -Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/") + def getLogPath( + logBaseDir: String, + appId: String, + compressionCodecName: Option[String] = None): String = { +val sanitizedAppId = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase +// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1 +// e.g. EVENT_LOG_
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25573931 --- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala --- @@ -23,7 +23,9 @@ private[spark] class ApplicationDescription( val memoryPerSlave: Int, val command: Command, var appUiUrl: String, -val eventLogDir: Option[String] = None) +val sparkVersion: String, --- End diff -- What about just passing a single option `eventLogFile` here (i.e. the full path to the log file)? The master only uses the eventLogDir, sparkVersion, and eventLogCodec to reconstruct the filename. Reconstructing it in two places seems a little brittle.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25573776 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends Logging { /** * Write metadata about the event log to the given stream. * - * The header is a serialized version of a map, except it does not use Java serialization to - * avoid incompatibilities between different JDKs. It writes one map entry per line, in - * key=value format. - * - * The very last entry in the header is the `HEADER_END_MARKER` marker, so that the parsing code - * can know when to stop. - * - * The format needs to be kept in sync with the openEventLog() method below. Also, it cannot - * change in new Spark versions without some other way of detecting the change (like some - * metadata encoded in the file name). + * The header is a single line of JSON in the beginning of the file. Note that this + * assumes all metadata necessary to parse the log is also included in the file name. + * The format needs to be kept in sync with the `openEventLog()` method below. Also, it + * cannot change in new Spark versions without some other way of detecting the change. * - * @param logStream Raw output stream to the even log file. + * @param logStream Raw output stream to the event log file. * @param compressionCodec Optional compression codec to use. - * @return A stream where to write event log data. This may be a wrapper around the original + * @return A stream to which event log data is written. This may be a wrapper around the original * stream (for example, when compression is enabled). 
*/ def initEventLog( logStream: OutputStream, compressionCodec: Option[CompressionCodec]): OutputStream = { -val meta = mutable.HashMap(("version" -> SPARK_VERSION)) +val metadata = new mutable.HashMap[String, String] +// Some of these metadata are already encoded in the file name +// Here we include them again within the file itself for completeness +metadata += ("Event" -> Utils.getFormattedClassName(SparkListenerMetadataIdentifier)) +metadata += (SPARK_VERSION_KEY -> SPARK_VERSION) compressionCodec.foreach { codec => - meta += ("compressionCodec" -> codec.getClass().getName()) + metadata += (COMPRESSION_CODEC_KEY -> codec.getClass.getCanonicalName) } - -def write(entry: String) = { - val bytes = entry.getBytes(Charsets.UTF_8) - if (bytes.length > MAX_HEADER_LINE_LENGTH) { -throw new IOException(s"Header entry too long: ${entry}") - } - logStream.write(bytes, 0, bytes.length) +val metadataJson = compact(render(JsonProtocol.mapToJson(metadata))) +val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8) +if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) { + throw new IOException(s"Event log metadata too long: $metadataJson") } - -meta.foreach { case (k, v) => write(s"$k=$v\n") } -write(s"$HEADER_END_MARKER\n") - compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream) +logStream.write(metadataBytes, 0, metadataBytes.length) +logStream } /** * Return a file-system-safe path to the log file for the given application. * + * Note that because we currently only create a single log file for each application, + * we must encode all the information needed to parse this event log in the file name + * instead of within the file itself. Otherwise, if the file is compressed, for instance, + * we won't know which codec to use to decompress the metadata. + * * @param logBaseDir Directory where the log file will be written. * @param appId A unique app ID. 
+ * @param compressionCodecName Name of the compression codec used to compress the contents + * of the log, or None if compression is not enabled. * @return A path which consists of file-system-safe characters. */ - def getLogPath(logBaseDir: String, appId: String): String = { -val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase -Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/") + def getLogPath( + logBaseDir: String, + appId: String, + compressionCodecName: Option[String] = None): String = { +val sanitizedAppId = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase +// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1 --- End diff -- I don't see where the `EVENT_LOG` prefix is actually
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25574579 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1074,7 +1074,7 @@ private[spark] class BlockManager( * Remove all blocks belonging to the given broadcast. */ def removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int = { -logInfo(s"Removing broadcast $broadcastId") --- End diff -- Yeah that one is sufficient then. I looked earlier, but only at the `doDestroy` implementations, which did not have info-level logging.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25574126 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -99,6 +99,12 @@ case class SparkListenerExecutorRemoved(time: Long, executorId: String, reason: extends SparkListenerEvent /** + * A special dummy event used to identify the metadata header in event logs. + * This is not actually posted anywhere. + */ +private[spark] case object SparkListenerMetadataIdentifier extends SparkListenerEvent --- End diff -- Since this is never used, why don't we instead just ignore this event type when reading from JSON (i.e. allow sparkEventFromJson to return None). It seems a bit strange to add this event but it contains nothing that's ever used.
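The suggested alternative can be sketched as below. Note this is a loose sketch: the event types, the JSON shape, and the `Option`-valued signature are the proposed change being discussed, not Spark's actual `sparkEventFromJson` API:

```scala
// Sketch: an Option-valued parser that skips event types it does not
// recognize (such as a metadata header line), so no dummy
// SparkListenerEvent subtype is needed just to mark the header.
sealed trait SparkListenerEvent
case class SparkListenerAppStart(appName: String) extends SparkListenerEvent

def sparkEventFromJson(json: Map[String, String]): Option[SparkListenerEvent] =
  json.get("Event") match {
    case Some("SparkListenerApplicationStart") =>
      Some(SparkListenerAppStart(json.getOrElse("App Name", "")))
    case _ =>
      None  // unknown event (e.g. the metadata header): ignore it
  }
```

The replay loop would then simply drop `None` results instead of matching on a placeholder case object.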
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4821#issuecomment-76641955 I took a pass on this with some feedback. Overall, it would be good to really minimize the scope of the changes since this is so late in the game. There is some clean-up and renaming etc that would be best just left out of the patch. The main thing I'm wondering is why we need this header at all. It doesn't even ever get used by our own replay - we just ignore it. It seems like it was added for the purpose of conveying the compression codec to bootstrap replaying the file, however just having an extension seems like a better, much more standard way of doing that. The only argument I see for it is that the header could be used in the future to encode things that are necessary for proper replay of the logs. However, in that case I don't see why we can't just add it later if and when those things occur. I guess I don't see a good argument against a straw man of just not having the header. Curious to hear thoughts from @andrewor14 and @vanzin.
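The extension-based straw man can be sketched as follows. The codec class names and the extension mapping here are illustrative assumptions, not the mapping Spark actually ships:

```scala
// Sketch: derive the compression codec from the log file's extension
// rather than from an in-file header, so the file can be opened without
// first reading (possibly compressed) metadata.
val codecByExtension: Map[String, String] = Map(
  "lz4"    -> "org.apache.spark.io.LZ4CompressionCodec",
  "snappy" -> "org.apache.spark.io.SnappyCompressionCodec")

def codecFor(logPath: String): Option[String] = {
  val ext = logPath.split('.').last
  codecByExtension.get(ext)  // None means the log is treated as uncompressed
}
```

This is the "standard way" the comment alludes to: like `.gz` or `.snappy` suffixes on Hadoop output files, the name alone tells the reader how to open the stream.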
[GitHub] spark pull request: [SPARK-6050] [yarn] Add config option to do la...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4818#issuecomment-76659139 @vanzin okay so maybe set this to true then? I don't have any opinion, but would love to get this in as it's one of the only release blockers.
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25579888 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis * --- End diff -- This doc might be better if it said "in Spark 1.2.X and earlier" instead of "in previous releases".
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25579899 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis * * @return 2-tuple of (input stream of the events, version of Spark which wrote the log) --- End diff -- This is outdated now
[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4821#discussion_r25579955 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis * --- End diff -- Actually I guess this handles multiple versions. My bad, fine to leave as is
[GitHub] spark pull request: [SPARK-6074] [sql] Package pyspark sql binding...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4822#issuecomment-76563416 Good catch @vanzin. This LGTM. I did some testing to verify that the assembly includes all relevant python files now: ``` $ jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar | grep \\.py$ | grep pyspark | wc 62 62 1480 $ find ./python/pyspark/ | grep \\.py$ | wc 62 62 2100 ```
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4799#issuecomment-76564115 I agree @andrewor14 can you add documentation about deprecated configs? I would extend what's there now: ``` Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key. ```
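The precedence rule described in that documentation snippet amounts to something like the following sketch (a standalone illustration, not SparkConf's implementation; the key names are the ones under discussion in this thread):

```scala
// Sketch: when both the renamed key and its deprecated predecessor are
// set, the newer key wins; the deprecated key is only a fallback.
def resolve(conf: Map[String, String], newKey: String, oldKey: String): Option[String] =
  conf.get(newKey).orElse(conf.get(oldKey))

val conf = Map(
  "spark.executor.userClassPathFirst" -> "true",   // new key
  "spark.files.userClassPathFirst"    -> "false")  // deprecated key

// The new key's value takes precedence over the deprecated one.
resolve(conf, "spark.executor.userClassPathFirst", "spark.files.userClassPathFirst")
```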
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563232 --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala --- @@ -188,10 +188,10 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging { /** Perform broadcast cleanup. */ def doCleanupBroadcast(broadcastId: Long, blocking: Boolean) { try { - logDebug("Cleaning broadcast " + broadcastId) + logDebug(s"Cleaning broadcast $broadcastId") --- End diff -- I'm +1 on removing these cleaning ones. It's really not useful or actionable.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4838#issuecomment-76563715 Chimed in in a few places. I think overall, a good goal is that when we are doing our normal GC of RDDs and broadcasts, we don't want to be so verbose. This cleaning occurs asynchronously and it's confusing for users to see these messages. When dropping blocks due to cache contention, though, I'm not so sure we want to silence these messages.
[GitHub] spark pull request: [SPARK-6055] [PySpark] fix incorrect DataType....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4809#issuecomment-76564137 @davies can you close this? (auto close doesn't work for the backport commits).
[GitHub] spark pull request: [SPARK-6055] [PySpark] fix incorrect DataType....
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4810#issuecomment-76564149 @davies can you close this? (auto close doesn't work for the backport commits).
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563229 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -371,7 +371,7 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long) private def ensureFreeSpace( blockIdToAdd: BlockId, space: Long): ResultWithDroppedBlocks = { -logInfo(s"ensureFreeSpace($space) called with curMem=$currentMemory, maxMem=$maxMemory") +logDebug(s"ensureFreeSpace($space) called with curMem=$currentMemory, maxMem=$maxMemory") --- End diff -- This one I'd say is fairly important. I've used it somewhat often in production clusters to understand when things are dropping to disk and why.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563255 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1074,7 +1074,7 @@ private[spark] class BlockManager( * Remove all blocks belonging to the given broadcast. */ def removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int = { -logInfo(s"Removing broadcast $broadcastId") --- End diff -- It would be good to make sure that when the user explicitly `destroy()`s a broadcast, we do have some INFO level logging that we are doing it. I think that can be achieved by dropping this one to debug and logging things higher up in the call stack in the destroy method of either broadcast implementation.
[GitHub] spark pull request: SPARK-3357 [CORE] Internal log messages should...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4838#discussion_r25563263 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala --- @@ -476,16 +476,16 @@ private[spark] class BlockManagerInfo( val blockStatus: BlockStatus = _blocks.get(blockId) _blocks.remove(blockId) if (blockStatus.storageLevel.useMemory) { -logInfo("Removed %s on %s in memory (size: %s, free: %s)".format( +logDebug("Removed %s on %s in memory (size: %s, free: %s)".format( --- End diff -- These ones also seem useful in some cases for reasoning about dropping blocks due to cache contention.
[GitHub] spark pull request: [SPARK-6070] [yarn] Remove unneeded classes fr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4820#issuecomment-76514114 Thanks Marcelo, pulling this in!
[GitHub] spark pull request: [SPARK-5979][SPARK-6032] Smaller safer --packa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4802#issuecomment-76514538 Pulling this in - thanks Burak!
[GitHub] spark pull request: [SPARK-5979][SPARK-6031][SPARK-6032][SPARK-604...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4754#issuecomment-76514573 @brkyvz let's close this issue for now and keep it in our back pocket. We can use it if we decide to put this in the 1.3 branch down the line.
[GitHub] spark pull request: SPARK-6063
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4815#issuecomment-76450569 LGTM
[GitHub] spark pull request: [SPARK-6070] [yarn] Remove unneeded classes fr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4820#issuecomment-76492395 Thanks for finding this! Either approach seems fine to me. This LGTM... it might be nice to add a comment explaining that this is needed because the root pom is declaring them at compile scope.
[GitHub] spark pull request: [SPARK-6070] [yarn] Remove unneeded classes fr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4820#issuecomment-76498092 Cool LGTM
[GitHub] spark pull request: [SPARK-6050] [yarn] Add config option to do la...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4818#issuecomment-76502508 It seems reasonable to me to have the default of false and make a comment in the release notes. No strong feelings here though.
[GitHub] spark pull request: [SPARK-5979][SPARK-6032] Smaller safer fix
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4802#issuecomment-76352327 LGTM
[GitHub] spark pull request: SPARK-4579 [WEBUI] Scheduling Delay appears ne...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4796#issuecomment-76301218 Jenkins, test this please.
[GitHub] spark pull request: SPARK-4579 [WEBUI] Scheduling Delay appears ne...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4796#issuecomment-76301244 Yeah this would be great to pull into an RC... a very confusing presentation in the UI right now!
[GitHub] spark pull request: [SPARK-6048] SparkConf should not translate de...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4799#issuecomment-76320547 @vanzin does this look okay to you? I commented on the JIRA, but my main goal is to find a surgical patch here that can unblock the release, so just rewinding the change seems the safest.
[GitHub] spark pull request: [SPARK-5979][SPARK-6031][SPARK-6032][SPARK-604...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4754#discussion_r25482866

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala ---
@@ -82,13 +85,25 @@ private[spark] object SparkSubmitDriverBootstrapper {
       .orElse(confDriverMemory)
       .getOrElse(defaultDriverMemory)

-    val newClasspath =
+    var newClasspath =
       if (submitClasspath.isDefined) {
         classpath
       } else {
         classpath + confClasspath.map(sys.props("path.separator") + _).getOrElse("")
       }

+    // Resolve maven dependencies if there are any and add them to classpath. Also send them
+    // to SparkSubmit so that they can be shipped to executors.
+    val resolvedMavenCoordinates =
+      SparkSubmitUtils.resolveMavenCoordinates(
+        submitPackages, submitRepositories, confIvyRepo)
+    if (resolvedMavenCoordinates.nonEmpty) {
+      newClasspath += sys.props("path.separator") +
+        resolvedMavenCoordinates.mkString(sys.props("path.separator"))
+      submitArgs =
+        Array("--packages-resolved", resolvedMavenCoordinates.mkString(",")) ++ submitArgs
--- End diff --

Can we thread this through using an environment variable `_PACKAGES_RESOLVED`? Having this as an extra flag forces you to make `args` here mutable, which is sort of strange.
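A hypothetical sketch (not the actual patch) of the alternative suggested above: thread the resolved maven coordinates to the child process through an environment variable rather than prepending an extra `--packages-resolved` flag, so the argument array can stay immutable. The variable name `_PACKAGES_RESOLVED` and both helper methods are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EnvThreadingSketch {
    static final String PACKAGES_RESOLVED_ENV = "_PACKAGES_RESOLVED"; // assumed name

    // Parent side: put the resolved jar paths into the child's environment.
    static Map<String, String> childEnv(List<String> resolvedJars) {
        Map<String, String> env = new HashMap<>();
        env.put(PACKAGES_RESOLVED_ENV, String.join(",", resolvedJars));
        return env;
    }

    // Child side: read the list back without ever touching argv.
    static List<String> readResolved(Map<String, String> env) {
        String value = env.getOrDefault(PACKAGES_RESOLVED_ENV, "");
        return value.isEmpty() ? List.of() : Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        Map<String, String> env = childEnv(List.of("a.jar", "b.jar"));
        System.out.println(readResolved(env)); // prints [a.jar, b.jar]
    }
}
```

The child-side reader takes a map argument purely so it can be exercised without forking a real process; production code would read `System.getenv()`.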
[GitHub] spark pull request: [SPARK-5982] Remove incorrect Local Read Time ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4749#issuecomment-76054412 LGTM - thanks Kay!
[GitHub] spark pull request: [SPARK-5914] to run spark-submit requiring onl...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4742#discussion_r25237061

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -439,7 +439,14 @@ private[spark] object Utils extends Logging {
       executeAndGetOutput(Seq("tar", "-xf", fileName), targetDir)
     }
     // Make the file executable - That's necessary for scripts
-    FileUtil.chmod(targetFile.getAbsolutePath, "a+x")
+    if(isWindows){
--- End diff --

mind adding a space after `if`?
[GitHub] spark pull request: [SPARK-5914] to run spark-submit requiring onl...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4742#issuecomment-75720961 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-5979] Made --package exclusions more re...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4754#issuecomment-75911222 LGTM
[GitHub] spark pull request: [SPARK-5158] [core] [security] Spark standalon...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4106#issuecomment-75861282 The model I had in mind for this patch was to support dedicated clusters/appliances based on Spark where the Spark cluster itself is fully trusted and not multi-tenant. @harishreedharan - trying to have secured key distribution in standalone mode is a very large undertaking; this is intentionally designed to avoid that in order to support a more limited security model. Trying to support arbitrary user code that you don't trust (within the cluster) is also not the intention - there aren't arbitrary users, it's a single embedded application. We've had several requests from users for some simpler security mechanism in standalone mode. The main concern I see here is that the KDC can't handle frequent authentication requests of the same principal from multiple hosts. That might render this approach untenable. It would be worth testing and understanding the limitations in that regard.
[GitHub] spark pull request: [SPARK-5993][Streaming][Build] Fix assembly ja...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4753#issuecomment-75895962 LGTM - I tested this approach locally and it worked.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25148046

--- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
--- End diff --

IMO - it might be simpler to just not have this interface. It's confusing because a lot of the heavy lifting for building commands is in SparkLauncher, but that class doesn't actually implement this. Also, you never interact with `CommandBuilder`s in a generic way; you only use it in a case where you already have a branch distinguishing the only two possible implementations. In cases such as this, I'm not sure the code is simpler to understand when there is an interface.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25145503

--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -0,0 +1,684 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileFilter;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.jar.JarFile;
+import java.util.regex.Pattern;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Launcher for Spark applications.
+ * <p/>
+ * Use this class to start Spark applications programmatically. The class uses a builder pattern
+ * to allow clients to configure the Spark application and launch it as a child process.
+ * <p/>
+ * Note that launching Spark applications using this class will not automatically load environment
+ * variables from the spark-env.sh or spark-env.cmd scripts in the configuration directory.
+ */
+public class SparkLauncher {
+
+  /** The Spark master. */
+  public static final String SPARK_MASTER = "spark.master";
+
+  /** Configuration key for the driver memory. */
+  public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver class path. */
+  public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
+  /** Configuration key for the driver VM options. */
+  public static final String DRIVER_EXTRA_JAVA_OPTIONS = "spark.driver.extraJavaOptions";
+  /** Configuration key for the driver native library path. */
+  public static final String DRIVER_EXTRA_LIBRARY_PATH = "spark.driver.extraLibraryPath";
+
+  /** Configuration key for the executor memory. */
+  public static final String EXECUTOR_MEMORY = "spark.executor.memory";
--- End diff --

I'm not sure any of these `EXECUTOR_XX` constants are ever used. Are they intended as convenience constants for users?
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-75498418 Hey @vanzin okay took another quite long look (sorry, was delayed this week due to Strata) and I have embarrassingly few useful comments given how long I looked at it. Overall, the changes you made all seem great. Unfortunately, since spark-submit as it stands before your patch is a fairly wide interface, it's difficult for me to conceive of every possible combination of settings/environments/etc. that could be broken by this. I think we'll just have to merge it and hope it gets a substantial amount of user testing. The main thing left for me is the proposal to expose this as a public interface to Spark for application developers. The main issue I see is that it just won't be backwards compatible as it stands now, and this is something where I think compatibility is pretty important. One main use case I see here is that some third-party application wants a nice way of submitting Spark applications to user clusters that already have Spark installed, so they bundle this up with their app. At least, that's the main reason I've heard users directly ask for this feature. If they have to grab a newer version of this for every different Spark version users might have installed, that's a big pain for them. And it's the exact opposite of the API guarantees in the rest of Spark. And there is an alternative that _is_ backwards compatible, which is for them to just script around `spark-submit`. So users are in a weird place where there are two options and neither is strictly optimal. To me that seems like a blocker to exposing it in the current form without exploring more options. If we want to make this backwards compatible, we'd need to minimize the interface between the publicly exposed library and the supplied Spark distribution. If the interface from the user app is that we fork a subprocess when they call launch(), could that subprocess just invoke the spark-submit script in the distribution, which will then itself compute the classpath, etc.? We'd just have to detect whether to invoke the Windows or Unix version, but that seems pretty doable. Could that work? As a point of process, I'd be happy to explore various options for making this work by just merging a version where this is private and then building on top of that, to avoid staleness and allow for ample testing. I really do think it's a good idea to have a programmatic version of spark-submit.
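A rough sketch of the backwards-compatible option floated above: `launch()` forks the spark-submit script shipped with whatever Spark distribution the user has installed (which then computes the classpath itself), picking the Windows or Unix variant of the script. The `sparkHome` handling and argument passing here are illustrative assumptions, not the launcher library's real API.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ScriptLauncherSketch {
    static boolean isWindows() {
        return System.getProperty("os.name").toLowerCase().contains("win");
    }

    // Build the argv that would be handed to a ProcessBuilder; the installed
    // script, not this library, resolves the classpath for its own version.
    static List<String> submitCommand(String sparkHome, List<String> appArgs) {
        String script = isWindows() ? "spark-submit.cmd" : "spark-submit";
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkHome + File.separator + "bin" + File.separator + script);
        cmd.addAll(appArgs);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = submitCommand("/opt/spark", List.of("--master", "local", "app.jar"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Because the only contract between the library and the distribution is the script's command-line interface, a single library version can drive many installed Spark versions.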
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25147958

--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Command builder for internal Spark classes.
+ * <p/>
+ * This class handles building the command to launch all internal Spark classes except for
+ * SparkSubmit (which is handled by the {@link SparkSubmitCommandBuilder} class).
+ */
+class SparkClassCommandBuilder extends SparkLauncher implements CommandBuilder {
--- End diff --

It might be worth noting here that this never actually uses `launch` and instead it just uses the shared plumbing for building commands. It's a bit non-obvious to have this thing extend SparkLauncher but have no intention of ever being used to launch something.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25145313

--- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
+
+  List<String> buildCommand(Map<String, String> env) throws IOException;
--- End diff --

If it's going to stay as is, it would be good to document what it is expected to contain when invoked, and the fact that the callee will mutate it in place.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25145264

--- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
+
+  List<String> buildCommand(Map<String, String> env) throws IOException;
--- End diff --

Should this return a structure containing an environment and a list (maybe a small Command class or something)? I couldn't find any instances where env is expected to be non-empty when this function is invoked - that is, it seems like it's just used as a return value.
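A minimal sketch of the value-object alternative raised in this comment: `buildCommand()` would return a small `Command` bundling the argv list with the environment it needs, instead of mutating a caller-supplied map. The class name and accessors are hypothetical, not part of the launcher library.

```java
import java.util.List;
import java.util.Map;

final class Command {
    private final List<String> argv;
    private final Map<String, String> env;

    Command(List<String> argv, Map<String, String> env) {
        this.argv = List.copyOf(argv); // defensive, immutable copies
        this.env = Map.copyOf(env);
    }

    List<String> argv() { return argv; }
    Map<String, String> env() { return env; }
}

public class CommandSketch {
    public static void main(String[] args) {
        // A builder would construct and return this instead of filling in
        // a map passed by the caller.
        Command c = new Command(List.of("java", "-cp", "x.jar", "Main"),
                                Map.of("SPARK_HOME", "/opt/spark"));
        System.out.println(c.argv().size() + " " + c.env().get("SPARK_HOME"));
        // prints 4 /opt/spark
    }
}
```

Returning an immutable pair makes the data flow one-way and removes the in-place-mutation contract the review comment found non-obvious.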
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-74950927 We spoke a bit offline about this, but my feeling was that the best thing here might be to add a way to launch the shuffle service as a standalone application (initially, not one managed by Mesos) so that it can be shared across Spark applications. That would involve writing some simple launching scripts for it in a similar way to existing daemons we launch, and you'd ask users to launch the shuffle service similar to other storage systems like HDFS. That's very simple and would avoid diverging a lot between Mesos and the other modes. And longer term we could actually have a single shared shuffle service that is scheduled by mesos.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4221#issuecomment-74952993 Yeah our auto-close doesn't work on PRs into release branches like this.
[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4420#issuecomment-74946838 @mccheah @mingyukim yeah, there isn't an OOM-proof solution at all because these are all heuristics. Even checking every element is not OOM-proof, since memory estimation is itself a heuristic that involves sampling. My only concern with exposing knobs here is that users will expect us to support these going forward, even though we may want to refactor this in the future in a way where those knobs don't make sense anymore. It's reasonable that users would consider it a regression if their tuning of those knobs stopped working. So if possible, it would be good to adjust our heuristics to meet a wider range of use cases, and then if we keep hearing more issues we can expose knobs. We can't have them meet every possible use case, since they are heuristics, but in this case I was wondering if we could have a strict improvement to the heuristics. @andrewor14 can you comment on whether this is indeed a strict improvement? One of the main benefits of the new data frames API is that we will be able to have precise control over memory usage in a way that can avoid OOMs entirely. But for the current Spark API we are using this more ad-hoc memory estimation along with some heuristics. I'm not 100% against exposing knobs either, but I'd be interested to see if some simple improvements fix your use case.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-75008630 @vanzin about half way through reviewing... will pick up tomorrow