[GitHub] spark pull request: [SPARK-3615][Streaming]Fix Kafka unit test har...

2014-09-24 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/2483#discussion_r17955115
  
--- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -59,16 +58,35 @@ class KafkaStreamSuite extends TestSuiteBase {
 
   override def beforeFunction() {
 // Zookeeper server startup
-zookeeper = new EmbeddedZookeeper(zkConnect)
+zookeeper = new EmbeddedZookeeper(s"$zkHost:$zkPort")
+// Get the actual zookeeper binding port
+zkPort = zookeeper.actualPort
 logInfo( 0 )
-zkClient = new ZkClient(zkConnect, zkSessionTimeout, zkConnectionTimeout, ZKStringSerializer)
+
+zkClient = new ZkClient(s"$zkHost:$zkPort", zkSessionTimeout, zkConnectionTimeout,
+  ZKStringSerializer)
 logInfo( 1 )
 
 // Kafka broker startup
-server = new KafkaServer(brokerConf)
-logInfo( 2 )
-server.startup()
-logInfo( 3 )
+var bindSuccess: Boolean = false
+while(!bindSuccess) {
+  try {
+val brokerProps = getBrokerConfig(brokerPort, s"$zkHost:$zkPort")
+brokerConf = new KafkaConfig(brokerProps)
+server = new KafkaServer(brokerConf)
--- End diff --

Alright. Just one more round of testing, and I will merge it if it passes.
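
The retry loop in the diff above is cut off before its catch branch, so the PR's actual recovery logic is not visible here. As a rough illustration of the general pattern (retry a bind until it succeeds), here is a minimal sketch. `PortRetry`, `bindWithRetry` and `maxAttempts` are hypothetical names, and moving to the next port number on failure is an assumption, not something the excerpt confirms.

```scala
import java.net.{BindException, ServerSocket}

// Hypothetical helper, not taken from the pull request.
object PortRetry {
  /** Try startPort, then startPort + 1, ... until a bind succeeds. */
  def bindWithRetry(startPort: Int, maxAttempts: Int = 16): ServerSocket = {
    var port = startPort
    var attempts = 0
    var socket: ServerSocket = null
    while (socket == null) {
      try {
        // Binds and starts listening; throws BindException if the port is taken.
        socket = new ServerSocket(port)
      } catch {
        case e: BindException =>
          attempts += 1
          // Give up after maxAttempts failures instead of looping forever.
          if (attempts >= maxAttempts) throw e
          port += 1
      }
    }
    socket
  }
}
```

Unlike a bare `while(!bindSuccess)` loop, this sketch bounds the number of attempts so a persistent failure surfaces as an exception rather than a hung test.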


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1853] Show Streaming application code c...

2014-09-24 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/2464#issuecomment-56628837
  
@mubarak Thank you very much for this fix! It's finally merged!





[GitHub] spark pull request: [SPARK-3675][SQL] Allow starting a JDBC server...

2014-09-24 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/2515

[SPARK-3675][SQL] Allow starting a JDBC server on an existing context



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark jdbcExistingContext

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2515.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2515


commit 7866fad85ce89d38547ffed904ac9a3dbce1aed3
Author: Michael Armbrust mich...@databricks.com
Date:   2014-09-24T06:13:20Z

Allows starting a JDBC server on an existing context.







[GitHub] spark pull request: [SPARK-3675][SQL] Allow starting a JDBC server...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2515#issuecomment-56629604
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20744/consoleFull) for PR 2515 at commit [`7866fad`](https://github.com/apache/spark/commit/7866fad85ce89d38547ffed904ac9a3dbce1aed3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2514#issuecomment-56629824
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20742/





[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2514#issuecomment-56629822
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20742/consoleFull) for PR 2514 at commit [`83acb38`](https://github.com/apache/spark/commit/83acb38649ef41917130d7837ab9f4177fc3262d).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3615][Streaming]Fix Kafka unit test har...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2483#issuecomment-56632476
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20743/consoleFull) for PR 2483 at commit [`863`](https://github.com/apache/spark/commit/863830eb240f2b5b44a8991d0e45c49bfdaa).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3615][Streaming]Fix Kafka unit test har...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2483#issuecomment-56632481
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20743/





[GitHub] spark pull request: [SPARK-3675][SQL] Allow starting a JDBC server...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2515#issuecomment-56633081
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20744/





[GitHub] spark pull request: [SPARK-3675][SQL] Allow starting a JDBC server...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2515#issuecomment-56633079
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20744/consoleFull) for PR 2515 at commit [`7866fad`](https://github.com/apache/spark/commit/7866fad85ce89d38547ffed904ac9a3dbce1aed3).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Spark Core - [SPARK-3620] - Refactor of SparkS...

2014-09-24 Thread tigerquoll
GitHub user tigerquoll opened a pull request:

https://github.com/apache/spark/pull/2516

Spark Core - [SPARK-3620] - Refactor of SparkSubmit Argument parsing code

Argument processing seems to have gotten a lot of attention lately, so I 
thought I might throw my contribution into the ring.  Attached for 
consideration and to prompt discussion is a revamp of argument handling in 
SparkSubmit aimed at making things a lot more consistent. The only thing that 
has been modified is the way that configuration properties are read, 
processed and prioritised.

Things to note include:
* All configuration parameters can now be consistently set via config file

* Configuration parameter defaults have been removed from the code and 
placed into a property file which is read from the class path on startup.  
There should be no need to trace through 5 files to see what a config parameter 
defaults to if it is not specified, or to have different default values applied 
in multiple places throughout the code.

* Configuration parameter validation is now done once all configuration 
parameters have been read in and resolved from various locations, not just when 
reading the command line.

* All property files (including spark_default_conf) are parsed by Java 
property handling code. All custom parsing code has been removed. Escaping of 
characters should now be consistent everywhere.

* All configuration parameters are overridden in the same consistent way - 
configuration parameters for sparkSubmit are pulled from the following sources 
in order of priority:
 1. Entries specified on the command line (except for --conf entries)
 2. Entries specified on the command line with --conf
 3. Environment variables (including legacy variable mappings)
 4. System config variables (e.g. by using -Dspark.var.name)
 5. $(SPARK_DEFAULT_CONF)/spark-defaults.conf or 
$(SPARK_HOME)/conf/spark-defaults.conf if either exists
 6. Hard coded defaults in class path at spark-submit-defaults.prop

* A property file specified by one of the sources listed above gets read in 
and the properties are considered to be at the priority of the configuration 
source that specified the file. A property specified in a property file will 
not override an existing config value already specified by that configuration 
source.
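
To make the priority ordering above concrete, here is a minimal sketch of merging several property maps so that a higher-priority source always wins. It is only an illustration of the idea: `ConfigMerge`, `mergeByPriority` and the sample keys below are hypothetical and are not taken from this PR's code.

```scala
// Illustrative sketch only: ConfigMerge and its method name are hypothetical
// and do not come from the pull request.
object ConfigMerge {
  /** Merge property maps listed from highest to lowest priority. */
  def mergeByPriority(sources: Seq[Map[String, String]]): Map[String, String] =
    sources.foldLeft(Map.empty[String, String]) { (merged, source) =>
      // `merged` holds values from higher-priority sources; on a key clash
      // the right-hand operand of ++ wins, so those values are kept.
      source ++ merged
    }
}
```

For instance, if the first map sets spark.master to local[4] and a later map sets it to yarn, the merged result keeps local[4], while keys that appear only in later maps are still picked up.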

The existing argument handling is pretty finicky - chances are high that 
I’ve missed some behaviour - if this PR is going to be accepted/approved, let 
me know of any bugs and I’ll fix them up and document the behaviour for future 
reference.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tigerquoll/spark-3620 master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2516.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2516


commit b1a9682dd2bbff824c4e8481fa0ce5118c47de68
Author: Dale tigerqu...@outlook.com
Date:   2014-09-21T02:42:24Z

Initial pass at using typesafe's conf object for handling configuration 
options

commit 7bb5ee95b3f06147dba994e3d557221554415bfd
Author: Dale tigerqu...@outlook.com
Date:   2014-09-21T02:44:09Z

Added defaults file

commit e995a6d1e8ab898c85aa5fe259b81c630595075f
Author: Dale tigerqu...@outlook.com
Date:   2014-09-21T12:56:17Z

Existing tests now work

commit 00ee008c5652336d533d9619bc7e6306ed59138b
Author: Dale tigerqu...@outlook.com
Date:   2014-09-21T13:05:14Z

Existing tests now work

commit 295c62b067fb5204efb58892133c77fe49b877e0
Author: Dale tigerqu...@outlook.com
Date:   2014-09-22T22:04:45Z

Created mergedPropertyMap

commit f399170e1c05d75257ff6c508a96e64cadf0d87b
Author: Dale tigerqu...@outlook.com
Date:   2014-09-23T00:10:40Z

Moved sparkSubmitArguments module to use custom property map merging code

commit b0abe3196f9e5d3f577e158704740f1eee8fbb59
Author: Dale tigerqu...@outlook.com
Date:   2014-09-23T23:58:55Z

Merge branch 'master' of https://github.com/apache/spark

commit 562ec7c064e5ad632cf7aaa1720be29fe36b5c9a
Author: Dale tigerqu...@outlook.com
Date:   2014-09-23T23:59:52Z

note for additional tests

commit 86f71f8bb8291fe20a2f0ca0100727d583e97dfd
Author: Dale tigerqu...@outlook.com
Date:   2014-09-24T00:39:47Z

Changes needed to pass scalastyle check

commit 2019554ec307c8d3eabee7e4299cd8bac8faba0f
Author: Dale tigerqu...@outlook.com
Date:   2014-09-24T04:43:58Z

Changes needed to pass scalastyle check, merged from current 
SparkSubmit.scala

commit 8c416a04d064c1475a184785a9135d849c239bff
Author: Dale tigerqu...@outlook.com
Date:   2014-09-24T05:19:24Z

Fixed some typos

commit b69f58e65d919a689942866f59b11a7dcf2fbf91
Author: Dale tigerqu...@outlook.com
Date:   2014-09-24T07:08:01Z

Added spark.app.name to defaults list
   

[GitHub] spark pull request: Spark Core - [SPARK-3620] - Refactor of SparkS...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2516#issuecomment-56634577
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17957565
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
@@ -0,0 +1,430 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impl.{BaggedPoint, TreePoint, 
DecisionTreeMetadata, TimeTracker}
+import org.apache.spark.mllib.tree.impurity.Impurities
+import org.apache.spark.mllib.tree.model._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.util.Utils
+
+/**
+ * :: Experimental ::
+ * A class which implements a random forest learning algorithm for 
classification and regression.
+ * It supports both continuous and categorical features.
+ *
+ * @param strategy The configuration parameters for the random forest 
algorithm which specify
+ * the type of algorithm (classification, regression, 
etc.), feature type
+ * (continuous, categorical), depth of the tree, quantile 
calculation strategy,
+ * etc.
+ * @param numTrees If 1, then no bootstrapping is used.  If > 1, then bootstrapping is done.
+ * @param featureSubsetStrategy Number of features to consider for splits at each node.
+ *                              Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+ *                              If "auto" is set, this parameter is set based on numTrees:
+ *                                if numTrees == 1, then featureSubsetStrategy = "all";
+ *                                if numTrees > 1, then featureSubsetStrategy = "sqrt".
+ * @param seed  Random seed for bootstrapping and choosing feature subsets.
+ */
+@Experimental
+private class RandomForest (
+private val strategy: Strategy,
+private val numTrees: Int,
+featureSubsetStrategy: String,
+private val seed: Int)
+  extends Serializable with Logging {
+
+  strategy.assertValid()
+  require(numTrees > 0, s"RandomForest requires numTrees > 0, but was given numTrees = $numTrees.")
+  require(RandomForest.supportedFeatureSubsetStrategies.contains(featureSubsetStrategy),
+s"RandomForest given invalid featureSubsetStrategy: $featureSubsetStrategy." +
+s" Supported values: ${RandomForest.supportedFeatureSubsetStrategies.mkString(", ")}.")
+
+  /**
+   * Method to train a decision tree model over an RDD
+   * @param input Training data: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @return RandomForestModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): RandomForestModel = {
+
+val timer = new TimeTracker()
+
+timer.start("total")
+
+timer.start("init")
+
+val retaggedInput = input.retag(classOf[LabeledPoint])
+val metadata =
+  DecisionTreeMetadata.buildMetadata(retaggedInput, strategy, 
numTrees, featureSubsetStrategy)
+logDebug("algo = " + strategy.algo)
+logDebug("numTrees = " + numTrees)
+logDebug("seed = " + seed)
+logDebug("maxBins = " + metadata.maxBins)
+logDebug("featureSubsetStrategy = " + featureSubsetStrategy)
+logDebug("numFeaturesPerNode = " + metadata.numFeaturesPerNode)
+
+// Find the splits and the corresponding bins (interval between the 
splits) using a sample
+// of the input data.
+timer.start("findSplitsBins")
+val (splits, bins) = 

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17957573
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---

[GitHub] spark pull request: [SPARK-3032][Shuffle] Fix key comparison integ...

2014-09-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2514#discussion_r17957564
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -152,7 +152,7 @@ private[spark] class ExternalSorter[K, V, C](
 override def compare(a: K, b: K): Int = {
   val h1 = if (a == null) 0 else a.hashCode()
   val h2 = if (b == null) 0 else b.hashCode()
-  h1 - h2
+  if (h1 < h2) -1 else if (h1 == h2) 0 else 1
--- End diff --

@mateiz per my comment, that would no longer run in Java 6 as 
`Integer.compare` doesn't exist before Java 7.
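
For context on why the subtraction is being replaced, here is a small stand-alone sketch contrasting the two comparisons. `HashCompare` and its method names are hypothetical and exist only for this illustration. The explicit form mirrors the patched line and does not call `Integer.compare`, so it does not depend on Java 7.

```scala
// Hypothetical illustration, not code from the pull request.
object HashCompare {
  // Subtraction-based comparison: the Int result wraps around when the two
  // operands are far apart, which can flip the sign.
  def bySubtraction(h1: Int, h2: Int): Int = h1 - h2

  // Explicit three-way comparison; no arithmetic, so no overflow.
  def explicit(h1: Int, h2: Int): Int =
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
}
```

With `h1 = Int.MinValue` and `h2 = 1`, the subtraction wraps to `Int.MaxValue`, so the first form reports `h1` as greater than `h2`, while the explicit form returns -1.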





[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17957577
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the [[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   * @param nodeOffset  Pre-computed node offset from [[getNodeOffset]].
+   */
+  def nodeUpdate(
+  nodeOffset: Int,
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeOffset + featureOffsets(featureIndex) + binIndex * 
statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
+   * For ordered features only.
+   */
+  def getNodeFeatureOffset(nodeIndex: Int, featureIndex: Int): Int = {
+require(!isUnordered(featureIndex),
+  s"DTStatsAggregator.getNodeFeatureOffset is for ordered features only, but was called" +
+  s" for unordered feature $featureIndex.")
+nodeIndex * nodeStride + featureOffsets(featureIndex)
+  }
+
+  /**
+   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
+   * For unordered features only.
+   */
+  def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): 
(Int, Int) = {
+require(isUnordered(featureIndex),
--- End diff --

Will do.





[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17957595
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the [[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
--- End diff --

I'll see about improving this, though I may go with just using allStats 
everywhere.
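
For reference, the indexing scheme described in the doc comments of the diff can be sketched in isolation. This is a minimal sketch, not the PR's class: `FlatStatsIndexSketch`, `statsIndex`, and the explicit `statsSize` parameter are hypothetical names introduced for illustration. Only the `scanLeft` offset computation and the index formula are taken from the diff.

```scala
// Hypothetical stand-in for the layout described in the diff; not the Spark class itself.
object FlatStatsIndexSketch {

  // One offset per feature within a node's block, plus a final entry holding
  // the total number of elements per node (the node stride).
  def featureOffsets(numBins: Array[Int], statsSize: Int): Array[Int] =
    numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)

  // Start index of the stats for a (node, feature, bin) triple in the flat array.
  def statsIndex(
      offsets: Array[Int],
      statsSize: Int,
      nodeIndex: Int,
      featureIndex: Int,
      binIndex: Int): Int = {
    val nodeStride = offsets.last
    nodeIndex * nodeStride + offsets(featureIndex) + binIndex * statsSize
  }
}
```

With `numBins = Array(3, 2)` and `statsSize = 2`, the offsets are `0, 6, 10`, so each node occupies 10 slots and the stats for node 1, feature 1, bin 1 start at index 18.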





[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17957666
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
--- End diff --

I was curious too.  I just ran some experiments on EC2.  With 1 worker, 
there is basically no difference.  With 16 workers, there is a difference when 
there are lots of ordered features (where this function nodeUpdate is used): 
eliminating nodeUpdate and using update makes things run about 5% slower.  I 
will keep it for now.
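
The saving presumably comes from computing `nodeIndex * nodeStride` once per instance instead of once per feature. A minimal sketch of that idea follows, assuming a flat `Array[Double]` laid out as in the diff. `NodeOffsetSketch` and `addWeights` are hypothetical names; the sketch does not reproduce the signature of `nodeUpdate`, which the quoted diff does not show, and it adds a weight directly rather than going through `impurityAggregator`.

```scala
// Hypothetical illustration of hoisting the node offset; not the PR's nodeUpdate.
object NodeOffsetSketch {

  // Adds `weight` to the first stats slot of one bin per feature for a single node.
  // `bins(f)` is the bin index chosen for feature f.
  def addWeights(
      stats: Array[Double],
      featureOffsets: Array[Int],
      statsSize: Int,
      nodeIndex: Int,
      bins: Array[Int],
      weight: Double): Unit = {
    // Computed once, then reused for every feature of this instance.
    val nodeOffset = nodeIndex * featureOffsets.last
    var f = 0
    while (f < bins.length) {
      val i = nodeOffset + featureOffsets(f) + bins(f) * statsSize
      stats(i) += weight
      f += 1
    }
  }
}
```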





[GitHub] spark pull request: [SPARK-3676][Sql]spark sql hive test suite fai...

2014-09-24 Thread scwf
GitHub user scwf opened a pull request:

https://github.com/apache/spark/pull/2517

[SPARK-3676][Sql]spark sql hive test suite failed in JDK 1.6

https://issues.apache.org/jira/browse/SPARK-3676
The Spark SQL Hive test suite fails on JDK 1.6; you can reproduce this by 
setting the JDK version to 1.6.0_31.
[info] - division *** FAILED ***
[info] Results do not match for division:
[info] SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT FROM src LIMIT 1
[info] == Parsed Logical Plan ==
[info] Limit 1
[info] Project (2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS c_2#694,(1 
/ COUNT(1)) AS c_3#695
[info] UnresolvedRelation None, src, None
[info] 
[info] == Analyzed Logical Plan ==
[info] Limit 1
[info] Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info] MetastoreRelation default, src, None
[info] 
[info] == Optimized Logical Plan ==
[info] Limit 1
[info] Aggregate [], 2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695
[info] Project []
[info] MetastoreRelation default, src, None
[info] 
[info] == Physical Plan ==
[info] Limit 1
[info] Aggregate false, [], 2.0 AS c_0#692,0.5 AS 
c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
DoubleType)) AS c_3#695
[info] Exchange SinglePartition
[info] Aggregate true, [], COUNT(1) AS PartialCount#699L
[info] HiveTableScan [], (MetastoreRelation default, src, None), None
[info] 
[info] Code Generation: false
[info] == RDD ==
[info] c_0 c_1 c_2 c_3
[info] !== HIVE - 1 row(s) == == CATALYST - 1 row(s) ==
[info] !2.0 0.5 0. 0.002 2.0 0.5 0. 0.0020 
(HiveComparisonTest.scala:370)
[info] - timestamp cast #1 *** FAILED ***
[info] Results do not match for timestamp cast #1:
[info] SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
[info] == Parsed Logical Plan ==
[info] Limit 1
[info] Project CAST(CAST(1, TimestampType), DoubleType) AS c_0#995
[info] UnresolvedRelation None, src, None
[info] 
[info] == Analyzed Logical Plan ==
[info] Limit 1
[info] Project CAST(CAST(1, TimestampType), DoubleType) AS c_0#995
[info] MetastoreRelation default, src, None
[info] 
[info] == Optimized Logical Plan ==
[info] Limit 1
[info] Project 0.0010 AS c_0#995
[info] MetastoreRelation default, src, None
[info] 
[info] == Physical Plan ==
[info] Limit 1
[info] Project 0.0010 AS c_0#995
[info] HiveTableScan [], (MetastoreRelation default, src, None), None
[info] 
[info] Code Generation: false
[info] == RDD ==
[info] c_0
[info] !== HIVE - 1 row(s) == == CATALYST - 1 row(s) ==
[info] !0.001 0.0010 (HiveComparisonTest.scala:370)


This is because different JDK versions format ```double``` values differently: 
the output of ```System.out.println(1/500d)``` depends on the JDK version:
jdk 1.6.0(_31)  0.0020
jdk 1.7.0(_05)  0.002
As a result, HiveQuerySuite fails when the golden answers are generated on JDK 1.7 
and the tests are run on JDK 1.6, because the results do not match.
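
For illustration, a comparison that parses numeric cells instead of comparing their printed form would treat `0.002` and `0.0020` as equal. This is only a sketch of that alternative, not what this pull request does (the commits listed below delete the affected golden answers). `DoubleCompareSketch`, `approxEqual`, `cellsMatch`, and the tolerance value are hypothetical.

```scala
// Hypothetical helper; not part of HiveComparisonTest.
object DoubleCompareSketch {

  // True when two doubles are equal or agree within a relative tolerance.
  def approxEqual(a: Double, b: Double, relTol: Double = 1e-9): Boolean =
    a == b || math.abs(a - b) <= relTol * math.max(math.abs(a), math.abs(b))

  // Compares two result cells numerically when both parse as doubles,
  // and falls back to exact string equality otherwise.
  def cellsMatch(expected: String, actual: String): Boolean =
    try approxEqual(expected.toDouble, actual.toDouble)
    catch { case _: NumberFormatException => expected == actual }
}
```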



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/scwf/spark HiveQuerySuite

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2517


commit 1df3964f1ff99aa93ed5f556675fe0d6d0285401
Author: w00228970 wangf...@huawei.com
Date:   2014-09-24T06:44:54Z

Jdk version leads to different query output for Double, this make 
HiveQuerySuite failed

commit 0cb5e8d6c45f6587497ec854353b96b2d6f536e8
Author: w00228970 wangf...@huawei.com
Date:   2014-09-24T06:53:05Z

delete golden answer of division-0 and timestamp cast #1







[GitHub] spark pull request: [SPARK-3676][Sql]spark sql hive test suite fai...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2517#issuecomment-56636736
  
Can one of the admins verify this patch?





[GitHub] spark pull request: SPARK-3663 Document SPARK_LOG_DIR and SPARK_PI...

2014-09-24 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/2518

SPARK-3663 Document SPARK_LOG_DIR and SPARK_PID_DIR

These descriptions are from the header of spark-daemon.sh

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark SPARK-3663

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2518


commit af89096fd93a6c85ce5828268ba546fc691f3e3b
Author: Andrew Ash and...@andrewash.com
Date:   2014-09-24T08:07:21Z

SPARK-3663 Document SPARK_LOG_DIR and SPARK_PID_DIR

These descriptions are from the header of spark-daemon.sh







[GitHub] spark pull request: SPARK-3663 Document SPARK_LOG_DIR and SPARK_PI...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2518#issuecomment-56638326
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20745/consoleFull)
 for   PR 2518 at commit 
[`af89096`](https://github.com/apache/spark/commit/af89096fd93a6c85ce5828268ba546fc691f3e3b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3676][Sql]spark sql hive test suite fai...

2014-09-24 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2517#issuecomment-56639208
  
actually this is a bug in jdk6
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022





[GitHub] spark pull request: SPARK-3642. Document the nuances of shared var...

2014-09-24 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2490#discussion_r17959346
  
--- Diff: docs/programming-guide.md ---
@@ -1121,6 +1121,11 @@ than shipping a copy of it with tasks. They can be 
used, for example, to give ev
 large input dataset in an efficient manner. Spark also attempts to 
distribute broadcast variables
 using efficient broadcast algorithms to reduce communication cost.
 
+Spark automatically broadcasts the common data needed by tasks within each 
stage. The data
+broadcasted this way is cached in serialized form and deserialized before 
running each task. This
+means that explicitly creating broadcast variables is only useful when 
tasks across multiple stages
--- End diff --

The concept of a stage is mentioned only in the two added paragraphs. Users 
new to Spark may not know the internals and the execution mechanism. It would 
be nice if some background were introduced here. 





[GitHub] spark pull request: SPARK-3642. Document the nuances of shared var...

2014-09-24 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2490#discussion_r17959656
  
--- Diff: docs/programming-guide.md ---
@@ -1183,6 +1188,10 @@ running on the cluster can then add to it using the 
`add` method or the `+=` ope
 However, they cannot read its value.
 Only the driver program can read the accumulator's value, using its 
`value` method.
 
+The same task may run multiple times, either when its output data becomes 
lost or when multiple
--- End diff --

"The same task" can mean the same task id or the same computation in a 
stage. Two tasks that have the same computation may have different task ids. It 
would be nice if some background were introduced here, e.g. the relationship 
between stages and task sets. 





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/2519

SPARK-3526 Add section about data locality to the tuning guide

cc @kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY?  I understand the 
implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL?  I would think it'd be better to 
schedule rack-local tasks ahead of no preference if you could only do one or 
the other.  Is the idea to wait longer and hope for the rack-local tasks to 
turn into node-local or better?
- Will there be a datacenter-local locality level in the future?  Apache 
Cassandra for example has this level

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark SPARK-3526

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2519.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2519


commit 20e0e31158fe0350b8f59617f2228a48c34274ef
Author: Andrew Ash and...@andrewash.com
Date:   2014-09-24T08:50:07Z

SPARK-3526 Add section about data locality to the tuning guide







[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56642802
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20746/consoleFull)
 for   PR 2519 at commit 
[`20e0e31`](https://github.com/apache/spark/commit/20e0e31158fe0350b8f59617f2228a48c34274ef).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3663 Document SPARK_LOG_DIR and SPARK_PI...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2518#issuecomment-56645105
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20745/consoleFull)
 for   PR 2518 at commit 
[`af89096`](https://github.com/apache/spark/commit/af89096fd93a6c85ce5828268ba546fc691f3e3b).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-3663 Document SPARK_LOG_DIR and SPARK_PI...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2518#issuecomment-56645112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20745/





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-56647000
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20747/consoleFull)
 for   PR 2388 at commit 
[`7bc691a`](https://github.com/apache/spark/commit/7bc691ab142edba8a127937dfbd836d5738f6527).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56649713
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20746/





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56649704
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20746/consoleFull)
 for   PR 2519 at commit 
[`20e0e31`](https://github.com/apache/spark/commit/20e0e31158fe0350b8f59617f2228a48c34274ef).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread sarutak
GitHub user sarutak opened a pull request:

https://github.com/apache/spark/pull/2520

[SPARK-3677] [BUILD] [YARN] Scalastyle is never applyed to the sources 
under yarn/common



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sarutak/spark yarn-scalastyle-modification

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2520


commit f7f4755252077dd3b79c928d95ac67ee51bbe9e8
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date:   2014-09-24T10:15:18Z

Modified SparkBuild.scala so that scalastyle is applied to the sources 
under yarn/common

Modified style for some sources under yarn/common







[GitHub] spark pull request: [SPARK-3304] [YARN] ApplicationMaster's Finish...

2014-09-24 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2198#issuecomment-56650781
  
@tgravescs Thanks for your notification.
I found the issue that prevents scalastyle from being applied to 
yarn/common.
I resolved this issue in #2520.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56650830
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20748/consoleFull)
 for   PR 2520 at commit 
[`f7f4755`](https://github.com/apache/spark/commit/f7f4755252077dd3b79c928d95ac67ee51bbe9e8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...

2014-09-24 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2508#issuecomment-56651085
  
@mateiz Got it. On the zip methods, I want to capture the key point from 
https://issues.apache.org/jira/browse/SPARK-3098 , that the ordering is not 
only not guaranteed but also may change on reevaluation. I hope that wording is 
OK to retain and merge into yours.

I'll find some place in the programming guide to note this, and remove 
wording about persist and/or replace with suggestion to sort the RDD.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56651140
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20748/





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56651137
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20748/consoleFull)
 for   PR 2520 at commit 
[`f7f4755`](https://github.com/apache/spark/commit/f7f4755252077dd3b79c928d95ac67ee51bbe9e8).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56651663
  
retest this please.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56652132
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20749/consoleFull)
 for   PR 2520 at commit 
[`f7f4755`](https://github.com/apache/spark/commit/f7f4755252077dd3b79c928d95ac67ee51bbe9e8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56652447
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20749/





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56652443
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20749/consoleFull)
 for   PR 2520 at commit 
[`f7f4755`](https://github.com/apache/spark/commit/f7f4755252077dd3b79c928d95ac67ee51bbe9e8).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3304] [YARN] ApplicationMaster's Finish...

2014-09-24 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2198#issuecomment-56652661
  
@tgravescs Sorry, I got something wrong. Please wait a little.





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-56653261
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20747/consoleFull)
 for   PR 2388 at commit 
[`7bc691a`](https://github.com/apache/spark/commit/7bc691ab142edba8a127937dfbd836d5738f6527).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, 
SSV)],`
  * `class TopicModelingKryoRegistrator extends KryoRegistrator `






[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-56653270
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20747/





[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2508#issuecomment-56661626
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20750/consoleFull)
 for   PR 2508 at commit 
[`ad4aeec`](https://github.com/apache/spark/commit/ad4aeec504ad07269511a2aad843a5b815dfcf5d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-5764
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20751/consoleFull)
 for   PR 2520 at commit 
[`c3e5e6d`](https://github.com/apache/spark/commit/c3e5e6d47f37fc5b40db1050ff100e11cf48bd52).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2508#issuecomment-56669661
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20750/





[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2508#issuecomment-56669653
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20750/consoleFull)
 for   PR 2508 at commit 
[`ad4aeec`](https://github.com/apache/spark/commit/ad4aeec504ad07269511a2aad843a5b815dfcf5d).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2788] [STREAMING] Add location filterin...

2014-09-24 Thread sjbrunst
Github user sjbrunst commented on a diff in the pull request:

https://github.com/apache/spark/pull/1717#discussion_r17972016
  
--- Diff: external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterUtils.scala ---
@@ -33,15 +33,38 @@ object TwitterUtils {
    *        twitter4j.oauth.consumerSecret, twitter4j.oauth.accessToken and
    *        twitter4j.oauth.accessTokenSecret
    * @param filters Set of filter strings to get only those tweets that match them
+   * @param locations   Bounding boxes to get only geotagged tweets within them. Example:
+        Seq(BoundingBox(-180.0,-90.0,180.0,90.0)) gives any geotagged tweet. If locations and
+        filters are both nonempty, then any tweet matching either condition may be returned.
    * @param storageLevel Storage level to use for storing the received objects
    */
   def createStream(
       ssc: StreamingContext,
       twitterAuth: Option[Authorization],
       filters: Seq[String] = Nil,
+      locations: Seq[BoundingBox] = Nil,
--- End diff --

It looks like I'm changing the method here, but this whole method is new. 
The original one is below.
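
For readers following the diff, the "either condition" behaviour described in the new `@param locations` doc comment can be sketched roughly as below. This is only an illustration written in Python, not the code from the pull request: the `BoundingBox` tuple, the `matches` helper and the tweet dictionary layout are hypothetical, the coordinate order (west, south, east, north) is assumed from the example `BoundingBox(-180.0,-90.0,180.0,90.0)`, and treating "no filters and no locations" as "match everything" is also an assumption.

```python
from typing import NamedTuple, Sequence


class BoundingBox(NamedTuple):
    # Hypothetical stand-in for the BoundingBox type in the diff.
    # Field order (west, south, east, north) is an assumption.
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        # True when the point lies inside the box, borders included.
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def matches(tweet: dict,
            filters: Sequence[str] = (),
            locations: Sequence[BoundingBox] = ()) -> bool:
    """Return True if the tweet satisfies the keyword OR the location condition."""
    if not filters and not locations:
        # Assumption: with no conditions at all, nothing is filtered out.
        return True
    text = tweet.get("text", "").lower()
    keyword_hit = any(f.lower() in text for f in filters)
    geo = tweet.get("geo")  # (lon, lat) tuple, or None when the tweet is not geotagged
    location_hit = geo is not None and any(b.contains(*geo) for b in locations)
    # Either condition is enough, mirroring the doc comment in the diff.
    return keyword_hit or location_hit
```

Under these assumptions, a whole-world box such as `BoundingBox(-180.0, -90.0, 180.0, 90.0)` accepts any geotagged tweet, while a tweet without coordinates is accepted only if it matches one of the filter strings.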





[GitHub] spark pull request: [SPARK-2788] [STREAMING] Add location filterin...

2014-09-24 Thread sjbrunst
Github user sjbrunst commented on the pull request:

https://github.com/apache/spark/pull/1717#issuecomment-56673018
  
@tdas The current version of TwitterUtils.scala only has new methods. The 
diff makes it look like I changed the original methods, but they are all there. 
The original unit tests from the StreamSuites pass, so I don't know why we're 
still getting the binary compatibility error.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56675917
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20751/





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56675904
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20751/consoleFull)
 for   PR 2520 at commit 
[`c3e5e6d`](https://github.com/apache/spark/commit/c3e5e6d47f37fc5b40db1050ff100e11cf48bd52).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3454] Expose JSON representation of dat...

2014-09-24 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2333#issuecomment-56677764
  
Thank you for your work, @JoshRosen!
I'll check it out.





[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2432#issuecomment-56678722
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20752/consoleFull)
 for   PR 2432 at commit 
[`086ee25`](https://github.com/apache/spark/commit/086ee252424f1862998957327ef3c70ff1a5650b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3389] Add Converter for ease of Parquet...

2014-09-24 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/2256#issuecomment-56679585
  
Hey - I'm traveling at the moment without laptop access, so I will be able to 
check it out tomorrow evening - hope that's ok :)

On Wed, Sep 24, 2014 at 4:50 AM, Matei Zaharia notificati...@github.com
wrote:

 It looks like we can merge it without a rebase. I'll wait to see whether 
Nick has any comments because he built this feature.
 ---
 Reply to this email directly or view it on GitHub:
 https://github.com/apache/spark/pull/2256#issuecomment-56618758





[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2432#issuecomment-56690206
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20752/consoleFull)
 for   PR 2432 at commit 
[`086ee25`](https://github.com/apache/spark/commit/086ee252424f1862998957327ef3c70ff1a5650b).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2432#issuecomment-56690217
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20752/





[GitHub] spark pull request: [SPARK-3645][SQL] Makes table caching eager by...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2513#issuecomment-56690446
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20753/consoleFull)
 for   PR 2513 at commit 
[`8d2192d`](https://github.com/apache/spark/commit/8d2192daa3bd2df2c686aa94e46a95dfb0540f08).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2778] [yarn] Add yarn integration tests...

2014-09-24 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2257#issuecomment-56694970
  
I'll merge with master and see if I can reproduce the failure...





[GitHub] spark pull request: [SPARK-2778] [yarn] Add yarn integration tests...

2014-09-24 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2257#issuecomment-56696661
  
Yep, fails locally too after the merge. Let me look.





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread jyotiska
GitHub user jyotiska opened a pull request:

https://github.com/apache/spark/pull/2521

Python SQL Example Code

SQL example code for Python, as shown on [SQL Programming 
Guide](https://spark.apache.org/docs/1.0.2/sql-programming-guide.html)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jyotiska/spark sql_example

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2521.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2521


commit 8f67b5b9152bbc4ab22de48198fdc2aa6f2fb6ab
Author: jyotiska jyotiska...@gmail.com
Date:   2014-09-24T16:25:54Z

added python sql example

commit 0b4614800a852bba709815a393dda0370049901e
Author: jyotiska jyotiska...@gmail.com
Date:   2014-09-24T16:27:56Z

fixed appname for python sql example







[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56698292
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20754/consoleFull)
 for   PR 2521 at commit 
[`0b46148`](https://github.com/apache/spark/commit/0b4614800a852bba709815a393dda0370049901e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56698293
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20755/consoleFull)
 for   PR 2520 at commit 
[`3858089`](https://github.com/apache/spark/commit/3858089fd149ed92a9c27a2308c77f96f1c9a964).
 * This patch merges cleanly.





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56698439
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20754/





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56698438
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20754/consoleFull)
 for   PR 2521 at commit 
[`0b46148`](https://github.com/apache/spark/commit/0b4614800a852bba709815a393dda0370049901e).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-24 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/2494#issuecomment-56698570
  
@mengxr doesn't look like the tests started -- maybe Jenkins ignores 
comments that address users? Thanks!





[GitHub] spark pull request: [SPARK-3645][SQL] Makes table caching eager by...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2513#issuecomment-56698729
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20753/consoleFull)
 for   PR 2513 at commit 
[`8d2192d`](https://github.com/apache/spark/commit/8d2192daa3bd2df2c686aa94e46a95dfb0540f08).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class CacheTableCommand(tableName: String, plan: 
Option[LogicalPlan], isLazy: Boolean)`
  * `case class UncacheTableCommand(tableName: String) extends Command`
  * `case class CacheTableCommand(tableName: String, logicalPlan: 
Option[LogicalPlan], isLazy: Boolean)`
  * `case class UncacheCommand(tableName: String) extends LeafNode with 
Command `
  * `case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(`






[GitHub] spark pull request: [SPARK-3645][SQL] Makes table caching eager by...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2513#issuecomment-56698738
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20753/





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56700932
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20756/consoleFull)
 for   PR 2521 at commit 
[`c90502a`](https://github.com/apache/spark/commit/c90502a62c1114cee15194c1190733e75889d0d1).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56700939
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20756/





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56700691
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20756/consoleFull)
 for   PR 2521 at commit 
[`c90502a`](https://github.com/apache/spark/commit/c90502a62c1114cee15194c1190733e75889d0d1).
 * This patch merges cleanly.





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56703017
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20757/consoleFull)
 for   PR 2521 at commit 
[`306667e`](https://github.com/apache/spark/commit/306667e1fb905c38c8753520467b95dd27406f70).
 * This patch merges cleanly.





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-24 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2485#issuecomment-56706929
  
@pwendell @mateiz @andrewor14  can any of you kick jenkins?





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2485#issuecomment-56707385
  
I just kicked it from the `spark-prs` parameterized build trigger; let's 
wait and see if it starts...





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2485#issuecomment-56707584
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/146/consoleFull)
 for   PR 2485 at commit 
[`f00fa31`](https://github.com/apache/spark/commit/f00fa311945c1eafa8957eae5c84719521761dcd).
 * This patch **does not** merge cleanly!





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-24 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2485#issuecomment-56707989
  
Ah, sorry, it looks like something conflicts now and this needs to be upmerged.

@nishkamravi2, can you please upmerge?





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2521#discussion_r17987196
  
--- Diff: examples/src/main/python/sql.py ---
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+
+from pyspark import SparkContext
+from pyspark.sql import SQLContext
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print >> sys.stderr, "Usage: sql <file>"
+        exit(-1)
+    sc = SparkContext(appName="PythonSQL")
+    sqlContext = SQLContext(sc)
+
+    # A JSON dataset is pointed to by path.
+    # The path can be either a single text file or a directory storing text files.
+    path = "examples/src/main/resources/people.json"
--- End diff --

This assumes that the script will be run from SPARK_HOME; it will break if 
the user runs it from SPARK_HOME/examples/src/main/python.
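
One way to make the path independent of the working directory is to resolve it against the `SPARK_HOME` environment variable when it is set. This is only a minimal sketch of that idea, not what the PR does: `resolve_example_path` and its `environ` parameter are hypothetical names introduced here for illustration.

```python
import os


def resolve_example_path(relative_path, environ=None):
    """Return an absolute path for a file shipped with the examples.

    Hypothetical helper: prefers SPARK_HOME when it is set, otherwise
    falls back to resolving against the current working directory.
    """
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if spark_home:
        return os.path.join(spark_home, relative_path)
    return os.path.abspath(relative_path)


path = resolve_example_path("examples/src/main/resources/people.json")
```

With this, the script would find the file from any directory as long as `SPARK_HOME` is exported; without it, the behaviour is the same as the hard-coded relative path.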






[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56709666
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20755/





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56709659
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20755/consoleFull)
 for   PR 2520 at commit 
[`3858089`](https://github.com/apache/spark/commit/3858089fd149ed92a9c27a2308c77f96f1c9a964).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread jyotiska
Github user jyotiska commented on a diff in the pull request:

https://github.com/apache/spark/pull/2521#discussion_r17987266
  
--- Diff: examples/src/main/python/sql.py ---
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+
+from pyspark import SparkContext
+from pyspark.sql import SQLContext
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print >> sys.stderr, "Usage: sql <file>"
+        exit(-1)
+    sc = SparkContext(appName="PythonSQL")
+    sqlContext = SQLContext(sc)
+
+    # A JSON dataset is pointed to by path.
+    # The path can be either a single text file or a directory storing text files.
+    path = "examples/src/main/resources/people.json"
--- End diff --

In that case, should the JSON file be supplied as `sys.argv[1]`?
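
A rough sketch of what taking the file from the command line could look like, with the bundled file as a fallback. `json_path_from_args` and `DEFAULT_PATH` are hypothetical names, and the optional-argument behaviour is an assumption for illustration, not something decided in this thread.

```python
import sys

# Assumed default: the sample file referenced in the diff above.
DEFAULT_PATH = "examples/src/main/resources/people.json"


def json_path_from_args(argv):
    """Pick the JSON path from argv, falling back to the bundled sample."""
    if len(argv) > 2:
        # Too many arguments: exit with a usage message.
        raise SystemExit("Usage: sql [file]")
    return argv[1] if len(argv) == 2 else DEFAULT_PATH


if __name__ == "__main__":
    path = json_path_from_args(sys.argv)
```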





[GitHub] spark pull request: [SPARK-2778] [yarn] Add yarn integration tests...

2014-09-24 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2257#issuecomment-56709844
  
I found the problem - it was caused by a recent PR that basically broke 
yarn-cluster mode...





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56710085
  
This example only demonstrates jsonFile(); it would be more useful if it 
also showed some usage of `inferSchema()` and `applySchema()`.





[GitHub] spark pull request: [SPARK-2778] [yarn] Add yarn integration tests...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2257#issuecomment-56710658
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20758/consoleFull)
 for   PR 2257 at commit 
[`6d5b84e`](https://github.com/apache/spark/commit/6d5b84e8b5987683591d8c07b3ff8557d9581871).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2778] [yarn] Add yarn integration tests...

2014-09-24 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2257#discussion_r17987981
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -401,17 +401,17 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments,
   // it has an uncaught exception thrown out.  It needs a shutdown hook to set SUCCEEDED.
   status = FinalApplicationStatus.SUCCEEDED
 } catch {
-  case e: InvocationTargetException => {
+  case e: InvocationTargetException =>
 e.getCause match {
-  case _: InterruptedException => {
+  case _: InterruptedException =>
 // Reporter thread can interrupt to stop user class
-  }
+
+  case e => throw e
--- End diff --

I don't think you need this, right?





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] Scalastyle is neve...

2014-09-24 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-56711425
  
LGTM. I don't really understand why you need to tell sbt again where the 
sources are (after all, sbt does build the yarn code properly), but then I'm 
not an sbt expert.





[GitHub] spark pull request: [SPARK-2778] [yarn] Add yarn integration tests...

2014-09-24 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2257#issuecomment-56711888
  
Ah good catch. The latest changes LGTM if you get the tests to pass.





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56713322
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20757/





[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2521#issuecomment-56713314
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20757/consoleFull)
 for   PR 2521 at commit 
[`306667e`](https://github.com/apache/spark/commit/306667e1fb905c38c8753520467b95dd27406f70).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3679] [PySpark] pickle the exact global...

2014-09-24 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2522

[SPARK-3679] [PySpark] pickle the exact globals of functions

function.func_code.co_names holds all the names used in the function, 
including attribute names. As a result, an unnecessary global gets pickled 
whenever a global happens to share its name with an attribute in co_names.
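
The effect described above can be reproduced in a few lines. This is a sketch only, written against Python 3 (`__code__` is the Python 3 spelling of `func_code`). Scanning for `LOAD_GLOBAL` instructions with `dis` is one way to recover the exact globals and is an assumption here, not necessarily how the patch does it; `names_used` and `globals_loaded` are hypothetical helpers.

```python
import dis

foo = "a global that f never reads"
bar = 1


def f(obj):
    # 'foo' is only an attribute name here; 'bar' is a real global lookup.
    return obj.foo + bar


def names_used(func):
    """All names in co_names: globals *and* attribute names."""
    return set(func.__code__.co_names)


def globals_loaded(func):
    """Only the names the bytecode actually loads as globals."""
    return {
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname == "LOAD_GLOBAL"
    }


print(sorted(names_used(f)))
print(sorted(globals_loaded(f)))
```

Selecting globals by `co_names` would pick up the unrelated global `foo` for `f`, while the instruction scan keeps only `bar`. The sketch does not descend into nested code objects (inner functions, lambdas), which a real implementation would need to handle.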

There is a regression introduced by #2114 

cc @JoshRosen 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark globals

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2522


commit dfbccf5c92333da8ab835fc4730aadc844e9f895
Author: Davies Liu davies@gmail.com
Date:   2014-09-24T18:23:10Z

fix bug while pickle globals of function







[GitHub] spark pull request: [Build] Diff from branch point

2014-09-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2512#issuecomment-56717149
  
Looks good - thanks Nick.





[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2478#issuecomment-56717218
  
I think `len(rdd)` has the potential to be confused with `rdd.count()`, 
since calling `len()` on a Python collection usually returns the size of that 
collection.

I also agree that we shouldn't expose Java `Partition` objects to users.  
Is there any reason to expose `Partition` objects besides allowing 
`len(rdd.partitions())` to work?  If not, I'm not sure that we should add this 
feature.
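
To illustrate the ambiguity, here is a toy stand-in (not PySpark itself). `ToyRDD` is a hypothetical class; the method names `getNumPartitions()` and `count()` mirror the existing RDD methods. A `len()` on such an object could plausibly mean either number, which is the concern.

```python
class ToyRDD(object):
    """Hypothetical stand-in for an RDD: a list of partitions of elements."""

    def __init__(self, partitions):
        self._partitions = [list(p) for p in partitions]

    def getNumPartitions(self):
        # Number of partitions, without exposing partition objects.
        return len(self._partitions)

    def count(self):
        # Number of elements across all partitions.
        return sum(len(p) for p in self._partitions)


rdd = ToyRDD([[1, 2, 3], [4, 5]])
print(rdd.getNumPartitions(), rdd.count())
```

An explicit method for the partition count keeps the two meanings apart and avoids handing Java `Partition` objects to Python users.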





[GitHub] spark pull request: [SPARK-3659] Set EC2 version to 1.1.0 and upda...

2014-09-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2510#issuecomment-56717292
  
Looks good, thanks Shivaram. I'll merge this.





[GitHub] spark pull request: [SPARK-3679] [PySpark] pickle the exact global...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2522#issuecomment-56717278
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20759/consoleFull)
 for   PR 2522 at commit 
[`dfbccf5`](https://github.com/apache/spark/commit/dfbccf5c92333da8ab835fc4730aadc844e9f895).
 * This patch merges cleanly.





[GitHub] spark pull request: [Build] Diff from branch point

2014-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2512





[GitHub] spark pull request: logNormalGraph missing partition parameter

2014-09-24 Thread elmalto
GitHub user elmalto opened a pull request:

https://github.com/apache/spark/pull/2523

logNormalGraph missing partition parameter



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/elmalto/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2523


commit 5478e716b9f080b7419285752708f0f4050f23da
Author: elmalto elma...@users.noreply.github.com
Date:   2014-09-24T18:34:45Z

logNormalGraph missing partition parameter







[GitHub] spark pull request: Potential error of message construction of SCC

2014-09-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2507#issuecomment-56717540
  
Hey can you create a JIRA issue for this? Also, can you add [GraphX] to the 
title? Thanks /cc @ankurdave 





[GitHub] spark pull request: [SPARK-3659] Set EC2 version to 1.1.0 and upda...

2014-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2510





[GitHub] spark pull request: logNormalGraph missing partition parameter

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2523#issuecomment-56717691
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2492#discussion_r17991203
  
--- Diff: python/pyspark/context.py ---
@@ -183,10 +183,9 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
 for path in self._conf.get("spark.submit.pyFiles", "").split(","):
 if path != "":
 (dirname, filename) = os.path.split(path)
-self._python_includes.append(filename)
-sys.path.append(path)
-if dirname not in sys.path:
-sys.path.append(dirname)
+if filename.lower().endswith("zip") or filename.lower().endswith("egg"):
--- End diff --

I think that `spark.submit.pyFiles` is allowed to contain `.py` files, too:

```
  --py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
  on the PYTHONPATH for Python apps.
```

Will this new filtering by `.zip` and `.egg` prevent this from working?
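
To make the concern concrete, here is a small sketch of a filter that keeps `.py` entries alongside archives. `split_py_files` is a hypothetical helper, not code from the PR, and the treatment of `.py` files is an assumption about what the option text implies.

```python
import os


def split_py_files(py_files):
    """Split a comma-separated pyFiles string into (archives, modules).

    Hypothetical helper: archives are .zip/.egg entries, modules are
    plain .py files; anything else is ignored.
    """
    archives, modules = [], []
    for path in py_files.split(","):
        if path == "":
            continue
        filename = os.path.basename(path).lower()
        if filename.endswith((".zip", ".egg")):
            archives.append(path)
        elif filename.endswith(".py"):
            modules.append(path)
    return archives, modules


archives, modules = split_py_files("deps.zip,lib.EGG,,helper.py,notes.txt")
```

A check that only matches `zip` and `egg` would leave `helper.py` out of the includes, which is the case the question above is about.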




