[GitHub] spark pull request: [SPARK-1773] Update outdated docs for spark-su...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/701#issuecomment-42729688 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14866/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763089 Merged build triggered.
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/646#discussion_r12507598 --- Diff: conf/spark-env.sh.template --- @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DRIVER_MEMORY, Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb) --- End diff -- If so,
```
if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = client ]; then
  export SPARK_MEM=$DRIVER_MEMORY
fi
```
is not correct
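witgo's objection can be reproduced in isolation. A minimal sketch with hypothetical values (`2g` is arbitrary): leaving `DEPLOY_MODE` unset, even though client is the effective default mode, means the export never happens.

```shell
# Hypothetical repro of the bug: the user relies on the implicit "client"
# default instead of passing --deploy-mode explicitly.
unset SPARK_MEM DEPLOY_MODE
DRIVER_MEMORY=2g

# The check quoted above requires DEPLOY_MODE to be set explicitly,
# so the export is skipped even though client mode is in effect.
if [ -n "$DRIVER_MEMORY" ] && [ -n "$DEPLOY_MODE" ] && [ "$DEPLOY_MODE" = client ]; then
  export SPARK_MEM="$DRIVER_MEMORY"
fi

echo "SPARK_MEM=${SPARK_MEM:-unset}"   # prints: SPARK_MEM=unset
```

The driver thus silently falls back to its default heap size whenever the user omits `--deploy-mode`.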
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
GitHub user pwendell opened a pull request: https://github.com/apache/spark/pull/730 SPARK-1652: Set driver memory correctly in spark-submit. The previous check didn't account for the fact that the default deploy mode is client unless otherwise specified. Also, this sets the more narrowly defined SPARK_DRIVER_MEM instead of setting SPARK_MEM. You can merge this pull request into a Git repository by running: $ git pull https://github.com/pwendell/spark spark-submit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/730.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #730 commit f5081465658ee9035a498ac1fb8a532b89e895cc Author: Patrick Wendell pwend...@gmail.com Date: 2014-05-11T06:06:12Z SPARK-1652: Set driver memory correctly in spark-submit. The previous check didn't account for the fact that the default deploy mode is client unless otherwise specified. Also, this sets the more narrowly defined SPARK_DRIVER_MEM instead of setting SPARK_MEM.
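A sketch of the fix as described in the PR text, not the actual spark-submit code. Variable names are assumptions (the PR says SPARK_DRIVER_MEM, while the spark-env template spells it SPARK_DRIVER_MEMORY, so the name below is illustrative): default the deploy mode to client before testing it, and export the narrower driver-memory variable rather than SPARK_MEM.

```shell
# Illustrative sketch of the described fix; names and values are assumed.
unset DRIVER_MEMORY DEPLOY_MODE SPARK_DRIVER_MEMORY
DRIVER_MEMORY="${DRIVER_MEMORY:-2g}"   # demo value; normally parsed from --driver-memory
DEPLOY_MODE="${DEPLOY_MODE:-client}"   # the fix: client is the default unless specified

if [ -n "$DRIVER_MEMORY" ] && [ "$DEPLOY_MODE" = client ]; then
  # Narrower than SPARK_MEM: this affects only the driver JVM's heap.
  export SPARK_DRIVER_MEMORY="$DRIVER_MEMORY"
fi
echo "SPARK_DRIVER_MEMORY=$SPARK_DRIVER_MEMORY"   # prints: SPARK_DRIVER_MEMORY=2g
```

Defaulting the variable before the test is what makes the check pass when the user never passes `--deploy-mode` at all.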
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/646#discussion_r12507600 --- Diff: conf/spark-env.sh.template --- @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DRIVER_MEMORY, Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb) --- End diff -- Yes - does this work for you?: https://github.com/apache/spark/pull/730/files
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763096 Merged build started.
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763144 This was reported by @witgo.
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo closed the pull request at: https://github.com/apache/spark/pull/646
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/720#issuecomment-42756947 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14876/
[GitHub] spark pull request: SPARK-1759
GitHub user pjopensource opened a pull request: https://github.com/apache/spark/pull/693 SPARK-1759 You can merge this pull request into a Git repository by running: $ git pull https://github.com/pjopensource/spark branch-0.9 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/693.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #693 commit a017d2ee597473d483350b5a3d57713fae392522 Author: Jason Penn pjopensou...@gmail.com Date: 2014-05-08T08:48:09Z modified: sbt/sbt
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/685#issuecomment-42501713 Merged build started.
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-42623759 Thanks, merged.
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/646#discussion_r12507619 --- Diff: conf/spark-env.sh.template --- @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DRIVER_MEMORY, Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb) --- End diff -- Yes, it works for me. `./bin/spark-shell --driver-memory 2g` results in:
```
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -cp ::/Users/witgo/work/code/java/spark/dist/conf:/Users/witgo/work/code/java/spark/dist/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop0.23.9.jar -Djava.library.path= -Xms2g -Xmx2g org.apache.spark.deploy.SparkSubmit spark-internal --driver-memory 2g --class org.apache.spark.repl.Main
```
But in `--driver-memory 2g --class org.apache.spark.repl.Main`, the `--driver-memory 2g` is unnecessary
[GitHub] spark pull request: [SQL] Improve column pruning.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/729#issuecomment-42763487 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [SQL] Improve column pruning.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/729#issuecomment-42763488 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14883/
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42763506 Jenkins, test this please.
[GitHub] spark pull request: SPARK-1686: keep schedule() calling in the mai...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/639#discussion_r12490657 --- Diff: core/src/main/scala/org/apache/spark/deploy/master/MasterMessages.scala --- @@ -39,4 +39,6 @@ private[master] object MasterMessages { case object RequestWebUIPort case class WebUIPortResponse(webUIBoundPort: Int) + + case object TriggerSchedule --- End diff -- I think this is no longer used.
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42763556 Merged build started.
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763555 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14884/
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42763553 Merged build triggered.
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/685#discussion_r12503350 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/NumericParser.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} + +private[mllib] object NumericTokenizer { + val NUMBER = -1 + val END = -2 +} + +import NumericTokenizer._ + +/** + * Simple tokenizer for a numeric structure consisting of three types: + * + * - number: a double in Java's floating number format + * - array: an array of numbers stored as `[v0,v1,...,vn]` + * - tuple: a list of numbers, arrays, or tuples stored as `(...)` + * + * @param s input string + * @param start start index + * @param end end index + */ +private[mllib] class NumericTokenizer(s: String, start: Int, end: Int) { --- End diff -- Tiny comment - `StringTokenizer` is considered just about deprecated (http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html). Guava's `Splitter` does a better job and can return delimiters, and is already available in the project.
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/685#discussion_r12502576 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/NumericParser.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} + +private[mllib] object NumericTokenizer { + val NUMBER = -1 + val END = -2 +} + +import NumericTokenizer._ + +/** + * Simple tokenizer for a numeric structure consisting of three types: + * + * - number: a double in Java's floating number format + * - array: an array of numbers stored as `[v0,v1,...,vn]` + * - tuple: a list of numbers, arrays, or tuples stored as `(...)` + * + * @param s input string + * @param start start index + * @param end end index + */ +private[mllib] class NumericTokenizer(s: String, start: Int, end: Int) { --- End diff -- Much of the logic here could perhaps be replaced by `new StringTokenizer(s.substring(start, end), "[](),", true)`. StringTokenizer can be configured to return the tokens (namely '[', ',', etc) so it should be enough for parsing these data types.
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763554 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42763346 /cc @scrapcodes who worked on the IndestructibleActorSystem
[GitHub] spark pull request: SPARK-1795 - Add recursive directory file sear...
Github user patrickotoole commented on a diff in the pull request: https://github.com/apache/spark/pull/537#discussion_r12507593 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala --- @@ -327,18 +327,18 @@ class StreamingContext private[streaming] ( * @param directory HDFS directory to monitor for new file * @param filter Function to filter paths to process * @param newFilesOnly Should process only new files and ignore existing files in the directory + * @param recursive Should search through the directory recursively to find new files * @tparam K Key type for reading HDFS file * @tparam V Value type for reading HDFS file * @tparam F Input format for reading HDFS file */ def fileStream[ -K: ClassTag, -V: ClassTag, -F <: NewInputFormat[K, V]: ClassTag - ] (directory: String, filter: Path => Boolean, newFilesOnly: Boolean): InputDStream[(K, V)] = { -new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly) - } - + K: ClassTag, + V: ClassTag, + F <: NewInputFormat[K, V]: ClassTag +] (directory: String, filter: Path => Boolean, newFilesOnly: Boolean, recursive: Boolean): DStream[(K, V)] = { + new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly, recursive) + } --- End diff -- I have included a default value on the FileInputDStream but not on the API itself. Wondering if we want to introduce default values to the more granular version of the API. Currently, it looks like the exposed API essentially has two versions for these methods -- one that assumes default values and one that exposes all the parameters of the DStream constructor. Thoughts?
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/646#issuecomment-42763279 A better solution: [PR 730](https://github.com/apache/spark/pull/730)
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42764076 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14885/
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42764075 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/636#discussion_r12507640 --- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala --- @@ -28,6 +28,7 @@ private[spark] class ApplicationDescription( extends Serializable { val user = System.getProperty("user.name", "unknown") - + // only valid when spark.executor.multiPerWorker is set to true + var maxCorePerExecutor = maxCores --- End diff -- Yes, but in ApplicationInfo, app.coresLeft is equal to app.maxCores. So in schedule, when the sum of all workers' free cores is greater than app.coresLeft, leftCoreToAssign is actually equal to maxCorePerExecutor.
[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...
GitHub user ajtulloch opened a pull request: https://github.com/apache/spark/pull/725 SPARK-1791 - SVM implementation does not use threshold parameter Summary: https://issues.apache.org/jira/browse/SPARK-1791 Simple fix, and backward compatible, since - anyone who set the threshold was getting completely wrong answers. - anyone who did not set the threshold had the default 0.0 value for the threshold anyway. Test Plan: Unit test added that is verified to fail under the old implementation, and pass under the new implementation. Reviewers: CC: You can merge this pull request into a Git repository by running: $ git pull https://github.com/ajtulloch/spark SPARK-1791-SVM Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/725.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #725 commit 6f7075a75927b08c3c1211642632e03c2246f5cd Author: Andrew Tulloch and...@tullo.ch Date: 2014-05-10T23:22:02Z SPARK-1791 - SVM implementation does not use threshold parameter Summary: https://issues.apache.org/jira/browse/SPARK-1791 Simple fix, and backward compatible, since - anyone who set the threshold was getting completely wrong answers. - anyone who did not set the threshold had the default 0.0 value for the threshold anyway. Test Plan: Unit test added that is verified to fail under the old implementation, and pass under the new implementation. Reviewers: CC:
[GitHub] spark pull request: [SPARK-1774] Respect SparkSubmit --jars on YAR...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/710
[GitHub] spark pull request: Enabled incremental build that comes with sbt ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/525
[GitHub] spark pull request: [SQL] Upgrade parquet library.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/684#issuecomment-42750659 Thanks for taking a look, Michael - I merged this.
[GitHub] spark pull request: [SPARK-1773] Update outdated docs for spark-su...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/701#discussion_r12505284 --- Diff: docs/spark-standalone.md --- @@ -150,20 +162,30 @@ You can also pass an option `--cores numCores` to control the number of cores Spark supports two deploy modes. Spark applications may run with the driver inside the client process or entirely inside the cluster. -The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For info on the lower-level invocations used to launch an app inside the cluster, read ahead. +The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For more detail, see the [cluster mode overview](cluster-overview.html).
```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... // other options
  <application-jar> \
  [application-arguments]
```
-## Launching Applications Inside the Cluster +<main-class>: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) +<master-url>: The URL of the master node (e.g. spark://23.195.26.187:7077) +<deploy-mode>: Whether to deploy this application within the cluster or from an external client (e.g. client) +<application-jar>: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. +<application-arguments>: Arguments passed to the main method of <main-class> + +Behind the scenes, this invokes the standalone Client to launch your application, which is also the legacy way to launch your application before Spark 1.0. --- End diff -- Since this is deprecated we shouldn't document it. Otherwise new users of Spark will use this.
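For concreteness, here is the template from the diff above filled in with the doc's own example values. This is an illustrative invocation only; the jar path and argument are hypothetical.

```shell
# Hypothetical spark-submit invocation using the example values from the
# documentation diff (SparkPi class, standalone master URL, client mode).
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://23.195.26.187:7077 \
  --deploy-mode client \
  hdfs://path/to/spark-examples.jar \
  100
```

Note that the jar URL must be visible from every node in the cluster, which is why the doc recommends an `hdfs://` path or a `file://` path present on all nodes.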
[GitHub] spark pull request: The org.datanucleus:* should not be packaged i...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/688 The org.datanucleus:* should not be packaged into spark-assembly-*.jar You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/spark SPARK-1644 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/688.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #688 commit a5974140300f4fc0ab2dee72a9807bc37daff0a1 Author: witgo wi...@qq.com Date: 2014-05-08T01:59:18Z The org.datanucleus:* should not be packaged into spark-assembly-*.jar
[GitHub] spark pull request: SPARK-1569 Spark on Yarn, authentication broke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/649#issuecomment-42427401 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/706#issuecomment-42697568 https://issues.apache.org/jira/browse/SPARK-1776
[GitHub] spark pull request: L-BFGS Documentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12460491 --- Diff: docs/mllib-optimization.md --- @@ -163,3 +177,100 @@ each iteration, to compute the gradient direction. Available algorithms for gradient descent: * [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) + +### Limited-memory BFGS +L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various +ML algorithms such as Linear Regression and Logistic Regression, you have to pass the gradient of the objective +function and an updater into the optimizer yourself instead of using training APIs like +[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression). +See the example below. This will be addressed in the next release. + +L1 regularization via +[Updater.L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.Updater) will not work, since the +soft-thresholding logic in L1Updater is designed for gradient descent. + +The L-BFGS method +[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) +has the following parameters: + +* `gradient` is a class that computes the gradient of the objective function +being optimized, i.e., with respect to a single training example, at the +current parameter value. MLlib includes gradient classes for common loss +functions, e.g., hinge, logistic, least-squares. The gradient class takes as +input a training example, its label, and the current parameter value. +* `updater` is a class originally designed for gradient descent which computes +the actual gradient descent step. However, we're able to obtain the gradient and +loss of the regularized objective for L-BFGS by ignoring the logic that applies +only to gradient descent, such as the adaptive step size. We will refactor +this into a regularizer, replacing the updater, to separate the logic between +regularization and the step update later. +* `numCorrections` is the number of corrections used in the L-BFGS update. 10 is +recommended. +* `maxNumIterations` is the maximum number of iterations that L-BFGS can run. +* `regParam` is the regularization parameter when using regularization. +* `return` A tuple containing two elements. The first element is a column matrix +containing weights for every feature, and the second element is an array containing +the loss computed for every iteration. + +Here is an example that trains binary logistic regression with L2 regularization using +the L-BFGS optimizer. {% highlight scala %} +import org.apache.spark.SparkContext +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.classification.LogisticRegressionModel +import breeze.linalg.{DenseVector => BDV} --- End diff -- We don't need `breeze` for the example. See my next comment.
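The soft-thresholding operation that ties L1Updater to gradient descent is small enough to sketch directly. Below is an illustrative Python version of the operator only (MLlib's actual L1Updater is Scala and couples this with the adaptive step size; the function name here is ours, not MLlib's):

```python
import math

def soft_threshold(weights, shrinkage):
    # Proximal step for L1 regularization: sign(w) * max(|w| - shrinkage, 0).
    # Components whose magnitude falls below the shrinkage are zeroed out,
    # which is what makes this specific to gradient-descent-style updates.
    return [math.copysign(max(abs(w) - shrinkage, 0.0), w) for w in weights]

print(soft_threshold([1.5, -0.2, 0.05], 0.1))
```

Small weights are driven exactly to zero, producing the sparsity L1 regularization is used for.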
[GitHub] spark pull request: [Docs] Update YARN and standalone docs
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/701#issuecomment-42713025 Merged build triggered.
[GitHub] spark pull request: SPARK-1789. Multiple versions of Netty depende...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/723#discussion_r12507895 --- Diff: core/src/test/scala/org/apache/spark/LocalSparkContext.scala --- @@ -17,8 +17,7 @@ package org.apache.spark -import org.jboss.netty.logging.InternalLoggerFactory -import org.jboss.netty.logging.Slf4JLoggerFactory +import _root_.io.netty.util.internal.logging.{Slf4JLoggerFactory, InternalLoggerFactory} --- End diff -- Yes you're right. I had thought it was conflicting with `scala.io` somehow but according to my IDE it's the Spark subpackage. (The abundance of import syntax in Scala is not my favorite thing.)
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/685#discussion_r12502581 --- Diff: python/pyspark/mllib/linalg.py --- @@ -234,6 +233,45 @@ def dense(elements): return array(elements, dtype=float64) +@staticmethod +def parse(s): + """Parses a string resulting from Vectors.stringify() into a vector. + + >>> Vectors.parse("[0.0,1.0]") + array([ 0., 1.]) + >>> print Vectors.parse("(2,[1],[1.0])") + (2,[1],[1.0]) + """ + return Vectors._parse_structured(eval(s)) --- End diff -- We can't use eval here; it will be both slow and a security risk. If it's a lot of work to create a fast parser in Python (which it might be), maybe we should instead do JSON, with just tuples.
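The standard safe alternative to `eval` for this kind of literal input is `ast.literal_eval`, which accepts only Python literals and cannot execute code (it does not address the speed concern, only the security one). A sketch; the `parse_vector` helper and its return format are illustrative, not PySpark's API:

```python
import ast

def parse_vector(s):
    # ast.literal_eval parses Python literal syntax only, so a malicious
    # string like "__import__('os').system('rm -rf /')" raises ValueError
    # instead of executing.
    obj = ast.literal_eval(s)
    if isinstance(obj, tuple):
        # sparse form: (size, indices, values)
        size, indices, values = obj
        return ("sparse", size, list(indices), list(values))
    # dense form: [v0, v1, ...]
    return ("dense", list(obj))

print(parse_vector("[0.0,1.0]"))
print(parse_vector("(2,[1],[1.0])"))
```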
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42716337 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14854/
[GitHub] spark pull request: SPARK-1565, update examples to be used with sp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/552#issuecomment-42520748 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1789. Multiple versions of Netty depende...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/723
[GitHub] spark pull request: Fixed streaming examples docs to use run-examp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/722#issuecomment-42738393 Merged build triggered.
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/720#issuecomment-42756921 Merged build started.
[GitHub] spark pull request: SPARK-1565 (Addendum): Replace `run-example` w...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/704#discussion_r12462266 --- Diff: bin/run-example --- @@ -49,46 +31,31 @@ fi if [[ -z "$SPARK_EXAMPLES_JAR" ]]; then echo "Failed to find Spark examples assembly in $FWDIR/lib or $FWDIR/examples/target" >&2 - echo "You need to build Spark with sbt/sbt assembly before running this program" >&2 + echo "You need to build Spark before running this program" >&2 exit 1 fi +SPARK_EXAMPLES_JAR_REL=${SPARK_EXAMPLES_JAR#$FWDIR/} -# Since the examples JAR ideally shouldn't include spark-core (that dependency should be -# provided), also add our standard Spark classpath, built using compute-classpath.sh. -CLASSPATH=`$FWDIR/bin/compute-classpath.sh` -CLASSPATH="$SPARK_EXAMPLES_JAR:$CLASSPATH" -if $cygwin; then -CLASSPATH=`cygpath -wp "$CLASSPATH"` -export SPARK_EXAMPLES_JAR=`cygpath -w "$SPARK_EXAMPLES_JAR"` -fi -# Find java binary -if [ -n "${JAVA_HOME}" ]; then - RUNNER="${JAVA_HOME}/bin/java" -else - if [ `command -v java` ]; then -RUNNER="java" - else -echo "JAVA_HOME is not set" >&2 -exit 1 - fi -fi +EXAMPLE_CLASS="<example-class>" +EXAMPLE_ARGS="[example args]" +EXAMPLE_MASTER=${MASTER:-"<master>"} -# Set JAVA_OPTS to be able to load native libraries and to set heap size -JAVA_OPTS="$SPARK_JAVA_OPTS" -# Load extra JAVA_OPTS from conf/java-opts, if it exists -if [ -e "$FWDIR/conf/java-opts" ] ; then - JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`" +if [ -n "$1" ]; then + EXAMPLE_CLASS="$1" + shift fi -export JAVA_OPTS if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then - echo -n "Spark Command: " - echo "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@" - echo - echo +if [ -n "$1" ]; then + EXAMPLE_ARGS="$@" fi -exec "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@" +echo "NOTE: This script has been replaced with ./bin/spark-submit. Please run:" >&2 +echo +echo " ./bin/spark-submit \\" >&2 --- End diff -- I thought about this some more, I think maybe we should just call spark-submit with the supplied master instead of telling the user this stuff. Or we could call spark-submit and then print out to the user how to run this.
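Delegating as pwendell suggests is mostly a matter of assembling the spark-submit argument vector from what run-example was given. A hypothetical sketch of that mapping (the class prefix and jar name are placeholders, not the actual script's values):

```python
def build_spark_submit_argv(example_class, master, example_args):
    # Instead of printing migration instructions, run-example could exec
    # spark-submit directly with the master it was supplied.
    return (["./bin/spark-submit",
             "--master", master,
             "--class", "org.apache.spark.examples." + example_class,
             "examples-assembly.jar"]   # placeholder jar path
            + list(example_args))

argv = build_spark_submit_argv("SparkPi", "local[2]", ["10"])
print(" ".join(argv))
```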
[GitHub] spark pull request: [SPARK-1688] Propagate PySpark worker stderr t...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/603#discussion_r12388303 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala --- @@ -208,6 +170,19 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String] } } + /** + * Redirect a worker's stdout and stderr to our stderr. + */ + private def redirectWorkerStreams(stdout: InputStream, stderr: InputStream) { +try { + new RedirectThread(stdout, System.err, "stdout reader for " + pythonExec).start() + new RedirectThread(stderr, System.err, "stderr reader for " + pythonExec).start() +} catch { + case e: Throwable => --- End diff -- It could. The reason for paranoia is because I don't want both the stderr RedirectThread and the catch block in `startDaemon` to contend for the error stream. But sure, it doesn't have to be a throwable.
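The RedirectThread pattern discussed above, a background thread that pumps a child process's stream into our own stderr, looks roughly like this in Python (names and buffer size are illustrative, not Spark's Scala implementation):

```python
import io
import threading

def redirect_stream(source, sink):
    # Copy everything from `source` to `sink` on a daemon thread, the way a
    # stream-redirect thread forwards a worker's stdout/stderr. Returns the
    # thread so callers can join() it if they want to wait for EOF.
    def pump():
        for chunk in iter(lambda: source.read(1024), b""):
            sink.write(chunk)
    t = threading.Thread(target=pump, daemon=True)
    t.start()
    return t

# Demonstrate with in-memory streams standing in for the child's pipes.
src = io.BytesIO(b"worker stderr line\n")
dst = io.BytesIO()
redirect_stream(src, dst).join()
print(dst.getvalue())
```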
[GitHub] spark pull request: Include the sbin/spark-config.sh in spark-exec...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/651#issuecomment-42608067 Merged build triggered.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/636#issuecomment-42504239 Merged build started.
[GitHub] spark pull request: SPARK-1668: Add implicit preference as an opti...
Github user techaddict commented on a diff in the pull request: https://github.com/apache/spark/pull/597#discussion_r12389144 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -121,11 +157,14 @@ object MovieLensALS { } /** Compute RMSE (Root Mean Squared Error). */ - def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long) = { + def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = { + +def evalRating(r: Double) = --- End diff -- ya better.
[GitHub] spark pull request: Typo fix: fetchting - fetching
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/680#issuecomment-42485074 Thanks. Merged.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/706#discussion_r12508408 --- Diff: project/SparkBuild.scala --- @@ -297,7 +273,7 @@ object SparkBuild extends Build { val chillVersion = "0.3.6" val codahaleMetricsVersion = "3.0.0" val jblasVersion = "1.2.3" - val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1" + val jets3tVersion = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0") --- End diff -- I myself learnt it not so long ago and noticed it spurred some discussions in the Scala community (with [Martin Odersky himself](http://stackoverflow.com/a/5332657/1305344)). [Functional Programming in Scala](http://www.manning.com/bjarnason/) reads on page 69 (in the pdf version): *It's fine to use pattern matching, though you should be able to implement all the functions besides map and getOrElse without resorting to pattern matching.* So I followed the advice and applied Option.map(...).getOrElse(...) quite intensively, but...IntelliJ IDEA, just before I committed the change, suggested replacing it with fold. I thought I'd need to change my habits once more and sent the PR. With all that said, it's not clear what the most idiomatic approach is, but [pattern matching is in my opinion a step back from map/getOrElse](https://github.com/scala/scala/blob/2.12.x/src/library/scala/Option.scala#L156) and there's no need for it. I'd appreciate being corrected and could even replace `Option.fold` with `map`/`getOrElse` if that would make the PR better for more committers.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/706#issuecomment-42768086 This is what Patrick was trying to say in https://issues.apache.org/jira/browse/SPARK-1776. So while we have this on our minds, even if we merge this patch it will eventually have to be replaced entirely. We are still experimenting, so let's see if we all agree on this. OTOH, we should retain the comments in case we decide to merge this patch. Jenkins, test this please.
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/636#discussion_r12508764 --- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala --- @@ -28,6 +28,7 @@ private[spark] class ApplicationDescription( extends Serializable { val user = System.getProperty(user.name, unknown) - + // only valid when spark.executor.multiPerWorker is set to true + var maxCorePerExecutor = maxCores --- End diff -- https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/ApplicationInfo.scala#L84, coreLeft implementation, https://github.com/CodingCat/spark/blob/SPARK-1706/core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala#L63,
[GitHub] spark pull request: [SPARK-1690] Tolerating empty elements when sa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/644#issuecomment-42752706 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-1688] Propagate PySpark worker stderr t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/603#issuecomment-42469138 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14780/
[GitHub] spark pull request: [SQL] Fix Performance Issue in data type casti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/679
[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/686#issuecomment-42494200 @CodingCat @kayousterhout @rxin
[GitHub] spark pull request: SPARK-1668: Add implicit preference as an opti...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/597#discussion_r12407485 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -121,11 +157,23 @@ object MovieLensALS { } /** Compute RMSE (Root Mean Squared Error). */ - def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long) = { + def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = { + +def mapPredictedRating(r: Double) = + if (!implicitPrefs) { --- End diff -- Can we change it to the following: ~~~ if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r ~~~
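The suggested one-liner clamps implicit-preference predictions (which are confidence-like scores) to [0, 1] before computing RMSE, while passing explicit ratings through unchanged. A direct Python transliteration of that suggestion:

```python
def map_predicted_rating(r, implicit_prefs):
    # With implicit preferences, ALS predicts a confidence-like score,
    # so clamp it to [0, 1]; with explicit ratings, return it as-is.
    return max(min(r, 1.0), 0.0) if implicit_prefs else r

print(map_predicted_rating(1.7, True))    # clamped from above
print(map_predicted_rating(-0.3, True))   # clamped from below
print(map_predicted_rating(1.7, False))   # unchanged
```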
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/497#issuecomment-42489547 Merged build finished.
[GitHub] spark pull request: [Docs] Update YARN docs
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/701#discussion_r12456854 --- Diff: docs/running-on-yarn.md --- @@ -68,11 +69,12 @@ In yarn-cluster mode, the driver runs on a different machine than the client, so $ ./bin/spark-submit --class my.main.Class \ --master yarn-cluster \ +--deploy-mode cluster \ --jars my-other-jar.jar,my-other-other-jar.jar my-main-jar.jar -yarn-cluster 5 +[app arguments] --- End diff -- Same as above, --master should probably just be yarn. And to be concrete like the other parts of the example, maybe use apparg1 apparg2 instead of [app arguments]?
[GitHub] spark pull request: Bug fix of sparse vector conversion
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/661
[GitHub] spark pull request: [SPARK-1631] Correctly set the Yarn app name w...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/539#issuecomment-42512812 I dug through the code a bunch. For those following this PR, here's a quick summary. Looks like the way we set the Spark app name on YARN is actually quite convoluted because there are two types of app names: 1. The one observed by the RM 2. The one observed by SparkContext This PR is mainly concerned with (1), and only in YARN client mode. Before the changes, there are two ways of affecting (1), in order of decreasing priority: (a) set `spark.app.name` in spark-defaults (b) set `SPARK_YARN_APP_NAME` in spark-env After the changes, this becomes (a) set `SPARK_YARN_APP_NAME` in spark-env (b) set `spark.app.name` in SparkConf (through `conf.setAppName`) (c) set `spark.app.name` in spark-defaults This guarantees that if an environment variable is explicitly specified, then the RM should use that name. Otherwise, use whatever SparkContext is using for its app name, in which case (1) == (2). To me, the semantics are pretty clear. Notice, however, that both before and after, there is no way for SparkSubmit's --name parameter to affect either (1) or (2). This is a separate bug with SparkSubmit that I have summarized in [SPARK-1755](https://issues.apache.org/jira/browse/SPARK-1755).
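The post-change precedence described above is a simple first-non-None-wins chain. An illustrative Python sketch of that resolution order (the function, its parameters, and the fallback are hypothetical, not Spark's actual code):

```python
def resolve_app_name(env_name=None, conf_name=None, defaults_name=None):
    # Priority order after the change: SPARK_YARN_APP_NAME from spark-env,
    # then spark.app.name set on SparkConf, then spark.app.name from
    # spark-defaults. First source that is set wins.
    for candidate in (env_name, conf_name, defaults_name):
        if candidate is not None:
            return candidate
    return "Spark"  # hypothetical fallback, not from the PR

print(resolve_app_name(conf_name="MyApp", defaults_name="DefaultApp"))
```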
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/497#issuecomment-42483855 Merged build started.
[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/686#issuecomment-42494319 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14790/
[GitHub] spark pull request: SPARK-1565 (Addendum): Replace `run-example` w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/704#issuecomment-42633729 Merged build triggered.
[GitHub] spark pull request: SPARK-1588 Re-add support for SPARK_YARN_USER_...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/514#issuecomment-42525085 Closing this assuming it has been resolved through other PR's. Please let me know in case this is not the case.
[GitHub] spark pull request: Include the sbin/spark-config.sh in spark-exec...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/651#discussion_r12450552 --- Diff: sbin/spark-executor --- @@ -19,5 +19,10 @@ FWDIR=$(cd `dirname $0`/..; pwd) +sbin=`dirname $0` +sbin=`cd $sbin; pwd` + +. $sbin/spark-config.sh --- End diff -- Hey actually - instead of doing this, do you mind just inlining the setting up of the PYTHONPATH into this file? `spark-config.sh` is intended to configure Spark's standalone daemons, it's not meant to be used at all in mesos mode. I realize the current approach fixes the problem, but it would be cleaner to just have the mesos executor setup $PYTHONPATH directly. ``` export PYTHONPATH=$FWDIR/python:$PYTHONPATH export PYTHONPATH=$FWDIR/python/lib/py4j-0.8.1-src.zip:$PYTHONPATH ```
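The two export lines suggested above just prepend entries to PYTHONPATH, most recent first. A Python sketch of the same prepend logic (the /opt/spark paths are placeholders; the real script uses $FWDIR):

```python
import os

def prepend_pythonpath(entries, environ):
    # Prepend each entry to PYTHONPATH in order, mirroring the two shell
    # export lines: Spark's python dir ends up first, the py4j zip second.
    for entry in reversed(entries):
        existing = environ.get("PYTHONPATH", "")
        environ["PYTHONPATH"] = entry + (os.pathsep + existing if existing else "")
    return environ["PYTHONPATH"]

env = {}
path = prepend_pythonpath(
    ["/opt/spark/python", "/opt/spark/python/lib/py4j-0.8.1-src.zip"], env)
print(path)
```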
[GitHub] spark pull request: [Spark-1461] Deferred Expression Evaluation (s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/446#issuecomment-42718954 Build finished. All automated tests passed.
[GitHub] spark pull request: [SPARK-1779] add warning when memoryFraction i...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/714#issuecomment-42770141 Hi @pwendell, thanks. I didn't notice validateSettings in SparkConf, and it's a good place to validate this.
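The check being discussed (warning when a memoryFraction setting is out of range) amounts to a bounds validation at configuration time. An illustrative Python sketch in the spirit of SparkConf.validateSettings; the function name and message are hypothetical, not Spark's actual code:

```python
def validate_memory_fraction(name, value):
    # A memory fraction must be a proportion of the heap, i.e. in (0, 1].
    # Validating once at configuration time catches typos like "6" for "0.6"
    # before they cause confusing runtime behavior.
    if not 0.0 < value <= 1.0:
        raise ValueError("%s should be between 0 and 1 (was %s)" % (name, value))
    return value

print(validate_memory_fraction("spark.storage.memoryFraction", 0.6))
```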
[GitHub] spark pull request: [SPARK-1470] Spark logger moving to use scala-...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/332#issuecomment-42771828 @pwendell @marmbrus @rxin typesafehub/scala-logging#4 has been solved by typesafehub/scala-logging#15. This PR should now be ready to merge into master.
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user CodingCat closed the pull request at: https://github.com/apache/spark/pull/636
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/636#discussion_r12509207

--- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala ---
```
@@ -28,6 +28,7 @@ private[spark] class ApplicationDescription(
   extends Serializable {

   val user = System.getProperty("user.name", "unknown")
-
+  // only valid when spark.executor.multiPerWorker is set to true
+  var maxCorePerExecutor = maxCores
```
--- End diff --

Yes, thanks, I see.
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/732

SPARK-1798. Tests should clean up temp files

Three issues related to temp files that tests generate - these should be touched up for hygiene but are not urgent.

Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of the former.

The `work/` directory is not deleted by `mvn clean`, in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.

Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.

_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-1798

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/732.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #732

commit bdd0f410e17f69c5463dbb3014240e7f552dccda
Author: Sean Owen so...@cloudera.com
Date: 2014-05-06T21:19:07Z
Remove duplicate module dir in log4j.properties output path for tests

commit b21b3566fa51b5375ca809449004a873cb5feefb
Author: Sean Owen so...@cloudera.com
Date: 2014-05-06T21:21:48Z
Remove work/ and checkpoint/ dirs with mvn clean

commit 5af578e5be2450371ec14aa33d1a679f029fb720
Author: Sean Owen so...@cloudera.com
Date: 2014-05-06T21:22:23Z
Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
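The cleanup pattern described in the PR (register temp dirs at creation, `deleteOnExit()` as a safety net, recursive deletion in an `@After` method) could look roughly like the following. This is a hypothetical sketch of the proposed trait, not code from the patch; the names are made up:

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical sketch of the test trait floated above (in the spirit of
// LocalSparkContext): create temp dirs through one method, register them,
// and clean everything up after the test.
trait TempDirCleanup {
  private var tempDirs: List[File] = Nil

  /** Create a temp directory, registering it for cleanup and deleteOnExit(). */
  def createTempDir(prefix: String = "spark-test"): File = {
    val dir = Files.createTempDirectory(prefix).toFile
    dir.deleteOnExit() // safety net if cleanupTempDirs() is never called
    tempDirs = dir :: tempDirs
    dir
  }

  /** Recursively delete a file or directory, like Utils.deleteRecursively. */
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles).getOrElse(Array.empty).foreach(deleteRecursively)
    f.delete()
  }

  /** Call from an @After method to remove everything this test created. */
  def cleanupTempDirs(): Unit = {
    tempDirs.foreach(deleteRecursively)
    tempDirs = Nil
  }
}
```

A subclass would call `createTempDir()` wherever it needs scratch space and `cleanupTempDirs()` once in teardown, instead of each test hand-rolling its own deletion logic.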
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42772900 Merged build started.
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/720#discussion_r12502303

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/SynthBenchmark.scala ---
```
@@ -0,0 +1,136 @@
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.graphx.PartitionStrategy
+import org.apache.spark.graphx.PartitionStrategy.{CanonicalRandomVertexCut, EdgePartition2D, EdgePartition1D, RandomVertexCut}
+import org.apache.spark.{SparkContext, SparkConf}
+import org.apache.spark.graphx.util.GraphGenerators
+import java.io.{PrintWriter, FileOutputStream}
+
+/**
+ * The SynthBenchmark application can be used to run various GraphX algorithms on
+ * synthetic log-normal graphs. The intent of this code is to enable users to
+ * profile the GraphX system without access to large graph datasets.
+ */
+object SynthBenchmark {
```
--- End diff --

Hey @jegonzal - we've generally been trying to put runnable programs like this in the examples package. Would it be possible to do that here? The main benefits are that (i) users understand this is not a stable API and (ii) users can launch it with `run-example` or `spark-submit`, which allows them to easily do things like launch on YARN or ask for a specific number of executors when running the program.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42773948 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42773949 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14886/
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42773573 Merged build started.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42772897 Merged build triggered.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/706#discussion_r12509377

--- Diff: project/SparkBuild.scala ---
```
@@ -297,7 +273,7 @@ object SparkBuild extends Build {
   val chillVersion = "0.3.6"
   val codahaleMetricsVersion = "3.0.0"
   val jblasVersion = "1.2.3"
-  val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1"
+  val jets3tVersion = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0")
```
--- End diff --

We discussed this a while ago, soon after Spark development moved to Scala 2.10. The overwhelming preference was that we use map/getOrElse instead of fold. When working on Spark, consider turning off that particular inspection and warning in the IntelliJ preferences.
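The style point above is about two equivalent ways to collapse an `Option`. A small self-contained illustration of both forms (the version strings match the diff; the input Hadoop versions are made-up examples):

```scala
// Both functions compute the same result; Spark's style guide prefers the
// map/getOrElse form over Option.fold.
def jets3tVersionFold(hadoopVersion: String): String =
  "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0")

def jets3tVersionMap(hadoopVersion: String): String =
  "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).map(_ => "0.9.0").getOrElse("0.7.1")
```

`map`/`getOrElse` reads left-to-right (transform the present case, then supply the default), which is why it was preferred; `fold` puts the default first, which many readers find backwards.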
[GitHub] spark pull request: L-BFGS Documentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12460376

--- Diff: docs/mllib-optimization.md ---
```
@@ -163,3 +177,100 @@ each iteration, to compute the gradient direction.

 Available algorithms for gradient descent:

 * [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+
+### Limited-memory BFGS
+L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
+ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective
+function, and updater into optimizer yourself instead of using the training APIs like
+[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression).
```
--- End diff --

last `LogisticRegression` -> `LogisticRegressionWithSGD`
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/706#discussion_r12489193

--- Diff: project/SparkBuild.scala ---
```
@@ -16,17 +16,18 @@
  */

 import sbt._
-import sbt.Classpaths.publishTask
-import sbt.Keys._
+import Keys._
+import Classpaths.publishTask
 import sbtassembly.Plugin._
 import AssemblyKeys._

-import scala.util.Properties
+import util.Properties
```
--- End diff --

https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/629#issuecomment-42673811 Although there's now a 'fix' in, it is only going to do automatically what you are doing by hand. My hunch is that perhaps the script, or env variable, already contains a classpath with 0.9.0 jars in it, which means that when they're replaced with differently-named jars, suddenly it can't find basic classes. Just a guess. (Not sure what support will do with it given that this is into unsupported territory. You can ask. There's a reasonable question here about understanding the deployment. They'll reach out to the right people as needed.)
[GitHub] spark pull request: Update RoutingTable.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/647#issuecomment-42465768 Merged build triggered.
[GitHub] spark pull request: [SPARK-1755] Respect SparkSubmit --name on YAR...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/699#issuecomment-42630934 I've merged this, thanks.
[GitHub] spark pull request: L-BFGS Documentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12460451

--- Diff: docs/mllib-optimization.md ---
```
@@ -163,3 +177,100 @@ each iteration, to compute the gradient direction.

 Available algorithms for gradient descent:

 * [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+
+### Limited-memory BFGS
+L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
+ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective
+function, and updater into optimizer yourself instead of using the training APIs like
+[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression).
+See the example below. It will be addressed in the next release.
+
+The L1 regularization by using
+[Updater.L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.Updater) will not work since the
+soft-thresholding logic in L1Updater is designed for gradient descent.
+
+The L-BFGS method
+[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
+has the following parameters:
+
+* `gradient` is a class that computes the gradient of the objective function
+being optimized, i.e., with respect to a single training example, at the
+current parameter value. MLlib includes gradient classes for common loss
+functions, e.g., hinge, logistic, least-squares. The gradient class takes as
+input a training example, its label, and the current parameter value.
+* `updater` is a class originally designed for gradient descent which computes
```
--- End diff --

Let's remove this paragraph or move it to a separate `Developer` section. For the guide, we'd better separate user-facing content from developer-facing content.
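The "soft-thresholding logic" the doc refers to is the proximal update that L1 regularization applies after a gradient step. A self-contained sketch of that update (this is an illustration, not MLlib's actual `L1Updater`; the names and signatures here are made up) shows why it is tied to gradient descent:

```scala
// Standalone sketch of L1 soft-thresholding in a proximal gradient step.
// Not MLlib's L1Updater: an illustration of the idea only.
def softThreshold(w: Double, shrinkage: Double): Double =
  math.signum(w) * math.max(0.0, math.abs(w) - shrinkage)

// One L1-regularized gradient-descent update: take the plain gradient step,
// then shrink each weight toward zero by stepSize * regParam. The shrinkage
// amount depends directly on the step size, which is why this logic assumes a
// gradient-descent-style update and does not transfer to L-BFGS, whose step is
// chosen by a line search rather than a fixed stepSize.
def l1Update(weights: Array[Double], gradient: Array[Double],
             stepSize: Double, regParam: Double): Array[Double] = {
  val shrinkage = stepSize * regParam
  weights.zip(gradient).map { case (w, g) => softThreshold(w - stepSize * g, shrinkage) }
}
```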
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user ankurdave commented on a diff in the pull request: https://github.com/apache/spark/pull/497#discussion_r12456034

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/ReplicatedVertexView.scala ---
```
@@ -21,192 +21,102 @@
 import scala.reflect.{classTag, ClassTag}

 import org.apache.spark.SparkContext._
 import org.apache.spark.rdd.RDD
-import org.apache.spark.util.collection.{PrimitiveVector, OpenHashSet}

 import org.apache.spark.graphx._

 /**
- * A view of the vertices after they are shipped to the join sites specified in
- * `vertexPlacement`. The resulting view is co-partitioned with `edges`. If `prevViewOpt` is
- * specified, `updatedVerts` are treated as incremental updates to the previous view. Otherwise, a
- * fresh view is created.
- *
- * The view is always cached (i.e., once it is evaluated, it remains materialized). This avoids
- * constructing it twice if the user calls graph.triplets followed by graph.mapReduceTriplets, for
- * example. However, it means iterative algorithms must manually call `Graph.unpersist` on previous
- * iterations' graphs for best GC performance. See the implementation of
- * [[org.apache.spark.graphx.Pregel]] for an example.
+ * Manages shipping vertex attributes to the edge partitions of an
+ * [[org.apache.spark.graphx.EdgeRDD]]. Vertex attributes may be partially shipped to construct a
+ * triplet view with vertex attributes on only one side, and they may be updated. An active vertex
+ * set may additionally be shipped to the edge partitions. Be careful not to store a reference to
+ * `edges`, since it may be modified when the attribute shipping level is upgraded.
  */
 private[impl]
-class ReplicatedVertexView[VD: ClassTag](
-    updatedVerts: VertexRDD[VD],
-    edges: EdgeRDD[_],
-    routingTable: RoutingTable,
-    prevViewOpt: Option[ReplicatedVertexView[VD]] = None) {
+class ReplicatedVertexView[VD: ClassTag, ED: ClassTag](
+    var edges: EdgeRDD[ED, VD],
+    var hasSrcId: Boolean = false,
+    var hasDstId: Boolean = false) {

   /**
-   * Within each edge partition, create a local map from vid to an index into the attribute
-   * array. Each map contains a superset of the vertices that it will receive, because it stores
-   * vids from both the source and destination of edges. It must always include both source and
-   * destination vids because some operations, such as GraphImpl.mapReduceTriplets, rely on this.
+   * Return a new `ReplicatedVertexView` with the specified `EdgeRDD`, which must have the same
+   * shipping level.
    */
-  private val localVertexIdMap: RDD[(Int, VertexIdToIndexMap)] = prevViewOpt match {
-    case Some(prevView) =>
-      prevView.localVertexIdMap
-    case None =>
-      edges.partitionsRDD.mapPartitions(_.map {
-        case (pid, epart) =>
-          val vidToIndex = new VertexIdToIndexMap
-          epart.foreach { e =>
-            vidToIndex.add(e.srcId)
-            vidToIndex.add(e.dstId)
-          }
-          (pid, vidToIndex)
-      }, preservesPartitioning = true).cache().setName("ReplicatedVertexView localVertexIdMap")
-  }
-
-  private lazy val bothAttrs: RDD[(PartitionID, VertexPartition[VD])] = create(true, true)
-  private lazy val srcAttrOnly: RDD[(PartitionID, VertexPartition[VD])] = create(true, false)
-  private lazy val dstAttrOnly: RDD[(PartitionID, VertexPartition[VD])] = create(false, true)
-  private lazy val noAttrs: RDD[(PartitionID, VertexPartition[VD])] = create(false, false)
-
-  def unpersist(blocking: Boolean = true): ReplicatedVertexView[VD] = {
-    bothAttrs.unpersist(blocking)
-    srcAttrOnly.unpersist(blocking)
-    dstAttrOnly.unpersist(blocking)
-    noAttrs.unpersist(blocking)
-    // Don't unpersist localVertexIdMap because a future ReplicatedVertexView may be using it
-    // without modification
-    this
+  def withEdges[VD2: ClassTag, ED2: ClassTag](
+      edges_ : EdgeRDD[ED2, VD2]): ReplicatedVertexView[VD2, ED2] = {
+    new ReplicatedVertexView(edges_, hasSrcId, hasDstId)
   }

-  def get(includeSrc: Boolean, includeDst: Boolean): RDD[(PartitionID, VertexPartition[VD])] = {
-    (includeSrc, includeDst) match {
-      case (true, true) => bothAttrs
-      case (true, false) => srcAttrOnly
-      case (false, true) => dstAttrOnly
-      case (false, false) => noAttrs
-    }
+  /**
+   * Return a new `ReplicatedVertexView` where edges are reversed and shipping levels are swapped to
+   * match.
+   */
+  def reverse() = {
+    val newEdges = edges.mapEdgePartitions((pid, part) => part.reverse)
+    new ReplicatedVertexView(newEdges, hasDstId, hasSrcId)
   }
```
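The `hasSrcId`/`hasDstId` flags in the diff above track which vertex attributes have already been shipped to the edge partitions. A simplified, self-contained sketch of that bookkeeping (this models the idea only; it is not the GraphX implementation):

```scala
// Simplified model of the shipping-level bookkeeping in ReplicatedVertexView.
// Not the real GraphX code: it just shows which sides still need shipping when
// a triplet view requests src and/or dst attributes.
case class ShippingLevel(hasSrcId: Boolean, hasDstId: Boolean) {
  /** Which sides must be shipped to satisfy a request for (includeSrc, includeDst)? */
  def neededShipments(includeSrc: Boolean, includeDst: Boolean): (Boolean, Boolean) =
    (includeSrc && !hasSrcId, includeDst && !hasDstId)

  /** The level after upgrading to satisfy the request. */
  def upgraded(includeSrc: Boolean, includeDst: Boolean): ShippingLevel =
    ShippingLevel(hasSrcId || includeSrc, hasDstId || includeDst)

  /** Reversing edges swaps the roles of src and dst, as in reverse() above. */
  def reverse: ShippingLevel = ShippingLevel(hasDstId, hasSrcId)
}
```

Upgrades are monotone (attributes are only ever added), which is why callers must not hold a stale reference to `edges`: a later request can mutate the view to a higher shipping level.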
[GitHub] spark pull request: SPARK-1569 Spark on Yarn, authentication broke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/649#issuecomment-42423123 Merged build triggered.
[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...
Github user darose commented on the pull request: https://github.com/apache/spark/pull/629#issuecomment-42635850 I tried downloading the latest from the master branch of both Spark and Shark, and built and ran them (in the hopes of getting past that class-incompatibility issue), but no luck there either. Now it's dying on `Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/JsonAST$JValue`. I'm really at wits' end here - I've spent days trying to get this to run. Is there a matched set of Spark/Shark binaries available somewhere that can run on CDH5 (i.e., that includes this jets3t fix)? Or if not, could someone provide instructions on how I might build them? I'm going to have to fall back to slow-as-molasses Hive if I can't find a way through this.
[GitHub] spark pull request: Use numpy directly for matrix multiply.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/687
[GitHub] spark pull request: [SPARK-1774] Respect SparkSubmit --jars on YAR...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/710#issuecomment-42634045 We may not always want to automatically add the jar to `spark.jar`. I will spend some time looking into cases in which we want to skip this (which likely involves `yarn-cluster` and python files). Do not merge this yet.
[GitHub] spark pull request: [Docs] Update YARN and standalone docs
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/701#issuecomment-42715097 Thanks for the update; the changes look good to me. I did notice one other thing in the YARN docs: we don't say what parameters spark-shell takes. It used to list the environment variables used to configure it, but now it doesn't say how to configure it, and I don't see any other docs that do. I actually went to look at the code to see how it is now done. So perhaps I'll file a separate JIRA for that, unless you know of another PR handling it?
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42773567 Merged build triggered.
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42774629 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42774630 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14887/
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42774784 Merged build triggered.
[GitHub] spark pull request: Add init script to the debian packaging
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/733#issuecomment-42775785 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
GitHub user CodingCat opened a pull request: https://github.com/apache/spark/pull/731

SPARK-1706: Allow multiple executors per worker in Standalone mode

resubmit of SPARK-1706 for a totally different algorithm

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/spark SPARK-1706-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/731.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #731

commit 01308816e4eee4efba40dc8ce374bf0166edc3b9
Author: CodingCat zhunans...@gmail.com
Date: 2014-05-04T18:13:41Z
make master support multiple executors per worker

commit 893974d4c51e0e5de22b4b3f85f5413310a3b86d
Author: CodingCat zhunans...@gmail.com
Date: 2014-05-04T18:17:12Z
memory dominated allocation

commit cdd2556102bface6a80a22fbdb7375ed853fafc3
Author: CodingCat zhunans...@gmail.com
Date: 2014-05-11T14:53:03Z
memory dominated allocation
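The "memory dominated allocation" commits suggest sizing executors on a worker by whichever resource (cores or memory) runs out first. A hypothetical self-contained sketch of that idea (this is not the code from the patch; all names are made up):

```scala
// Hypothetical sketch of memory-dominated executor allocation on one worker:
// the number of executors that fit is bounded by both the core budget and the
// memory budget, so the binding constraint is whichever is exhausted first.
// Not the actual code from the patch.
case class WorkerResources(freeCores: Int, freeMemoryMb: Int)

def executorsThatFit(worker: WorkerResources,
                     coresPerExecutor: Int,
                     memoryPerExecutorMb: Int): Int = {
  val byCores = worker.freeCores / coresPerExecutor
  val byMemory = worker.freeMemoryMb / memoryPerExecutorMb
  math.min(byCores, byMemory)
}
```

For example, a worker with 16 free cores but only 4 GB free memory can host just two 4-core/2 GB executors: memory, not cores, dominates the allocation.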