[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/5096





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91123979
  
:+1: 





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91121381
  
Thanks @andrewor14 @pwendell for the reviews. Now that Jenkins is happy I
am going to merge this in, and I'll file follow-up issues for things like YARN
cluster mode which we didn't get to in this PR.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91121091
  
  [Test build #29919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29919/consoleFull) for PR 5096 at commit [`da64742`](https://github.com/apache/spark/commit/da64742dc1543346623acc420beac209c0c951ce).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91121106
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29919/
Test PASSed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91104742
  
  [Test build #29919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29919/consoleFull) for PR 5096 at commit [`da64742`](https://github.com/apache/spark/commit/da64742dc1543346623acc420beac209c0c951ce).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91088926
  
@pwendell It's around 2 minutes on my laptop. Here is the output on my 
machine:
```
time ./run-tests.sh


./run-tests.sh  1:56.96 total
```





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91084853
  
@shivaram - hey, one thing I forgot to ask: how much time do the SparkR 
tests add to the overall Spark test run?





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91080612
  
  [Test build #29908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29908/consoleFull) for PR 5096 at commit [`bac3a6b`](https://github.com/apache/spark/commit/bac3a6bc05fecca9d7ebb3e544b2edcfdca1c50d).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91070385
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29894/
Test FAILed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91070376
  
  [Test build #29894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29894/consoleFull) for PR 5096 at commit [`59266d1`](https://github.com/apache/spark/commit/59266d14416a614d900447788806f958ab1088f9).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91052419
  
> also, i can't believe how long this build is... sad panda etc.

Test parallelization is going to be a lot of work, but I think we could see 
huge speedups for the pull request builders if we didn't run all tests for 
every PR.  Most PRs touch the higher-level libraries and not core, so it should 
be safe to skip most of the tests if core hasn't been modified.
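
A hypothetical sketch of that idea in Scala (the module names and path prefixes are illustrative assumptions, not Spark's actual build tooling):

```
// Map the files a PR touches to the test modules worth running.
// A change under core/ invalidates everything; otherwise only run
// the modules whose directories were actually touched.
def modulesToTest(changedFiles: Seq[String]): Set[String] = {
  val allModules = Set("core", "sql", "mllib", "streaming", "R")
  if (changedFiles.exists(_.startsWith("core/"))) {
    allModules
  } else {
    allModules.filter(m => changedFiles.exists(_.startsWith(m + "/")))
  }
}
```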





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91042724
  
  [Test build #29894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29894/consoleFull) for PR 5096 at commit [`59266d1`](https://github.com/apache/spark/commit/59266d14416a614d900447788806f958ab1088f9).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shaneknapp
Github user shaneknapp commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91041984
  
also, i can't believe how long this build is...  sad panda etc.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shaneknapp
Github user shaneknapp commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91041894
  
jenkins, test this please





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91041394
  
Thanks @shaneknapp! Could you re-trigger this build once it's upped?





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shaneknapp
Github user shaneknapp commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91040738
  
i'll up it to 180, just so we have some headroom.







[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91040259
  
@brennonyork The overall Jenkins build runner has a timeout of 130 minutes 
right now (cc @shaneknapp). So all the RAT tests, Mima checks, style checks, 
and new dependency checks, plus all the unit tests, have to run within 130 
minutes, and this PR seems to be failing that.

@shaneknapp can we increase the 130 min timeout to, say, 140 minutes?





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91039248
  
SparkSubmit parts LGTM. We should merge this soon so people can start 
testing this well in advance of the release window.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r28013923
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala ---
@@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{DataInputStream, DataOutputStream}
+import java.sql.{Date, Time}
+
+import scala.collection.JavaConversions._
+
+/**
+ * Utility functions to serialize, deserialize objects to / from R
+ */
+private[spark] object SerDe {
+
+  // Type mapping from R to Java
+  //
+  // NULL -> void
+  // integer -> Int
+  // character -> String
+  // logical -> Boolean
+  // double, numeric -> Double
+  // raw -> Array[Byte]
+  // Date -> Date
+  // POSIXlt/POSIXct -> Time
+  //
+  // list[T] -> Array[T], where T is one of above mentioned types
+  // environment -> Map[String, T], where T is a native type
+  // jobj -> Object, where jobj is an object created in the backend
+
+  def readObjectType(dis: DataInputStream): Char = {
+    dis.readByte().toChar
+  }
+
+  def readObject(dis: DataInputStream): Object = {
+    val dataType = readObjectType(dis)
+    readTypedObject(dis, dataType)
+  }
+
+  def readTypedObject(
+      dis: DataInputStream,
+      dataType: Char): Object = {
+    dataType match {
+      case 'n' => null
+      case 'i' => new java.lang.Integer(readInt(dis))
+      case 'd' => new java.lang.Double(readDouble(dis))
+      case 'b' => new java.lang.Boolean(readBoolean(dis))
+      case 'c' => readString(dis)
+      case 'e' => readMap(dis)
+      case 'r' => readBytes(dis)
+      case 'l' => readList(dis)
+      case 'D' => readDate(dis)
+      case 't' => readTime(dis)
+      case 'j' => JVMObjectTracker.getObject(readString(dis))
+      case _ => throw new IllegalArgumentException(s"Invalid type $dataType")
+    }
+  }
+
+  def readBytes(in: DataInputStream): Array[Byte] = {
+    val len = readInt(in)
+    val out = new Array[Byte](len)
+    val bytesRead = in.readFully(out)
+    out
+  }
+
+  def readInt(in: DataInputStream): Int = {
+    in.readInt()
+  }
+
+  def readDouble(in: DataInputStream): Double = {
+    in.readDouble()
+  }
+
+  def readString(in: DataInputStream): String = {
+    val len = in.readInt()
+    val asciiBytes = new Array[Byte](len)
+    in.readFully(asciiBytes)
+    assert(asciiBytes(len - 1) == 0)
+    val str = new String(asciiBytes.dropRight(1).map(_.toChar))
+    str
+  }
+
+  def readBoolean(in: DataInputStream): Boolean = {
+    val intVal = in.readInt()
+    if (intVal == 0) false else true
--- End diff --

can be `intVal != 0`
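
A minimal sketch of the suggested simplification (a hypothetical standalone form, not the PR's code):

```
import java.io.DataInputStream

// An Int read from the stream is true iff it is non-zero; the direct
// comparison replaces `if (intVal == 0) false else true`.
def readBoolean(in: DataInputStream): Boolean = {
  val intVal = in.readInt()
  intVal != 0
}
```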





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r28013485
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -469,6 +469,9 @@ private[spark] class ApplicationMaster(
       System.setProperty("spark.submit.pyFiles",
         PythonRunner.formatPaths(args.pyFiles).mkString(","))
     }
+    if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+      // TODO(davies): add R dependencies here
--- End diff --

That's fine. We can add full support for SparkR in YARN cluster mode later.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread brennonyork
Github user brennonyork commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91028275
  
@shivaram a few things after looking at the build code some more...

1. The timeout value comes from the line [here in `dev/run-tests-jenkins`](https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L50). It's currently set at 120 minutes and **doesn't** include the time it takes for PRs to be tested against the master branch (i.e. for dependencies). We could certainly up that value, but since I'm assuming the `dev/run-tests` script on this PR runs all the new SparkR tests (plus any additional ones you've added for core Spark), I'd ask that you run `dev/run-tests` locally and update the timeout in `dev/run-tests-jenkins` for this PR by whatever additional time is needed. The impetus for running locally first is that I'd much rather get a baseline for what all the new tests take to run and then add 15ish minutes for fluff, rather than throw a number into the wind.
2. Completely agree we should get some timing metrics for the various PR tests (thanks for the idea!). I'll generate a JIRA for that and take a look soon. That said, just to reiterate, those tests **are not** holding up the actual Spark test suite from finishing, unless Jenkins has some deeper timing hooks than I know about. I assume it's merely a factor of the large corpus of tests that were likely added in this PR.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91010961
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29870/
Test FAILed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90977866
  
  [Test build #29870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29870/consoleFull) for PR 5096 at commit [`59266d1`](https://github.com/apache/spark/commit/59266d14416a614d900447788806f958ab1088f9).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90977367
  
Jenkins, retest this please (is the fourth time lucky?)





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90975665
  
@brennonyork Thanks - another thing that might be helpful is to log how 
long it took to run the tests. I am trying to figure out where the 120 minutes 
we have are being spent, and it's tricky to get a breakdown right now.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-08 Thread brennonyork
Github user brennonyork commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90827465
  
We can certainly set the timeout to be something larger. Let me take a look 
at the previous builds and see if I can find a good timeout number and if there 
might be anything else we can do. 
@pwendell any other ideas?





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90806814
  
@brennonyork @pwendell The new dependency checks seem to add around 20 
minutes to a Jenkins run, and this PR, which has SQL and pom.xml changes, has 
timed out thrice now (it didn't even get a chance to run the SparkR tests, so I 
don't think that is the problem). Is there any way we can increase the timeout 
or speed up the new dependency checks?





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90806760
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29828/
Test FAILed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90792602
  
  [Test build #29828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29828/consoleFull) for PR 5096 at commit [`59266d1`](https://github.com/apache/spark/commit/59266d14416a614d900447788806f958ab1088f9).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90792138
  
Jenkins, retest this please





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90768943
  
Okay, LGTM from a packaging perspective. Once @andrewor14 signs off on the 
spark-submit stuff I think this is ready to go.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90768068
  
  [Test build #640 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/640/consoleFull) for PR 5096 at commit [`59266d1`](https://github.com/apache/spark/commit/59266d14416a614d900447788806f958ab1088f9).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90767895
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29814/
Test FAILed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90748077
  
@andrewor14 We should have addressed all your comments; could you take 
another look?

Waiting for Jenkins.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90747911
  
  [Test build #29814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29814/consoleFull) for PR 5096 at commit [`59266d1`](https://github.com/apache/spark/commit/59266d14416a614d900447788806f958ab1088f9).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27926454
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -469,6 +469,9 @@ private[spark] class ApplicationMaster(
       System.setProperty("spark.submit.pyFiles",
         PythonRunner.formatPaths(args.pyFiles).mkString(","))
     }
+    if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+      // TODO(davies): add R dependencies here
--- End diff --

Right now, SparkR does not support shipping other packages as dependencies. 
That may be added in the future.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27926373
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala ---
@@ -92,6 +97,7 @@ class ApplicationMasterArguments(val args: Array[String]) {
       |  --jar JAR_PATH       Path to your application's JAR file
       |  --class CLASS_NAME   Name of your application's main class
       |  --primary-py-file    A main Python file
+      |  --primary-r-file     A main R file
--- End diff --

done





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27926357
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -497,12 +503,15 @@ private[spark] class Client(
     if (args.primaryPyFile != null && args.primaryPyFile.endsWith(".py")) {
       args.userArgs = ArrayBuffer(args.primaryPyFile, args.pyFiles) ++ args.userArgs
     }
+    if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+      args.userArgs = ArrayBuffer(args.primaryRFile) ++ args.userArgs
+    }
     val userArgs = args.userArgs.flatMap { arg =>
       Seq("--arg", YarnSparkHadoopUtil.escapeForShell(arg))
     }
     val amArgs =
-      Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ pyFiles ++ userArgs ++
-        Seq(
+      Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ pyFiles ++ primaryRFile ++
--- End diff --

done





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90408348
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29780/
Test FAILed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90408326
  
  [Test build #29780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29780/consoleFull) for PR 5096 at commit [`f731b48`](https://github.com/apache/spark/commit/f731b48c1fdaff80020d8683252f67f0e24502c1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90362850
  
  [Test build #29780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29780/consoleFull) for PR 5096 at commit [`f731b48`](https://github.com/apache/spark/commit/f731b48c1fdaff80020d8683252f67f0e24502c1).





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90362775
  
@pwendell I tried to reproduce the build on a clean Ubuntu VM without R 
installed, and I actually got a failure with a relevant error message:
```
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.3.2:exec (sparkr-pkg) on project spark-core_2.10: Command execution failed. Process exited with an error: 127 (Exit value: 127) -> [Help 1]
```
And before the Maven summary I also see
```
../R/install-dev.sh: line 36: R: command not found
```

On a related note, I also changed `run-tests` to not run the SparkR tests if R 
is not installed.






[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90356825
  
@andrewor14 I think the test failure is because of a SparkSubmit unit test 
that is broken right now. If you look at the test at 
https://github.com/apache/spark/blob/e40ea8742a8771ecd46b182f45b5fcd8bd6dd725/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L97
we don't pass in a primaryResource or a mainClass and yet expect the 
argument parsing to succeed. The test will exit at 
https://github.com/apache/spark/blob/e40ea8742a8771ecd46b182f45b5fcd8bd6dd725/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L214
but we replace the `exitFn`, so it doesn't actually exit.

Any good ideas on how to change the test case? Should we just add a dummy 
primary resource and main class to it?
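
A minimal sketch of that dummy-resource option (the argument values are hypothetical placeholders, not what the suite must use):

```
// In SparkSubmitSuite: hand the parser a main class and a primary
// resource so argument parsing succeeds instead of reaching the
// error path that is hidden by the replaced exitFn.
val clArgs = Seq(
  "--name", "myApp",
  "--class", "Foo",  // dummy main class
  "thejar.jar")      // dummy primary resource
val appArgs = new SparkSubmitArguments(clArgs)
```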





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90354390
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29768/
Test FAILed.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90354388
  
  [Test build #29768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29768/consoleFull) for PR 5096 at commit [`64eda24`](https://github.com/apache/spark/commit/64eda24017d1d05c3dfba3a84c96b557cbd2ca76).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90322086
  
Thanks @pwendell @andrewor14 @JoshRosen for your comments. I am still 
investigating the return code problem for machines without R, but other than 
that most of the comments have been fixed.

It would be great if you could take another pass!





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847946
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -497,12 +503,15 @@ private[spark] class Client(
     if (args.primaryPyFile != null && args.primaryPyFile.endsWith(".py")) {
       args.userArgs = ArrayBuffer(args.primaryPyFile, args.pyFiles) ++ args.userArgs
     }
+    if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+      args.userArgs = ArrayBuffer(args.primaryRFile) ++ args.userArgs
+    }
     val userArgs = args.userArgs.flatMap { arg =>
       Seq("--arg", YarnSparkHadoopUtil.escapeForShell(arg))
     }
     val amArgs =
-      Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ pyFiles ++ userArgs ++
-        Seq(
+      Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ pyFiles ++ primaryRFile ++
--- End diff --

They should be mutually exclusive. So if SparkSubmit is the only class used 
to create these arguments, we have an `else if` block [1] that prevents both 
primaryPyFile and primaryRFile from being set. I could also add an 
exclusivity check somewhere -- which file would be the best place to do this?

[1] https://github.com/amplab-extras/spark/blob/64eda24017d1d05c3dfba3a84c96b557cbd2ca76/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L482
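
A hedged sketch of what such a check could look like (the placement and message are assumptions, not part of this PR):

```
// Hypothetical guard during argument validation: a submission may
// carry a primary Python file or a primary R file, but never both.
require(args.primaryPyFile == null || args.primaryRFile == null,
  "Cannot specify both a primary Python file and a primary R file")
```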





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847871
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRDD.scala ---
@@ -0,0 +1,450 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.ServerSocket
+import java.util.{Map => JMap}
+
+import scala.collection.JavaConversions._
+import scala.io.Source
+import scala.reflect.ClassTag
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.api.java.{JavaPairRDD, JavaRDD, JavaSparkContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private abstract class BaseRRDD[T: ClassTag, U: ClassTag](
+    parent: RDD[T],
+    numPartitions: Int,
+    func: Array[Byte],
+    deserializer: String,
+    serializer: String,
+    packageNames: Array[Byte],
+    rLibDir: String,
+    broadcastVars: Array[Broadcast[Object]])
+  extends RDD[U](parent) with Logging {
+  override def getPartitions = parent.partitions
+
+  override def compute(partition: Partition, context: TaskContext): Iterator[U] = {
+
+    // The parent may be also an RRDD, so we should launch it first.
+    val parentIterator = firstParent[T].iterator(partition, context)
+
+    // we expect two connections
+    val serverSocket = new ServerSocket(0, 2)
+    val listenPort = serverSocket.getLocalPort()
+
+    // The stdout/stderr is shared by multiple tasks, because we use one daemon
+    // to launch child process as worker.
+    val errThread = RRDD.createRWorker(rLibDir, listenPort)
+
+    // We use two sockets to separate input and output, then it's easy to manage
+    // the lifecycle of them to avoid deadlock.
+    // TODO: optimize it to use one socket
+
+    // the socket used to send out the input of task
+    serverSocket.setSoTimeout(1)
+    val inSocket = serverSocket.accept()
+    startStdinThread(inSocket.getOutputStream(), parentIterator, partition.index)
+
+    // the socket used to receive the output of task
+    val outSocket = serverSocket.accept()
+    val inputStream = new BufferedInputStream(outSocket.getInputStream)
+    val dataStream = openDataStream(inputStream)
+    serverSocket.close()
+
+    try {
+
+      return new Iterator[U] {
+        def next(): U = {
+          val obj = _nextObj
+          if (hasNext) {
+            _nextObj = read()
+          }
+          obj
+        }
+
+        var _nextObj = read()
+
+        def hasNext(): Boolean = {
+          val hasMore = (_nextObj != null)
+          if (!hasMore) {
+            dataStream.close()
+          }
+          hasMore
+        }
+      }
+    } catch {
+      case e: Exception =>
+        throw new SparkException("R computation failed with\n " + errThread.getLines())
+    }
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread[T](
+      output: OutputStream,
+      iter: Iterator[T],
+      partition: Int) = {
+
+    val env = SparkEnv.get
+    val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+    val stream = new BufferedOutputStream(output, bufferSize)
+
+    new Thread("writer for R") {
+      override def run() {
+        try {
+          SparkEnv.set(env)
+          val dataOut = new DataOutputStream(stream)
+          dataOut.writeInt(partition)
+
+          SerDe.writeString(dataOut, deserializer)
+          SerDe.writeString(dataOut, serializer)
+
+          dataOut.writeInt(packageNames.length)
+          dataOut.write(packageNames)
+
+          dataOut.writeInt(func.length)
+          dataOut.write(func)
+
+

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847873
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -317,11 +328,32 @@ object SparkSubmit {
       }
     }
 
-    // In yarn-cluster mode for a python app, add primary resource and pyFiles to files
-    // that can be distributed with the job
-    if (args.isPython && isYarnCluster) {
-      args.files = mergeFileLists(args.files, args.primaryResource)
-      args.files = mergeFileLists(args.files, args.pyFiles)
+    // If we're running a R app, set the main class to our specific R runner
+    if (args.isR && deployMode == CLIENT) {
+      if (args.primaryResource == SPARKR_SHELL) {
+        args.mainClass = "org.apache.spark.api.r.RBackend"
+      } else {
+        // If a R file is provided, add it to the child arguments and list of files to deploy.
+        // Usage: PythonAppRunner  [app arguments]
--- End diff --

Fixed now
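
(For reference, a hedged note: assuming the final merged SparkSubmit.scala, the corrected comment reads roughly as below.)

```
// Usage: RRunner <main R file> [app arguments]
```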





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847862
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.HashMap
+
+import io.netty.channel.ChannelHandler.Sharable
+import io.netty.channel.{ChannelHandlerContext, SimpleChannelInboundHandler}
+
+import org.apache.spark.Logging
+import org.apache.spark.api.r.SerDe._
+
+/**
+ * Handler for RBackend
+ * TODO: This is marked as sharable to get a handle to RBackend. Is it safe to re-use
+ * this across connections ?
+ */
+@Sharable
+private[r] class RBackendHandler(server: RBackend)
+  extends SimpleChannelInboundHandler[Array[Byte]] with Logging {
+
+  override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]) {
+    val bis = new ByteArrayInputStream(msg)
+    val dis = new DataInputStream(bis)
+
+    val bos = new ByteArrayOutputStream()
+    val dos = new DataOutputStream(bos)
+
+    // First bit is isStatic
+    val isStatic = readBoolean(dis)
+    val objId = readString(dis)
+    val methodName = readString(dis)
+    val numArgs = readInt(dis)
+
+    if (objId == "SparkRHandler") {
+      methodName match {
+        case "stopBackend" =>
+          writeInt(dos, 0)
+          writeType(dos, "void")
+          server.close()
+        case "rm" =>
+          try {
+            val t = readObjectType(dis)
+            assert(t == 'c')
+            val objToRemove = readString(dis)
+            JVMObjectTracker.remove(objToRemove)
+            writeInt(dos, 0)
+            writeObject(dos, null)
+          } catch {
+            case e: Exception =>
+              logError(s"Removing $objId failed", e)
+              writeInt(dos, -1)
+          }
+        case _ => dos.writeInt(-1)
+      }
+    } else {
+      handleMethodCall(isStatic, objId, methodName, numArgs, dis, dos)
+    }
+
+    val reply = bos.toByteArray
+    ctx.write(reply)
+  }
+
+  override def channelReadComplete(ctx: ChannelHandlerContext) {
+    ctx.flush()
+  }
+
+  override def exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
+    // Close the connection when an exception is raised.
+    cause.printStackTrace()
+    ctx.close()
+  }
+
+  def handleMethodCall(
+      isStatic: Boolean,
+      objId: String,
+      methodName: String,
+      numArgs: Int,
+      dis: DataInputStream,
+      dos: DataOutputStream) {
+    var obj: Object = null
+    try {
+      val cls = if (isStatic) {
+        Class.forName(objId)
+      } else {
+        JVMObjectTracker.get(objId) match {
+          case None => throw new IllegalArgumentException("Object not found " + objId)
+          case Some(o) =>
+            obj = o
+            o.getClass
+        }
+      }
+
+      val args = readArgs(numArgs, dis)
+
+      val methods = cls.getMethods
+      val selectedMethods = methods.filter(m => m.getName == methodName)
+      if (selectedMethods.length > 0) {
+        val methods = selectedMethods.filter { x =>
+          matchMethod(numArgs, args, x.getParameterTypes)
+        }
+        if (methods.isEmpty) {
+          logWarning(s"cannot find matching method ${cls}.$methodName. "
+            + s"Candidates are:")
+          selectedMethods.foreach { method =>
+            logWarning(s"$methodName(${method.getParameterTypes.mkString(",")})")
+          }
+          throw new Exception(s"No matched method found for $cls.$methodName")
+        }
+        val ret = methods.

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847889
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -243,6 +258,36 @@
 return pyargs;
   }
 
+  private List<String> buildSparkRCommand(Map<String, String> env) throws IOException {
+if (!appArgs.isEmpty() && appArgs.get(0).endsWith(".R")) {
+  appResource = appArgs.get(0);
+  appArgs.remove(0);
+  return buildCommand(env);
+}
+
+Properties props = loadPropertiesFile();
+mergeEnvPathList(env, getLibPathEnvName(),
+firstNonEmptyValue(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, 
conf, props));
+
+// Store spark-submit arguments in an environment variable, since 
there's no way to pass
+// them to sparkR on the command line.
+StringBuilder submitArgs = new StringBuilder();
+for (String arg : buildSparkSubmitArgs()) {
+  if (submitArgs.length() > 0) {
+submitArgs.append(" ");
+  }
+  submitArgs.append(quoteForPython(arg));
--- End diff --

I refactored this into a new function. Let me know if this looks okay (I'm 
not very familiar with the code here)
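
For context, a minimal Scala-flavored sketch of what such a shared helper could look
like (the real launcher code is Java, and the names `quoteArg` and
`constructEnvVarArgs` here are hypothetical, not necessarily what the patch uses):

```
import scala.collection.mutable

object SubmitArgsEnv {
  // Assumed stand-in for the launcher's quoting helper: wrap the arg in
  // double quotes and escape embedded backslashes and quotes.
  private def quoteArg(arg: String): String =
    "\"" + arg.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Join the quoted spark-submit args and expose them through a single
  // environment variable, so the PySpark and SparkR shells can share the
  // same code path instead of each duplicating this loop.
  def constructEnvVarArgs(
      env: mutable.Map[String, String],
      envVarName: String,
      submitArgs: Seq[String]): Unit = {
    env(envVarName) = submitArgs.map(quoteArg).mkString(" ")
  }
}
```

For example, the R branch could call
`SubmitArgsEnv.constructEnvVarArgs(env, "SPARKR_SUBMIT_ARGS", buildSparkSubmitArgs())`
and the Python branch could call the same helper with its own variable name.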



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847877
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -406,7 +438,7 @@ object SparkSubmit {
 // Add the application jar automatically so the user doesn't have to 
call sc.addJar
 // For YARN cluster mode, the jar is already distributed on each node 
as "app.jar"
 // For python files, the primary resource is already distributed as a 
regular file
--- End diff --

done



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847879
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -211,9 +212,9 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
   printUsageAndExit(-1)
 }
 if (primaryResource == null) {
-  SparkSubmit.printErrorAndExit("Must specify a primary resource (JAR 
or Python file)")
+  SparkSubmit.printErrorAndExit("Must specify a primary resource (JAR 
or Python or R file)")
 }
-if (mainClass == null && !isPython) {
+if (mainClass == null && !isPython && !isR) {
--- End diff --

Fixed



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90321290
  
  [Test build #29768 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29768/consoleFull)
 for   PR 5096 at commit 
[`64eda24`](https://github.com/apache/spark/commit/64eda24017d1d05c3dfba3a84c96b557cbd2ca76).



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847867
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRDD.scala ---
@@ -0,0 +1,450 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.ServerSocket
+import java.util.{Map => JMap}
+
+import scala.collection.JavaConversions._
+import scala.io.Source
+import scala.reflect.ClassTag
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.api.java.{JavaPairRDD, JavaRDD, JavaSparkContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private abstract class BaseRRDD[T: ClassTag, U: ClassTag](
+parent: RDD[T],
+numPartitions: Int,
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+rLibDir: String,
+broadcastVars: Array[Broadcast[Object]])
+  extends RDD[U](parent) with Logging {
+  override def getPartitions = parent.partitions
+
+  override def compute(partition: Partition, context: TaskContext): 
Iterator[U] = {
+
+// The parent may be also an RRDD, so we should launch it first.
+val parentIterator = firstParent[T].iterator(partition, context)
+
+// we expect two connections
+val serverSocket = new ServerSocket(0, 2)
+val listenPort = serverSocket.getLocalPort()
+
+// The stdout/stderr is shared by multiple tasks, because we use one 
daemon
+// to launch child process as worker.
+val errThread = RRDD.createRWorker(rLibDir, listenPort)
+
+// We use two sockets to separate input and output, then it's easy to 
manage
+// the lifecycle of them to avoid deadlock.
+// TODO: optimize it to use one socket
+
+// the socket used to send out the input of task
+serverSocket.setSoTimeout(10000)
+val inSocket = serverSocket.accept()
+startStdinThread(inSocket.getOutputStream(), parentIterator, 
partition.index)
+
+// the socket used to receive the output of task
+val outSocket = serverSocket.accept()
+val inputStream = new BufferedInputStream(outSocket.getInputStream)
+val dataStream = openDataStream(inputStream)
+serverSocket.close()
+
+try {
+
+  return new Iterator[U] {
+def next(): U = {
+  val obj = _nextObj
+  if (hasNext) {
+_nextObj = read()
+  }
+  obj
+}
+
+var _nextObj = read()
+
+def hasNext(): Boolean = {
+  val hasMore = (_nextObj != null)
+  if (!hasMore) {
+dataStream.close()
+  }
+  hasMore
+}
+  }
+} catch {
+  case e: Exception =>
+throw new SparkException("R computation failed with\n " + 
errThread.getLines())
+}
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread[T](
+output: OutputStream,
+iter: Iterator[T],
+partition: Int) = {
--- End diff --

Done



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847838
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala 
---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, 
DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.HashMap
+
+import io.netty.channel.ChannelHandler.Sharable
+import io.netty.channel.{ChannelHandlerContext, 
SimpleChannelInboundHandler}
+
+import org.apache.spark.Logging
+import org.apache.spark.api.r.SerDe._
+
+/**
+ * Handler for RBackend
+ * TODO: This is marked as sharable to get a handle to RBackend. Is it 
safe to re-use
+ * this across connections ?
+ */
+@Sharable
+private[r] class RBackendHandler(server: RBackend)
+  extends SimpleChannelInboundHandler[Array[Byte]] with Logging {
+
+  override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]) {
+val bis = new ByteArrayInputStream(msg)
+val dis = new DataInputStream(bis)
+
+val bos = new ByteArrayOutputStream()
+val dos = new DataOutputStream(bos)
+
+// First bit is isStatic
+val isStatic = readBoolean(dis)
+val objId = readString(dis)
+val methodName = readString(dis)
+val numArgs = readInt(dis)
+
+if (objId == "SparkRHandler") {
+  methodName match {
+case "stopBackend" =>
+  writeInt(dos, 0)
+  writeType(dos, "void")
+  server.close()
+case "rm" =>
+  try {
+val t = readObjectType(dis)
+assert(t == 'c')
+val objToRemove = readString(dis)
+JVMObjectTracker.remove(objToRemove)
+writeInt(dos, 0)
+writeObject(dos, null)
+  } catch {
+case e: Exception =>
+  logError(s"Removing $objId failed", e)
+  writeInt(dos, -1)
+  }
+case _ => dos.writeInt(-1)
+  }
+} else {
+  handleMethodCall(isStatic, objId, methodName, numArgs, dis, dos)
+}
+
+val reply = bos.toByteArray
+ctx.write(reply)
+  }
+  
+  override def channelReadComplete(ctx: ChannelHandlerContext) {
+ctx.flush()
+  }
+
+  override def exceptionCaught(ctx: ChannelHandlerContext, cause: 
Throwable) {
+// Close the connection when an exception is raised.
+cause.printStackTrace()
+ctx.close()
+  }
+
+  def handleMethodCall(
+  isStatic: Boolean,
+  objId: String,
+  methodName: String,
+  numArgs: Int,
+  dis: DataInputStream,
+  dos: DataOutputStream) {
+var obj: Object = null
+try {
+  val cls = if (isStatic) {
+Class.forName(objId)
+  } else {
+JVMObjectTracker.get(objId) match {
+  case None => throw new IllegalArgumentException("Object not 
found " + objId)
+  case Some(o) =>
+obj = o
+o.getClass
+}
+  }
+
+  val args = readArgs(numArgs, dis)
+
+  val methods = cls.getMethods
+  val selectedMethods = methods.filter(m => m.getName == methodName)
+  if (selectedMethods.length > 0) {
+val methods = selectedMethods.filter { x =>
+  matchMethod(numArgs, args, x.getParameterTypes)
+}
+if (methods.isEmpty) {
+  logWarning(s"cannot find matching method ${cls}.$methodName. "
++ s"Candidates are:")
+  selectedMethods.foreach { method =>
+
logWarning(s"$methodName(${method.getParameterTypes.mkString(",")})")
+  }
+  throw new Exception(s"No matched method found for 
$cls.$methodName")
+}
+val ret = methods.

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847834
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala 
---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, 
DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.HashMap
+
+import io.netty.channel.ChannelHandler.Sharable
+import io.netty.channel.{ChannelHandlerContext, 
SimpleChannelInboundHandler}
+
+import org.apache.spark.Logging
+import org.apache.spark.api.r.SerDe._
+
+/**
+ * Handler for RBackend
+ * TODO: This is marked as sharable to get a handle to RBackend. Is it 
safe to re-use
+ * this across connections ?
+ */
+@Sharable
+private[r] class RBackendHandler(server: RBackend)
+  extends SimpleChannelInboundHandler[Array[Byte]] with Logging {
+
+  override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]) {
--- End diff --

Thanks  - Done now



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847828
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackend.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{DataOutputStream, File, FileOutputStream, IOException}
+import java.net.{InetSocketAddress, ServerSocket}
+import java.util.concurrent.TimeUnit
+
+import io.netty.bootstrap.ServerBootstrap
+import io.netty.channel.{ChannelFuture, ChannelInitializer, EventLoopGroup}
+import io.netty.channel.nio.NioEventLoopGroup
+import io.netty.channel.socket.SocketChannel
+import io.netty.channel.socket.nio.NioServerSocketChannel
+import io.netty.handler.codec.LengthFieldBasedFrameDecoder
+import io.netty.handler.codec.bytes.{ByteArrayDecoder, ByteArrayEncoder}
+
+import org.apache.spark.Logging
+
+/**
+ * Netty-based backend server that is used to communicate between R and 
Java.
+ */
+private[spark] class RBackend {
+
+  private[this] var channelFuture: ChannelFuture = null
+  private[this] var bootstrap: ServerBootstrap = null
+  private[this] var bossGroup: EventLoopGroup = null
+
+  def init(): Int = {
+bossGroup = new NioEventLoopGroup(2)
+val workerGroup = bossGroup
+val handler = new RBackendHandler(this)
+  
+bootstrap = new ServerBootstrap()
+  .group(bossGroup, workerGroup)
+  .channel(classOf[NioServerSocketChannel])
+  
+bootstrap.childHandler(new ChannelInitializer[SocketChannel]() {
+  def initChannel(ch: SocketChannel) = {
+ch.pipeline()
+  .addLast("encoder", new ByteArrayEncoder())
+  .addLast("frameDecoder",
+// maxFrameLength = 2G
+// lengthFieldOffset = 0
+// lengthFieldLength = 4
+// lengthAdjustment = 0
+// initialBytesToStrip = 4, i.e. strip out the length field 
itself
+new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 
4))
+  .addLast("decoder", new ByteArrayDecoder())
+  .addLast("handler", handler)
+  }
+})
+
+channelFuture = bootstrap.bind(new InetSocketAddress(0))
+channelFuture.syncUninterruptibly()
+
channelFuture.channel().localAddress().asInstanceOf[InetSocketAddress].getPort()
+  }
+
+  def run(): Unit = {
+channelFuture.channel.closeFuture().syncUninterruptibly()
+  }
+
+  def close(): Unit = {
+if (channelFuture != null) {
+  // close is a local operation and should finish within milliseconds; 
timeout just to be safe
+  channelFuture.channel().close().awaitUninterruptibly(10, 
TimeUnit.SECONDS)
+  channelFuture = null
+}
+if (bootstrap != null && bootstrap.group() != null) {
+  bootstrap.group().shutdownGracefully()
+}
+if (bootstrap != null && bootstrap.childGroup() != null) {
+  bootstrap.childGroup().shutdownGracefully()
+}
+bootstrap = null
+  }
+
+}
+
+private[spark] object RBackend extends Logging {
+  def main(args: Array[String]) {
+if (args.length < 1) {
+  System.err.println("Usage: RBackend <tempFilePath>")
+  System.exit(-1)
+}
+val sparkRBackend = new RBackend()
+try {
+  // bind to random port
+  val boundPort = sparkRBackend.init()
+  val serverSocket = new ServerSocket(0, 1)
+  val listenPort = serverSocket.getLocalPort()
+
+  // tell the R process via temporary file
+  val path = args(0)
+  val f = new File(path + ".tmp")
+  val dos = new DataOutputStream(new FileOutputStream(f))
+  dos.writeInt(boundPort)
+  dos.writeInt(listenPort)
+  dos.close()
+  f.renameTo(new File(path))
+
+  // wait for the end of stdin, then exit
+  new Thread("wait 

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847796
  
--- Diff: R/create-docs.sh ---
@@ -0,0 +1,46 @@
+#!/bin/bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Script to create API docs for SparkR
+# This requires `devtools` and `knitr` to be installed on the machine.
+
+# After running this script the html docs can be found in 
+# $SPARK_HOME/R/pkg/html
+
+# Figure out where the script is
+export FWDIR="$(cd "`dirname "$0"`"; pwd)"
--- End diff --

Done



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847808
  
--- Diff: bin/sparkR2.cmd ---
@@ -0,0 +1,26 @@
+@echo off
+
+rem
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements.  See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License.  You may obtain a copy of the License at
+rem
+remhttp://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+rem
+
+rem Figure out where the Spark framework is installed
+set SPARK_HOME=%~dp0..
+
+rem Load environment variables from conf\spark-env.cmd, if it exists
+if exist "%SPARK_HOME%\conf\spark-env.cmd" call 
"%SPARK_HOME%\conf\spark-env.cmd"
--- End diff --

I followed the changes in #5328 now



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847771
  
--- Diff: R/README.md ---
@@ -0,0 +1,73 @@
+# R on Spark
+
+SparkR is an R package that provides a light-weight frontend to use Spark 
from R.
+
+### SparkR development
+
+#### Build Spark
+
+Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn)
and include the `-Psparkr` profile to build the R package. For example, to use
the default Hadoop versions you can run
+```
+  build/mvn -DskipTests -Psparkr package
+```
+
+#### Running sparkR
+
+You can start using SparkR by launching the SparkR shell with
+
+./bin/sparkR
+
+The `sparkR` script automatically creates a SparkContext with Spark by 
default in
+local mode. To specify the Spark master of a cluster for the automatically 
created
+SparkContext, you can run
+
+./bin/sparkR --master "local[2]"
+
+To set other options like driver memory, executor memory etc. you can pass 
in the 
[spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html)
 arguments to `./bin/sparkR`
+
+#### Using SparkR from RStudio
+
+If you wish to use SparkR from RStudio or other R frontends you will need 
to set some environment variables which point SparkR to your Spark 
installation. For example 
+```
+# Set this to where Spark is installed
+Sys.setenv(SPARK_HOME="/Users/shivaram/spark")
+# This line loads SparkR from the installed directory
+.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
+library(SparkR)
+sc <- sparkR.init(master="local")
+```
+
+#### Making changes to SparkR
+
+The 
[instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
 for making contributions to Spark also apply to SparkR.
+If you only make R file changes (i.e. no Scala changes) then you can just 
re-install the R package using `R/install-dev.sh` and test your changes.
+Once you have made your changes, please include unit tests for them and 
run existing unit tests using the `run-tests.sh` script as described below. 
+
+#### Generating documentation
+
+The SparkR documentation (Rd files and HTML files) is not part of the source
repository. To generate it, run the script `R/create-docs.sh`. This script uses
`devtools` and `knitr` to generate the docs, and these packages need to be
installed on the machine before using the script.
+
+### Examples, Unit tests
+
+SparkR comes with several sample programs in the `examples/src/main/r` 
directory.
+To run one of them, use `./bin/sparkR <filename> <args>`. For example:
+
+./bin/sparkR examples/src/main/r/pi.R local[2]
+
+You can also run the unit-tests for SparkR by running (you need to install 
the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) 
package first):
+
+R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
+./R/run-tests.sh
+
+### Running on YARN
+The `./bin/spark-submit` and `./bin/sparkR` can also be used to submit 
jobs to YARN clusters. You will need to set YARN conf dir before doing so. For 
example on CDH you can run
+```
+export YARN_CONF_DIR=/etc/hadoop/conf
+./bin/spark-submit --master yarn examples/src/main/r/pi.R 4
+```
+
+### Report Issues/Feedback 
+
+For better tracking and collaboration, issues and TODO items are reported 
to a dedicated [SparkR JIRA](https://sparkr.atlassian.net/browse/SPARKR/).
--- End diff --

Done



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847768
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -469,6 +469,9 @@ private[spark] class ApplicationMaster(
   System.setProperty("spark.submit.pyFiles",
 PythonRunner.formatPaths(args.pyFiles).mkString(","))
 }
+if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+  // TODO(davies): add R dependencies here
--- End diff --

Well, this is yes and no: in YARN cluster mode we construct the correct
command line, but the launcher script can't find the SparkR location because
we currently use SPARK_HOME to find the install location. I'll open a JIRA for
this and describe the details.



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27847734
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -0,0 +1,35 @@
+Package: SparkR
+Type: Package
+Title: R frontend for Spark
+Version: 0.1
--- End diff --

Made it 1.4 - R doesn't like having `-SNAPSHOT` in the version number :)



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27841793
  
--- Diff: R/SparkR_prep-0.1.sh ---
@@ -0,0 +1,52 @@
+#!/bin/sh
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Create and move to a new directory that can be easily cleaned up
+mkdir build_SparkR
--- End diff --

This was a helper script for a getting-started guide. I think we can maintain
it in a wiki or somewhere else; we don't need it in the tree. Removing it.



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27841627
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -469,6 +469,9 @@ private[spark] class ApplicationMaster(
   System.setProperty("spark.submit.pyFiles",
 PythonRunner.formatPaths(args.pyFiles).mkString(","))
 }
+if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+  // TODO(davies): add R dependencies here
--- End diff --

In YARN cluster mode we will launch the Runner with the right R file etc.
However, the SparkR package won't be found unless it's already installed on
all the machines. I'll open a JIRA for this.



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90280783
  
  [Test build #29758 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29758/consoleFull)
 for   PR 5096 at commit 
[`eb5da53`](https://github.com/apache/spark/commit/eb5da53d6bda72e1d181cd4e7ef3b8e48308d101).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch **removes the following dependencies:**
   * `RoaringBitmap-0.4.5.jar`
   * `activation-1.1.jar`
   * `akka-actor_2.10-2.3.4-spark.jar`
   * `akka-remote_2.10-2.3.4-spark.jar`
   * `akka-slf4j_2.10-2.3.4-spark.jar`
   * `aopalliance-1.0.jar`
   * `arpack_combined_all-0.1.jar`
   * `avro-1.7.7.jar`
   * `breeze-macros_2.10-0.11.2.jar`
   * `breeze_2.10-0.11.2.jar`
   * `chill-java-0.5.0.jar`
   * `chill_2.10-0.5.0.jar`
   * `commons-beanutils-1.7.0.jar`
   * `commons-beanutils-core-1.8.0.jar`
   * `commons-cli-1.2.jar`
   * `commons-codec-1.10.jar`
   * `commons-collections-3.2.1.jar`
   * `commons-compress-1.4.1.jar`
   * `commons-configuration-1.6.jar`
   * `commons-digester-1.8.jar`
   * `commons-httpclient-3.1.jar`
   * `commons-io-2.1.jar`
   * `commons-lang-2.5.jar`
   * `commons-lang3-3.3.2.jar`
   * `commons-math-2.1.jar`
   * `commons-math3-3.1.1.jar`
   * `commons-net-2.2.jar`
   * `compress-lzf-1.0.0.jar`
   * `config-1.2.1.jar`
   * `core-1.1.2.jar`
   * `curator-client-2.4.0.jar`
   * `curator-framework-2.4.0.jar`
   * `curator-recipes-2.4.0.jar`
   * `gmbal-api-only-3.0.0-b023.jar`
   * `grizzly-framework-2.1.2.jar`
   * `grizzly-http-2.1.2.jar`
   * `grizzly-http-server-2.1.2.jar`
   * `grizzly-http-servlet-2.1.2.jar`
   * `grizzly-rcm-2.1.2.jar`
   * `groovy-all-2.3.7.jar`
   * `guava-14.0.1.jar`
   * `guice-3.0.jar`
   * `hadoop-annotations-2.2.0.jar`
   * `hadoop-auth-2.2.0.jar`
   * `hadoop-client-2.2.0.jar`
   * `hadoop-common-2.2.0.jar`
   * `hadoop-hdfs-2.2.0.jar`
   * `hadoop-mapreduce-client-app-2.2.0.jar`
   * `hadoop-mapreduce-client-common-2.2.0.jar`
   * `hadoop-mapreduce-client-core-2.2.0.jar`
   * `hadoop-mapreduce-client-jobclient-2.2.0.jar`
   * `hadoop-mapreduce-client-shuffle-2.2.0.jar`
   * `hadoop-yarn-api-2.2.0.jar`
   * `hadoop-yarn-client-2.2.0.jar`
   * `hadoop-yarn-common-2.2.0.jar`
   * `hadoop-yarn-server-common-2.2.0.jar`
   * `ivy-2.4.0.jar`
   * `jackson-annotations-2.4.0.jar`
   * `jackson-core-2.4.4.jar`
   * `jackson-core-asl-1.8.8.jar`
   * `jackson-databind-2.4.4.jar`
   * `jackson-jaxrs-1.8.8.jar`
   * `jackson-mapper-asl-1.8.8.jar`
   * `jackson-module-scala_2.10-2.4.4.jar`
   * `jackson-xc-1.8.8.jar`
   * `jansi-1.4.jar`
   * `javax.inject-1.jar`
   * `javax.servlet-3.0.0.v201112011016.jar`
   * `javax.servlet-3.1.jar`
   * `javax.servlet-api-3.0.1.jar`
   * `jaxb-api-2.2.2.jar`
   * `jaxb-impl-2.2.3-1.jar`
   * `jcl-over-slf4j-1.7.10.jar`
   * `jersey-client-1.9.jar`
   * `jersey-core-1.9.jar`
   * `jersey-grizzly2-1.9.jar`
   * `jersey-guice-1.9.jar`
   * `jersey-json-1.9.jar`
   * `jersey-server-1.9.jar`
   * `jersey-test-framework-core-1.9.jar`
   * `jersey-test-framework-grizzly2-1.9.jar`
   * `jets3t-0.7.1.jar`
   * `jettison-1.1.jar`
   * `jetty-util-6.1.26.jar`
   * `jline-0.9.94.jar`
   * `jline-2.10.4.jar`
   * `jodd-core-3.6.3.jar`
   * `json4s-ast_2.10-3.2.10.jar`
   * `json4s-core_2.10-3.2.10.jar`
   * `json4s-jackson_2.10-3.2.10.jar`
   * `jsr305-1.3.9.jar`
   * `jtransforms-2.4.0.jar`
   * `jul-to-slf4j-1.7.10.jar`
   * `kryo-2.21.jar`
   * `log4j-1.2.17.jar`
   * `lz4-1.2.0.jar`
   * `management-api-3.0.0-b012.jar`
   * `mesos-0.21.0-shaded-protobuf.jar`
   * `metrics-core-3.1.0.jar`
   * `metrics-graphite-3.1.0.jar`
   * `metrics-json-3.1.0.jar`
   * `metrics-jvm-3.1.0.jar`
   * `minlog-1.2.jar`
   * `netty-3.8.0.Final.jar`
   * `netty-all-4.0.23.Final.jar`
   * `objenesis-1.2.jar`
   * `opencsv-2.3.jar`
   * `oro-2.0.8.jar`
   * `paranamer-2.6.jar`
   * `parquet-column-1.6.0rc3.jar`
   * `parquet-common-1.6.0rc3.jar`
   * `parquet-encoding-1.6.0rc3.jar`
   * `parquet-format-2.2.0-rc1.jar`
   * `parquet-generator-1.6.0rc3.jar`
   * `parquet-hadoop-1.6.0rc3.jar`
   * `parquet-jackson-1.6.0rc3.jar`
   * `protobuf-java-2.4.1.jar`
   * `protobuf-java-2.5.0-spark.jar`
   * `py4j-0.8.2.1.jar`
   * `pyrolite-2.0.1.jar`
   * `quasiquotes_2.10-2.0.1.jar`
   * `reflectasm-1.07-shaded.jar`
   * `scala-compiler-2.10.4.jar`
   * `scala-library-2.10.4.jar`
   *

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90280801
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29758/
Test FAILed.



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90275418
  
  [Test build #29758 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29758/consoleFull)
 for   PR 5096 at commit 
[`eb5da53`](https://github.com/apache/spark/commit/eb5da53d6bda72e1d181cd4e7ef3b8e48308d101).



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90163530
  
  [Test build #29746 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29746/consoleFull)
 for   PR 5096 at commit 
[`5133f3a`](https://github.com/apache/spark/commit/5133f3ae448b992fcaedc19d834aed9a649aa740).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch **removes the following dependencies:**
   * `RoaringBitmap-0.4.5.jar`
   * `activation-1.1.jar`
   * `akka-actor_2.10-2.3.4-spark.jar`
   * `akka-remote_2.10-2.3.4-spark.jar`
   * `akka-slf4j_2.10-2.3.4-spark.jar`
   * `aopalliance-1.0.jar`
   * `arpack_combined_all-0.1.jar`
   * `avro-1.7.7.jar`
   * `breeze-macros_2.10-0.11.2.jar`
   * `breeze_2.10-0.11.2.jar`
   * `chill-java-0.5.0.jar`
   * `chill_2.10-0.5.0.jar`
   * `commons-beanutils-1.7.0.jar`
   * `commons-beanutils-core-1.8.0.jar`
   * `commons-cli-1.2.jar`
   * `commons-codec-1.10.jar`
   * `commons-collections-3.2.1.jar`
   * `commons-compress-1.4.1.jar`
   * `commons-configuration-1.6.jar`
   * `commons-digester-1.8.jar`
   * `commons-httpclient-3.1.jar`
   * `commons-io-2.1.jar`
   * `commons-lang-2.5.jar`
   * `commons-lang3-3.3.2.jar`
   * `commons-math-2.1.jar`
   * `commons-math3-3.1.1.jar`
   * `commons-net-2.2.jar`
   * `compress-lzf-1.0.0.jar`
   * `config-1.2.1.jar`
   * `core-1.1.2.jar`
   * `curator-client-2.4.0.jar`
   * `curator-framework-2.4.0.jar`
   * `curator-recipes-2.4.0.jar`
   * `gmbal-api-only-3.0.0-b023.jar`
   * `grizzly-framework-2.1.2.jar`
   * `grizzly-http-2.1.2.jar`
   * `grizzly-http-server-2.1.2.jar`
   * `grizzly-http-servlet-2.1.2.jar`
   * `grizzly-rcm-2.1.2.jar`
   * `groovy-all-2.3.7.jar`
   * `guava-14.0.1.jar`
   * `guice-3.0.jar`
   * `hadoop-annotations-2.2.0.jar`
   * `hadoop-auth-2.2.0.jar`
   * `hadoop-client-2.2.0.jar`
   * `hadoop-common-2.2.0.jar`
   * `hadoop-hdfs-2.2.0.jar`
   * `hadoop-mapreduce-client-app-2.2.0.jar`
   * `hadoop-mapreduce-client-common-2.2.0.jar`
   * `hadoop-mapreduce-client-core-2.2.0.jar`
   * `hadoop-mapreduce-client-jobclient-2.2.0.jar`
   * `hadoop-mapreduce-client-shuffle-2.2.0.jar`
   * `hadoop-yarn-api-2.2.0.jar`
   * `hadoop-yarn-client-2.2.0.jar`
   * `hadoop-yarn-common-2.2.0.jar`
   * `hadoop-yarn-server-common-2.2.0.jar`
   * `ivy-2.4.0.jar`
   * `jackson-annotations-2.4.0.jar`
   * `jackson-core-2.4.4.jar`
   * `jackson-core-asl-1.8.8.jar`
   * `jackson-databind-2.4.4.jar`
   * `jackson-jaxrs-1.8.8.jar`
   * `jackson-mapper-asl-1.8.8.jar`
   * `jackson-module-scala_2.10-2.4.4.jar`
   * `jackson-xc-1.8.8.jar`
   * `jansi-1.4.jar`
   * `javax.inject-1.jar`
   * `javax.servlet-3.0.0.v201112011016.jar`
   * `javax.servlet-3.1.jar`
   * `javax.servlet-api-3.0.1.jar`
   * `jaxb-api-2.2.2.jar`
   * `jaxb-impl-2.2.3-1.jar`
   * `jcl-over-slf4j-1.7.10.jar`
   * `jersey-client-1.9.jar`
   * `jersey-core-1.9.jar`
   * `jersey-grizzly2-1.9.jar`
   * `jersey-guice-1.9.jar`
   * `jersey-json-1.9.jar`
   * `jersey-server-1.9.jar`
   * `jersey-test-framework-core-1.9.jar`
   * `jersey-test-framework-grizzly2-1.9.jar`
   * `jets3t-0.7.1.jar`
   * `jettison-1.1.jar`
   * `jetty-util-6.1.26.jar`
   * `jline-0.9.94.jar`
   * `jline-2.10.4.jar`
   * `jodd-core-3.6.3.jar`
   * `json4s-ast_2.10-3.2.10.jar`
   * `json4s-core_2.10-3.2.10.jar`
   * `json4s-jackson_2.10-3.2.10.jar`
   * `jsr305-1.3.9.jar`
   * `jtransforms-2.4.0.jar`
   * `jul-to-slf4j-1.7.10.jar`
   * `kryo-2.21.jar`
   * `log4j-1.2.17.jar`
   * `lz4-1.2.0.jar`
   * `management-api-3.0.0-b012.jar`
   * `mesos-0.21.0-shaded-protobuf.jar`
   * `metrics-core-3.1.0.jar`
   * `metrics-graphite-3.1.0.jar`
   * `metrics-json-3.1.0.jar`
   * `metrics-jvm-3.1.0.jar`
   * `minlog-1.2.jar`
   * `netty-3.8.0.Final.jar`
   * `netty-all-4.0.23.Final.jar`
   * `objenesis-1.2.jar`
   * `opencsv-2.3.jar`
   * `oro-2.0.8.jar`
   * `paranamer-2.6.jar`
   * `parquet-column-1.6.0rc3.jar`
   * `parquet-common-1.6.0rc3.jar`
   * `parquet-encoding-1.6.0rc3.jar`
   * `parquet-format-2.2.0-rc1.jar`
   * `parquet-generator-1.6.0rc3.jar`
   * `parquet-hadoop-1.6.0rc3.jar`
   * `parquet-jackson-1.6.0rc3.jar`
   * `protobuf-java-2.4.1.jar`
   * `protobuf-java-2.5.0-spark.jar`
   * `py4j-0.8.2.1.jar`
   * `pyrolite-2.0.1.jar`
   * `quasiquotes_2.10-2.0.1.jar`
   * `reflectasm-1.07-shaded.jar`
   * `scala-compiler-2.10.4.jar`
   * `scala-library-2.10.4.jar`
   *

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90163536
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29746/
Test FAILed.



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-90160377
  
  [Test build #29746 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29746/consoleFull)
 for   PR 5096 at commit 
[`5133f3a`](https://github.com/apache/spark/commit/5133f3ae448b992fcaedc19d834aed9a649aa740).



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-89479811
  
Spark submit changes look fine for the most part. The comments I left are 
mostly minor. One thing I noticed is that there are a lot of places where the 
return type is not specified. Even if a method returns nothing, we should
enforce adding `: Unit` to the signature so that developers new to the project
will do the same.

I intend to take another pass after the comments are addressed. I have not 
looked in detail at changes outside of Spark submit.
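
To illustrate the convention being asked for, a minimal sketch (the method
names here are made up for the example):

```
// Preferred: the return type is spelled out even though it is Unit.
def stopBackend(): Unit = {
  println("stopping backend")
}

// Discouraged: compiles to the same thing, but the inferred Unit return
// type is easy to miss when reading the code.
def stopBackendInferred() = {
  println("stopping backend")
}
```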



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765342
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala
 ---
@@ -92,6 +97,7 @@ class ApplicationMasterArguments(val args: Array[String]) 
{
   |  --jar JAR_PATH   Path to your application's JAR file
   |  --class CLASS_NAME   Name of your application's main class
   |  --primary-py-fileA main Python file
+  |  --primary-r-file A main R file
--- End diff --

IIUC if the user wants to run R then s/he shouldn't have Python files. Maybe
we should validate that only one of `primaryRFile` and `primaryPyFile` can be
set at a given time.
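
A sketch of that validation, assuming it lives next to the existing argument
checks (the exact message and error-handling style here are illustrative, not
taken from the patch):

```
// Only one of a primary Python file and a primary R file may be set.
if (primaryPyFile != null && primaryRFile != null) {
  System.err.println("Cannot have primary-py-file and primary-r-file" +
    " at the same time")
  System.exit(-1)
}
```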



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765330
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -497,12 +503,15 @@ private[spark] class Client(
 if (args.primaryPyFile != null && args.primaryPyFile.endsWith(".py")) {
   args.userArgs = ArrayBuffer(args.primaryPyFile, args.pyFiles) ++ 
args.userArgs
 }
+if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+  args.userArgs = ArrayBuffer(args.primaryRFile) ++ args.userArgs
+}
 val userArgs = args.userArgs.flatMap { arg =>
   Seq("--arg", YarnSparkHadoopUtil.escapeForShell(arg))
 }
 val amArgs =
-  Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ pyFiles ++ 
userArgs ++
-Seq(
+  Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ pyFiles ++ 
primaryRFile ++
--- End diff --

this is getting kind of scary. Can we add a comment or a check that ensures
the Python and the R files are mutually exclusive? (are they?)
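
One way to make the invariant explicit before assembling `amArgs` would be a
fail-fast check like the sketch below (assuming the two primaries are indeed
mutually exclusive):

```
// Fail fast if both a Python and an R primary resource were passed through.
require(args.primaryPyFile == null || args.primaryRFile == null,
  "Cannot specify both a primary Python file and a primary R file")
```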



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765306
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -469,6 +469,9 @@ private[spark] class ApplicationMaster(
   System.setProperty("spark.submit.pyFiles",
 PythonRunner.formatPaths(args.pyFiles).mkString(","))
 }
+if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+  // TODO(davies): add R dependencies here
--- End diff --

wait, so will this work on YARN?



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765296
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -243,6 +258,36 @@
 return pyargs;
   }
 
+  private List<String> buildSparkRCommand(Map<String, String> env) throws IOException {
+if (!appArgs.isEmpty() && appArgs.get(0).endsWith(".R")) {
+  appResource = appArgs.get(0);
+  appArgs.remove(0);
+  return buildCommand(env);
+}
+
+Properties props = loadPropertiesFile();
+mergeEnvPathList(env, getLibPathEnvName(),
+firstNonEmptyValue(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, 
conf, props));
+
+// Store spark-submit arguments in an environment variable, since 
there's no way to pass
+// them to sparkR on the command line.
+StringBuilder submitArgs = new StringBuilder();
+for (String arg : buildSparkSubmitArgs()) {
+  if (submitArgs.length() > 0) {
+submitArgs.append(" ");
+  }
+  submitArgs.append(quoteForPython(arg));
--- End diff --

not a big deal, but these few lines are duplicated with the Python path. Could
we have a common method that takes the spark-submit args, escapes them
properly, and puts them in an environment variable?



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765269
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -243,6 +258,36 @@
 return pyargs;
   }
 
+  private List<String> buildSparkRCommand(Map<String, String> env) throws IOException {
+if (!appArgs.isEmpty() && appArgs.get(0).endsWith(".R")) {
+  appResource = appArgs.get(0);
+  appArgs.remove(0);
+  return buildCommand(env);
+}
+
+Properties props = loadPropertiesFile();
+mergeEnvPathList(env, getLibPathEnvName(),
+firstNonEmptyValue(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, 
conf, props));
+
+// Store spark-submit arguments in an environment variable, since 
there's no way to pass
+// them to sparkR on the command line.
+StringBuilder submitArgs = new StringBuilder();
+for (String arg : buildSparkSubmitArgs()) {
+  if (submitArgs.length() > 0) {
+submitArgs.append(" ");
+  }
+  submitArgs.append(quoteForPython(arg));
--- End diff --

we should rename this instead of just reusing it from the Python code path



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765229
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -211,9 +212,9 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
   printUsageAndExit(-1)
 }
 if (primaryResource == null) {
-  SparkSubmit.printErrorAndExit("Must specify a primary resource (JAR 
or Python file)")
+  SparkSubmit.printErrorAndExit("Must specify a primary resource (JAR 
or Python or R file)")
 }
-if (mainClass == null && !isPython) {
+if (mainClass == null && !isPython && !isR) {
--- End diff --

how about `if (mainClass == null && isUserJar(primaryResource))`?
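
For reference, a sketch of the suggested form, assuming an `isUserJar` helper
that returns true only when the primary resource is a plain JAR (i.e. not a
Python/R file or a shell marker); the error message is illustrative:

```
if (mainClass == null && isUserJar(primaryResource)) {
  SparkSubmit.printErrorAndExit(
    "No main class set in JAR; please specify one with --class")
}
```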



[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765206
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -406,7 +438,7 @@ object SparkSubmit {
     // Add the application jar automatically so the user doesn't have to call sc.addJar
     // For YARN cluster mode, the jar is already distributed on each node as "app.jar"
     // For python files, the primary resource is already distributed as a regular file
--- End diff --

need to update this to include R




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765167
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -317,11 +328,32 @@ object SparkSubmit {
       }
     }
 
-    // In yarn-cluster mode for a python app, add primary resource and pyFiles to files
-    // that can be distributed with the job
-    if (args.isPython && isYarnCluster) {
-      args.files = mergeFileLists(args.files, args.primaryResource)
-      args.files = mergeFileLists(args.files, args.pyFiles)
+    // If we're running a R app, set the main class to our specific R runner
+    if (args.isR && deployMode == CLIENT) {
+      if (args.primaryResource == SPARKR_SHELL) {
+        args.mainClass = "org.apache.spark.api.r.RBackend"
+      } else {
+        // If a R file is provided, add it to the child arguments and list of files to deploy.
+        // Usage: PythonAppRunner  [app arguments]
--- End diff --

forgot to change this?




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765097
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRDD.scala ---
@@ -0,0 +1,450 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.ServerSocket
+import java.util.{Map => JMap}
+
+import scala.collection.JavaConversions._
+import scala.io.Source
+import scala.reflect.ClassTag
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.api.java.{JavaPairRDD, JavaRDD, JavaSparkContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private abstract class BaseRRDD[T: ClassTag, U: ClassTag](
+    parent: RDD[T],
+    numPartitions: Int,
+    func: Array[Byte],
+    deserializer: String,
+    serializer: String,
+    packageNames: Array[Byte],
+    rLibDir: String,
+    broadcastVars: Array[Broadcast[Object]])
+  extends RDD[U](parent) with Logging {
+  override def getPartitions = parent.partitions
+
+  override def compute(partition: Partition, context: TaskContext): Iterator[U] = {
+
+    // The parent may be also an RRDD, so we should launch it first.
+    val parentIterator = firstParent[T].iterator(partition, context)
+
+    // we expect two connections
+    val serverSocket = new ServerSocket(0, 2)
+    val listenPort = serverSocket.getLocalPort()
+
+    // The stdout/stderr is shared by multiple tasks, because we use one daemon
+    // to launch child process as worker.
+    val errThread = RRDD.createRWorker(rLibDir, listenPort)
+
+    // We use two sockets to separate input and output, then it's easy to manage
+    // the lifecycle of them to avoid deadlock.
+    // TODO: optimize it to use one socket
+
+    // the socket used to send out the input of task
+    serverSocket.setSoTimeout(10000)
+    val inSocket = serverSocket.accept()
+    startStdinThread(inSocket.getOutputStream(), parentIterator, partition.index)
+
+    // the socket used to receive the output of task
+    val outSocket = serverSocket.accept()
+    val inputStream = new BufferedInputStream(outSocket.getInputStream)
+    val dataStream = openDataStream(inputStream)
+    serverSocket.close()
+
+    try {
+
+      return new Iterator[U] {
+        def next(): U = {
+          val obj = _nextObj
+          if (hasNext) {
+            _nextObj = read()
+          }
+          obj
+        }
+
+        var _nextObj = read()
+
+        def hasNext(): Boolean = {
+          val hasMore = (_nextObj != null)
+          if (!hasMore) {
+            dataStream.close()
+          }
+          hasMore
+        }
+      }
+    } catch {
+      case e: Exception =>
+        throw new SparkException("R computation failed with\n " + errThread.getLines())
+    }
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread[T](
+      output: OutputStream,
+      iter: Iterator[T],
+      partition: Int) = {
+
+    val env = SparkEnv.get
+    val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+    val stream = new BufferedOutputStream(output, bufferSize)
+
+    new Thread("writer for R") {
+      override def run() {
+        try {
+          SparkEnv.set(env)
+          val dataOut = new DataOutputStream(stream)
+          dataOut.writeInt(partition)
+
+          SerDe.writeString(dataOut, deserializer)
+          SerDe.writeString(dataOut, serializer)
+
+          dataOut.writeInt(packageNames.length)
+          dataOut.write(packageNames)
+
+          dataOut.writeInt(func.length)
+          dataOut.write(func)
+

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765090
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRDD.scala ---
@@ -0,0 +1,450 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.ServerSocket
+import java.util.{Map => JMap}
+
+import scala.collection.JavaConversions._
+import scala.io.Source
+import scala.reflect.ClassTag
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.api.java.{JavaPairRDD, JavaRDD, JavaSparkContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private abstract class BaseRRDD[T: ClassTag, U: ClassTag](
+    parent: RDD[T],
+    numPartitions: Int,
+    func: Array[Byte],
+    deserializer: String,
+    serializer: String,
+    packageNames: Array[Byte],
+    rLibDir: String,
+    broadcastVars: Array[Broadcast[Object]])
+  extends RDD[U](parent) with Logging {
+  override def getPartitions = parent.partitions
+
+  override def compute(partition: Partition, context: TaskContext): Iterator[U] = {
+
+    // The parent may be also an RRDD, so we should launch it first.
+    val parentIterator = firstParent[T].iterator(partition, context)
+
+    // we expect two connections
+    val serverSocket = new ServerSocket(0, 2)
+    val listenPort = serverSocket.getLocalPort()
+
+    // The stdout/stderr is shared by multiple tasks, because we use one daemon
+    // to launch child process as worker.
+    val errThread = RRDD.createRWorker(rLibDir, listenPort)
+
+    // We use two sockets to separate input and output, then it's easy to manage
+    // the lifecycle of them to avoid deadlock.
+    // TODO: optimize it to use one socket
+
+    // the socket used to send out the input of task
+    serverSocket.setSoTimeout(10000)
+    val inSocket = serverSocket.accept()
+    startStdinThread(inSocket.getOutputStream(), parentIterator, partition.index)
+
+    // the socket used to receive the output of task
+    val outSocket = serverSocket.accept()
+    val inputStream = new BufferedInputStream(outSocket.getInputStream)
+    val dataStream = openDataStream(inputStream)
+    serverSocket.close()
+
+    try {
+
+      return new Iterator[U] {
+        def next(): U = {
+          val obj = _nextObj
+          if (hasNext) {
+            _nextObj = read()
+          }
+          obj
+        }
+
+        var _nextObj = read()
+
+        def hasNext(): Boolean = {
+          val hasMore = (_nextObj != null)
+          if (!hasMore) {
+            dataStream.close()
+          }
+          hasMore
+        }
+      }
+    } catch {
+      case e: Exception =>
+        throw new SparkException("R computation failed with\n " + errThread.getLines())
+    }
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread[T](
+      output: OutputStream,
+      iter: Iterator[T],
+      partition: Int) = {
--- End diff --

`Unit`
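
That is, `startStdinThread` should declare its result type explicitly rather
than rely on procedure syntax. A minimal illustration of the requested style
(general Scala style, not code from this PR):

    class Example {
      def before(n: Int) { println(n) }         // procedure syntax, discouraged
      def after(n: Int): Unit = { println(n) }  // explicit Unit return type
    }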




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765075
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.HashMap
+
+import io.netty.channel.ChannelHandler.Sharable
+import io.netty.channel.{ChannelHandlerContext, SimpleChannelInboundHandler}
+
+import org.apache.spark.Logging
+import org.apache.spark.api.r.SerDe._
+
+/**
+ * Handler for RBackend
+ * TODO: This is marked as sharable to get a handle to RBackend. Is it safe to re-use
+ * this across connections ?
+ */
+@Sharable
+private[r] class RBackendHandler(server: RBackend)
+  extends SimpleChannelInboundHandler[Array[Byte]] with Logging {
+
+  override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]) {
+    val bis = new ByteArrayInputStream(msg)
+    val dis = new DataInputStream(bis)
+
+    val bos = new ByteArrayOutputStream()
+    val dos = new DataOutputStream(bos)
+
+    // First bit is isStatic
+    val isStatic = readBoolean(dis)
+    val objId = readString(dis)
+    val methodName = readString(dis)
+    val numArgs = readInt(dis)
+
+    if (objId == "SparkRHandler") {
+      methodName match {
+        case "stopBackend" =>
+          writeInt(dos, 0)
+          writeType(dos, "void")
+          server.close()
+        case "rm" =>
+          try {
+            val t = readObjectType(dis)
+            assert(t == 'c')
+            val objToRemove = readString(dis)
+            JVMObjectTracker.remove(objToRemove)
+            writeInt(dos, 0)
+            writeObject(dos, null)
+          } catch {
+            case e: Exception =>
+              logError(s"Removing $objId failed", e)
+              writeInt(dos, -1)
+          }
+        case _ => dos.writeInt(-1)
+      }
+    } else {
+      handleMethodCall(isStatic, objId, methodName, numArgs, dis, dos)
+    }
+
+    val reply = bos.toByteArray
+    ctx.write(reply)
+  }
+
+  override def channelReadComplete(ctx: ChannelHandlerContext) {
+    ctx.flush()
+  }
+
+  override def exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
+    // Close the connection when an exception is raised.
+    cause.printStackTrace()
+    ctx.close()
+  }
+
+  def handleMethodCall(
+      isStatic: Boolean,
+      objId: String,
+      methodName: String,
+      numArgs: Int,
+      dis: DataInputStream,
+      dos: DataOutputStream) {
+    var obj: Object = null
+    try {
+      val cls = if (isStatic) {
+        Class.forName(objId)
+      } else {
+        JVMObjectTracker.get(objId) match {
+          case None => throw new IllegalArgumentException("Object not found " + objId)
+          case Some(o) =>
+            obj = o
+            o.getClass
+        }
+      }
+
+      val args = readArgs(numArgs, dis)
+
+      val methods = cls.getMethods
+      val selectedMethods = methods.filter(m => m.getName == methodName)
+      if (selectedMethods.length > 0) {
+        val methods = selectedMethods.filter { x =>
+          matchMethod(numArgs, args, x.getParameterTypes)
+        }
+        if (methods.isEmpty) {
+          logWarning(s"cannot find matching method ${cls}.$methodName. "
+            + s"Candidates are:")
+          selectedMethods.foreach { method =>
+            logWarning(s"$methodName(${method.getParameterTypes.mkString(",")})")
+          }
+          throw new Exception(s"No matched method found for $cls.$methodName")
+        }
+        val ret = method

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765041
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.HashMap
+
+import io.netty.channel.ChannelHandler.Sharable
+import io.netty.channel.{ChannelHandlerContext, SimpleChannelInboundHandler}
+
+import org.apache.spark.Logging
+import org.apache.spark.api.r.SerDe._
+
+/**
+ * Handler for RBackend
+ * TODO: This is marked as sharable to get a handle to RBackend. Is it safe to re-use
+ * this across connections ?
+ */
+@Sharable
+private[r] class RBackendHandler(server: RBackend)
+  extends SimpleChannelInboundHandler[Array[Byte]] with Logging {
+
+  override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]) {
+    val bis = new ByteArrayInputStream(msg)
+    val dis = new DataInputStream(bis)
+
+    val bos = new ByteArrayOutputStream()
+    val dos = new DataOutputStream(bos)
+
+    // First bit is isStatic
+    val isStatic = readBoolean(dis)
+    val objId = readString(dis)
+    val methodName = readString(dis)
+    val numArgs = readInt(dis)
+
+    if (objId == "SparkRHandler") {
+      methodName match {
+        case "stopBackend" =>
+          writeInt(dos, 0)
+          writeType(dos, "void")
+          server.close()
+        case "rm" =>
+          try {
+            val t = readObjectType(dis)
+            assert(t == 'c')
+            val objToRemove = readString(dis)
+            JVMObjectTracker.remove(objToRemove)
+            writeInt(dos, 0)
+            writeObject(dos, null)
+          } catch {
+            case e: Exception =>
+              logError(s"Removing $objId failed", e)
+              writeInt(dos, -1)
+          }
+        case _ => dos.writeInt(-1)
+      }
+    } else {
+      handleMethodCall(isStatic, objId, methodName, numArgs, dis, dos)
+    }
+
+    val reply = bos.toByteArray
+    ctx.write(reply)
+  }
+
+  override def channelReadComplete(ctx: ChannelHandlerContext) {
+    ctx.flush()
+  }
+
+  override def exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
+    // Close the connection when an exception is raised.
+    cause.printStackTrace()
+    ctx.close()
+  }
+
+  def handleMethodCall(
+      isStatic: Boolean,
+      objId: String,
+      methodName: String,
+      numArgs: Int,
+      dis: DataInputStream,
+      dos: DataOutputStream) {
+    var obj: Object = null
+    try {
+      val cls = if (isStatic) {
+        Class.forName(objId)
+      } else {
+        JVMObjectTracker.get(objId) match {
+          case None => throw new IllegalArgumentException("Object not found " + objId)
+          case Some(o) =>
+            obj = o
+            o.getClass
+        }
+      }
+
+      val args = readArgs(numArgs, dis)
+
+      val methods = cls.getMethods
+      val selectedMethods = methods.filter(m => m.getName == methodName)
+      if (selectedMethods.length > 0) {
+        val methods = selectedMethods.filter { x =>
+          matchMethod(numArgs, args, x.getParameterTypes)
+        }
+        if (methods.isEmpty) {
+          logWarning(s"cannot find matching method ${cls}.$methodName. "
+            + s"Candidates are:")
+          selectedMethods.foreach { method =>
+            logWarning(s"$methodName(${method.getParameterTypes.mkString(",")})")
+          }
+          throw new Exception(s"No matched method found for $cls.$methodName")
+        }
+        val ret = method

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27765001
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.HashMap
+
+import io.netty.channel.ChannelHandler.Sharable
+import io.netty.channel.{ChannelHandlerContext, SimpleChannelInboundHandler}
+
+import org.apache.spark.Logging
+import org.apache.spark.api.r.SerDe._
+
+/**
+ * Handler for RBackend
+ * TODO: This is marked as sharable to get a handle to RBackend. Is it safe to re-use
+ * this across connections ?
+ */
+@Sharable
+private[r] class RBackendHandler(server: RBackend)
+  extends SimpleChannelInboundHandler[Array[Byte]] with Logging {
+
+  override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]) {
--- End diff --

`: Unit = `




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27764985
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackend.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.{DataOutputStream, File, FileOutputStream, IOException}
+import java.net.{InetSocketAddress, ServerSocket}
+import java.util.concurrent.TimeUnit
+
+import io.netty.bootstrap.ServerBootstrap
+import io.netty.channel.{ChannelFuture, ChannelInitializer, EventLoopGroup}
+import io.netty.channel.nio.NioEventLoopGroup
+import io.netty.channel.socket.SocketChannel
+import io.netty.channel.socket.nio.NioServerSocketChannel
+import io.netty.handler.codec.LengthFieldBasedFrameDecoder
+import io.netty.handler.codec.bytes.{ByteArrayDecoder, ByteArrayEncoder}
+
+import org.apache.spark.Logging
+
+/**
+ * Netty-based backend server that is used to communicate between R and Java.
+ */
+private[spark] class RBackend {
+
+  private[this] var channelFuture: ChannelFuture = null
+  private[this] var bootstrap: ServerBootstrap = null
+  private[this] var bossGroup: EventLoopGroup = null
+
+  def init(): Int = {
+    bossGroup = new NioEventLoopGroup(2)
+    val workerGroup = bossGroup
+    val handler = new RBackendHandler(this)
+
+    bootstrap = new ServerBootstrap()
+      .group(bossGroup, workerGroup)
+      .channel(classOf[NioServerSocketChannel])
+
+    bootstrap.childHandler(new ChannelInitializer[SocketChannel]() {
+      def initChannel(ch: SocketChannel) = {
+        ch.pipeline()
+          .addLast("encoder", new ByteArrayEncoder())
+          .addLast("frameDecoder",
+            // maxFrameLength = 2G
+            // lengthFieldOffset = 0
+            // lengthFieldLength = 4
+            // lengthAdjustment = 0
+            // initialBytesToStrip = 4, i.e. strip out the length field itself
+            new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 4))
+          .addLast("decoder", new ByteArrayDecoder())
+          .addLast("handler", handler)
+      }
+    })
+
+    channelFuture = bootstrap.bind(new InetSocketAddress(0))
+    channelFuture.syncUninterruptibly()
+    channelFuture.channel().localAddress().asInstanceOf[InetSocketAddress].getPort()
+  }
+
+  def run(): Unit = {
+    channelFuture.channel.closeFuture().syncUninterruptibly()
+  }
+
+  def close(): Unit = {
+    if (channelFuture != null) {
+      // close is a local operation and should finish within milliseconds; timeout just to be safe
+      channelFuture.channel().close().awaitUninterruptibly(10, TimeUnit.SECONDS)
+      channelFuture = null
+    }
+    if (bootstrap != null && bootstrap.group() != null) {
+      bootstrap.group().shutdownGracefully()
+    }
+    if (bootstrap != null && bootstrap.childGroup() != null) {
+      bootstrap.childGroup().shutdownGracefully()
+    }
+    bootstrap = null
+  }
+
+}
+
+private[spark] object RBackend extends Logging {
+  def main(args: Array[String]) {
+    if (args.length < 1) {
+      System.err.println("Usage: RBackend <tempFilePath>")
+      System.exit(-1)
+    }
+    val sparkRBackend = new RBackend()
+    try {
+      // bind to random port
+      val boundPort = sparkRBackend.init()
+      val serverSocket = new ServerSocket(0, 1)
+      val listenPort = serverSocket.getLocalPort()
+
+      // tell the R process via temporary file
+      val path = args(0)
+      val f = new File(path + ".tmp")
+      val dos = new DataOutputStream(new FileOutputStream(f))
+      dos.writeInt(boundPort)
+      dos.writeInt(listenPort)
+      dos.close()
+      f.renameTo(new File(path))
+
+      // wait for the end of stdin, then exit
+      new Thread("wai
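
For context, a minimal sketch (not from this PR) of the wire format the
pipeline in `init()` implies: every message carries a 4-byte big-endian length
header, which the frame decoder strips before handing the payload to the
handler.

    import java.io.{ByteArrayOutputStream, DataOutputStream}

    def frame(payload: Array[Byte]): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val dos = new DataOutputStream(bos)
      dos.writeInt(payload.length)  // length field: 4 bytes, big-endian
      dos.write(payload)            // payload, as seen by ByteArrayDecoder
      dos.flush()
      bos.toByteArray
    }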

[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27764957
  
--- Diff: bin/sparkR2.cmd ---
@@ -0,0 +1,26 @@
+@echo off
+
+rem
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements.  See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License.  You may obtain a copy of the License at
+rem
+rem    http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+rem
+
+rem Figure out where the Spark framework is installed
+set SPARK_HOME=%~dp0..
+
+rem Load environment variables from conf\spark-env.cmd, if it exists
+if exist "%SPARK_HOME%\conf\spark-env.cmd" call "%SPARK_HOME%\conf\spark-env.cmd"
--- End diff --

This needs to be updated after #5328 goes in. SparkR on Windows will not 
work with these current changes.




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-89422690
  
I took a pass on this. I still haven't wrapped my head totally around the 
packaging, but here are a few comments:

- I’m not quite sure what all the scripts are for. Can we document or 
remove some of these? It would be helpful to explain the general packaging 
mechanisms somewhere. For instance the install script seems important, but I 
don’t get what “install” means in this context (vs “package”).
- We should file a blocker for 1.4 to integrate the docs into our doc 
build and our published API docs.
- All of the examples use the Spark core API, but I imagine one of the 
biggest use cases would be using Data frames. What about an example that uses 
the data frame API?
- Changes to Spark submit look good to me.





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27752993
  
--- Diff: R/README.md ---
@@ -0,0 +1,73 @@
+# R on Spark
+
+SparkR is an R package that provides a light-weight frontend to use Spark from R.
--- End diff --

Can you add a TODO comment here to merge this into the existing spark docs?




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27752135
  
--- Diff: R/create-docs.sh ---
@@ -0,0 +1,46 @@
+#!/bin/bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Script to create API docs for SparkR
+# This requires `devtools` and `knitr` to be installed on the machine.
+
+# After running this script the html docs can be found in 
+# $SPARK_HOME/R/pkg/html
+
+# Figure out where the script is
+export FWDIR="$(cd "`dirname "$0"`"; pwd)"
--- End diff --

Can this script be wired up so that our normal doc generation invokes it?




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27752106
  
--- Diff: R/SparkR_prep-0.1.sh ---
@@ -0,0 +1,52 @@
+#!/bin/sh
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Create and move to a new directory that can be easily cleaned up
+mkdir build_SparkR
--- End diff --

What is this script for?




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-87956612
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29444/
Test FAILed.




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-87927735
  
  [Test build #29444 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29444/consoleFull)
 for   PR 5096 at commit 
[`0e788c0`](https://github.com/apache/spark/commit/0e788c08f3b418acc05d4d27298b65de8b6f8407).




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-87881000
  
  [Test build #29425 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29425/consoleFull)
 for   PR 5096 at commit 
[`1d1802e`](https://github.com/apache/spark/commit/1d1802eb454ae3628ac22f38534b50c33345f2e1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-87881007
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29425/
Test PASSed.




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-87845362
  
  [Test build #29425 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29425/consoleFull)
 for   PR 5096 at commit 
[`1d1802e`](https://github.com/apache/spark/commit/1d1802eb454ae3628ac22f38534b50c33345f2e1).




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-27 Thread redbaron
Github user redbaron commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27294415
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -469,6 +469,9 @@ private[spark] class ApplicationMaster(
       System.setProperty("spark.submit.pyFiles",
         PythonRunner.formatPaths(args.pyFiles).mkString(","))
     }
+    if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
+      // TODO(davies): add R dependencies here
--- End diff --

Why is it restricting it to the .R extension anyway?




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-26 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-86755263
  
What are the major areas where this needs review?  Is it pretty much just 
the `spark-submit` changes?  This patch doesn't significantly modify any 
existing Spark code, so it seems like it would be safe to merge soon in order 
to unblock collaboration on new features / tests.

It looks like R is already set up on AMPLab Jenkins, so we don't need to do 
anything else for testing infra.




[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-03-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/5096#discussion_r27266139
  
--- Diff: R/README.md ---
@@ -0,0 +1,73 @@
+# R on Spark
+
+SparkR is an R package that provides a light-weight frontend to use Spark from R.
+
+### SparkR development
+
+#### Build Spark
+
+Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-PsparkR` profile to build the R package. For example to use the default Hadoop versions you can run
+```
+  build/mvn -DskipTests -Psparkr package
+```
+
+#### Running sparkR
+
+You can start using SparkR by launching the SparkR shell with
+
+    ./bin/sparkR
+
+The `sparkR` script automatically creates a SparkContext with Spark by default in
+local mode. To specify the Spark master of a cluster for the automatically created
+SparkContext, you can run
+
+    ./bin/sparkR --master "local[2]"
+
+To set other options like driver memory, executor memory etc. you can pass in the [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html) arguments to `./bin/sparkR`
+
+#### Using SparkR from RStudio
+
+If you wish to use SparkR from RStudio or other R frontends you will need to set some environment variables which point SparkR to your Spark installation. For example
+```
+# Set this to where Spark is installed
+Sys.setenv(SPARK_HOME="/Users/shivaram/spark")
+# This line loads SparkR from the installed directory
+.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
+library(SparkR)
+sc <- sparkR.init(master="local")
+```
+
+#### Making changes to SparkR
+
+The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
--- End diff --

Post-merge, we can update the wiki to include R-specific instructions.

