Spark assembly for YARN/CDH5

2014-10-16 Thread Philip Ogren
Does anyone know if Spark assemblies are created and available for download that have been built for CDH5 and YARN? Thanks, Philip - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands,

Re: creating a distributed index

2014-08-04 Thread Philip Ogren
`(myquery)) I'm sure it won't take much imagination to figure out how to do the matching in a batch way. If anyone has done anything along these lines I'd love to have some feedback. Thanks, Philip On 08/04/2014 09:46 AM, Philip Ogren wrote: This looks like a really cool feature and it seems

relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Philip Ogren
It is really nice that Spark RDD's provide functions that are often equivalent to functions found in Scala collections. For example, I can call: myArray.map(myFx) and equivalently myRdd.map(myFx) Awesome! My question is this. Is it possible to write code that works on either an RDD or
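One possible answer to the question in this snippet (a sketch of my own, not from the thread): since RDD and the Scala collections share no useful common supertype, wrap both behind a small sealed interface. The names below are illustrative.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// A sketch: hide "local Seq" vs "distributed RDD" behind one interface,
// so the same calling code can map over either.
sealed trait DataSet[A] {
  def mapAll[B: ClassTag](f: A => B): DataSet[B]
}

case class LocalData[A](xs: Seq[A]) extends DataSet[A] {
  def mapAll[B: ClassTag](f: A => B): DataSet[B] = LocalData(xs.map(f))
}

case class RddData[A](rdd: RDD[A]) extends DataSet[A] {
  // RDD.map needs a ClassTag for the result type, hence the context bound above.
  def mapAll[B: ClassTag](f: A => B): DataSet[B] = RddData(rdd.map(f))
}
```

The follow-up message in this digest reaches the same conclusion: without a common supertype (or a wrapper like this), implicit conversions alone don't solve it.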

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Philip Ogren
-parameter-forwarding-possible-in-scala I'm not seeing a way to utilize implicit conversions in this case. Since Scala is statically (albeit inferred) typed, I don't see a way around having a common supertype. On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote: It is really

Re: Announcing Spark 1.0.1

2014-07-14 Thread Philip Ogren
Hi Patrick, This is great news but I nearly missed the announcement because it had scrolled off the folder view that I have Spark users list messages go to. 40+ new threads since you sent the email out on Friday evening. You might consider having someone on your team create a

Multiple SparkContexts with different configurations in same JVM

2014-07-10 Thread Philip Ogren
In various previous versions of Spark (and I believe the current version, 1.0.0, as well) we have noticed that it does not seem possible to have a local SparkContext and a SparkContext connected to a cluster via either a Spark Cluster (i.e. using the Spark resource manager) or a YARN cluster.
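A common workaround for the limitation described here (a sketch under the assumption that sequential access is acceptable; the master URL is a placeholder): run the contexts one at a time in the same JVM, stopping each before creating the next.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SequentialContexts {
  def main(args: Array[String]): Unit = {
    // First: a local context for small/test work.
    val local = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("local-job"))
    try {
      // ... local work ...
    } finally local.stop() // must fully stop before the next context starts

    // Then: a context against a cluster (placeholder master URL).
    val cluster = new SparkContext(
      new SparkConf().setMaster("spark://host:7077").setAppName("cluster-job"))
    try {
      // ... cluster work ...
    } finally cluster.stop()
  }
}
```

This does not give the concurrent multi-context behavior the message is asking about; it only sidesteps the restriction.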

Re: Unit test failure: Address already in use

2014-06-18 Thread Philip Ogren
In my unit tests I have a base class that all my tests extend that has a setup and teardown method that they inherit. They look something like this: var spark: SparkContext = _ @Before def setUp() { Thread.sleep(100L) //this seems to give spark more time to reset from the
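A fuller version of the setup/teardown pattern this snippet describes (illustrative, reconstructed around the fragment shown; the sleep and property-clearing are the parts aimed at the "Address already in use" failures):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.junit.{After, Before}

// Base class the tests extend, as described above.
abstract class SparkTestBase {
  var spark: SparkContext = _

  @Before def setUp(): Unit = {
    Thread.sleep(100L) // this seems to give Spark more time to reset from the previous test
    spark = new SparkContext(
      new SparkConf().setMaster("local").setAppName("unit-test"))
  }

  @After def tearDown(): Unit = {
    if (spark != null) spark.stop()
    // Clearing the driver port helps avoid bind failures across back-to-back tests.
    System.clearProperty("spark.driver.port")
  }
}
```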

Re: Processing audio/video/images

2014-06-02 Thread Philip Ogren
I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html On 06/02/2014 06:09 PM, Marcelo Vanzin wrote: Hi Jamal, If what you want is to process lots of files in parallel, the

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Philip Ogren
Hi Pierre, I asked a similar question on this list about 6 weeks ago. Here is one answer http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E I got that is of particular note: In the upcoming release of
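For context, a minimal listener along the lines being discussed (a sketch of my own, not code from the linked answer): estimate progress by counting finished tasks against tasks submitted per stage.

```scala
import org.apache.spark.scheduler.{
  SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Rough progress estimate: tasks finished / tasks submitted so far.
class ProgressListener extends SparkListener {
  @volatile private var totalTasks = 0
  @volatile private var finishedTasks = 0

  override def onStageSubmitted(stage: SparkListenerStageSubmitted): Unit =
    totalTasks += stage.stageInfo.numTasks

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    finishedTasks += 1
    if (totalTasks > 0)
      println(f"progress: ${100.0 * finishedTasks / totalTasks}%.1f%%")
  }
}

// Registered on the driver with: sc.addSparkListener(new ProgressListener)
```

Note the caveat raised elsewhere in this digest: the listener API was semi-internal at the time and subject to change.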

Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster
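The point made here can be demonstrated with a small example (illustrative names): closing over a non-serializable object fails even with a `local` master, because Spark serializes task closures regardless of where they will run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Deliberately not Serializable.
class NotSerializableThing { def f(x: Int): Int = x + 1 }

object ClosureCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("closure-check"))
    val thing = new NotSerializableThing
    try {
      // The closure captures `thing`, so this throws
      // org.apache.spark.SparkException: Task not serializable
      // even in local mode.
      sc.parallelize(1 to 10).map(thing.f).collect()
    } finally sc.stop()
  }
}
```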

Re: Opinions stratosphere

2014-05-02 Thread Philip Ogren
Great reference! I just skimmed through the results without reading much of the methodology - but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2GB to 8GB. Who knows if the apparent pattern would extend out

RDD.tail()

2014-04-14 Thread Philip Ogren
Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(int) function that would allow you to skip over several lines of header information. Our attempts to
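In the absence of a built-in tail() or drop(n), one common workaround (a sketch, not from the thread) is to drop lines only in the first partition, which holds the start of the file when reading with textFile:

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Skip the first n elements of an RDD by dropping them from partition 0 only.
// Assumes the header lines all land in the first partition (true for the
// beginning of a file read via sc.textFile).
def dropLines[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(n) else iter
  }
```

Usage would look like `dropLines(sc.textFile("data.csv"), 1)` to skip a single header row.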

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42. Andrew On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
directly - I think it's been factored nicely so it's fairly decoupled from the UI. The concern is this is a semi-internal piece of functionality and something we might, e.g. want to change the API of over time. - Patrick On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Philip Ogren
to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings