Re: Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nick Pentreath
Hi What KeyClass and ValueClass are you trying to save as the keys/values of your dataset? On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu wrote: > Hi, all > > I found the weird thing on saveAsNewAPIHadoopFile in > PairRDDFunctions.scala when working on the other issue, > > saveAsNewAPIHadoopFile

Re: Spark 0.8.1 on Amazon Elastic MapReduce

2014-02-14 Thread Nick Pentreath
Thanks Parviz, this looks great and good to see it getting updated. Look forward to 0.9.0! A perhaps stupid question - where does the KinesisWordCount example live? Is that an Amazon example, since I don't see it under the streaming examples included in the Spark project. If it's a third party exa

Re: [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread Nick Pentreath
@fommil @mengxr I think it's always worth a shot at a license change. Scikit learn devs have been successful before in getting such things over the line. Assuming we can make that happen, what do folks think about MTJ vs Breeze vs JBLAS + commons-math since these seem like the viable alternativ

Re: [VOTE] Graduation of Apache Spark from the Incubator

2014-02-11 Thread Nick Pentreath
>> >> related to fast and flexible large-scale data analysis > > >> >> on clusters; and be it further RESOLVED, that the office > > >> >> of "Vice President, Apache Spark" be and hereby is created, > > >> >> the person holding such

Fwd: Represent your project at ApacheCon

2014-01-27 Thread Nick Pentreath
Is Spark active in submitting anything for this? -- Forwarded message -- From: Rich Bowen Date: Mon, Jan 27, 2014 at 4:20 PM Subject: Represent your project at ApacheCon To: committ...@apache.org Folks, 5 days from the end of the CFP, we have only 50 talks submitted. We need t

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-01-26 Thread Nick Pentreath
If you want to spend the time running 50 iterations, you're better off re-running 5x10 iterations with different random start to get a better local minimum...— Sent from Mailbox for iPhone On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia wrote: > I looked into this after I opened that JIRA and i

Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

2014-01-26 Thread Nick Pentreath
Agree that it should be fixed if possible. But why run ALS for 50 iterations? It tends to pretty much converge (to within 0.001 or so RMSE) after 5-10 and even 20 is probably overkill. On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia wrote: > I looked into this afte

Re: [DISCUSS] Graduating as a TLP

2014-01-23 Thread Nick Pentreath
+1 fantastic news. On Fri, Jan 24, 2014 at 6:43 AM, Mridul Muralidharan wrote: > Great news ! > +1 > Regards, > Mridul > On Fri, Jan 24, 2014 at 4:15 AM, Matei Zaharia > wrote: >> Hi folks, >> >> We’ve been working on the transition to Apache for a while, and our

Re: Option folding idiom

2013-12-26 Thread Nick Pentreath
+1 for getOrElse When I was new to Scala I tended to use match almost like if/else statements with Option. These days I try to use map/flatMap instead and use getOrElse extensively and I for one find it very intuitive. I also agree that the fold syntax seems way less intuitive and I certain
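For readers comparing the three styles mentioned in this message, here is a minimal sketch of the same lookup written with `match`, with `map`/`getOrElse`, and with `fold`. The object name, the method names, and the default port 8080 are hypothetical and chosen only for illustration; the only real API used is the standard `Option` type (`fold` on `Option` assumes Scala 2.10 or later).

```scala
// Illustrative only: the helper names and the default value are assumptions,
// not code from the thread.
object OptionIdioms {
  // Pattern match: explicit, but reads much like an if/else.
  def portWithMatch(raw: Option[String]): Int = raw match {
    case Some(s) => s.toInt
    case None    => 8080
  }

  // map + getOrElse: transform the value if present, then supply the default.
  def portWithGetOrElse(raw: Option[String]): Int =
    raw.map(_.toInt).getOrElse(8080)

  // fold: the default is given first and the transformation second.
  def portWithFold(raw: Option[String]): Int =
    raw.fold(8080)(_.toInt)

  def main(args: Array[String]): Unit = {
    val inputs: Seq[Option[String]] = Seq(Some("9000"), None)
    for (in <- inputs) {
      println(s"$in: match=${portWithMatch(in)}, " +
        s"getOrElse=${portWithGetOrElse(in)}, fold=${portWithFold(in)}")
    }
  }
}
```

All three methods return the same result for the same input. The difference is only in reading order, which may be why `fold`, with the default placed before the transformation, can feel less intuitive than `map` followed by `getOrElse`.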

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
>>>> directory. I would love it if Spark used the Tanuki Service Wrapper, >>>> which >>>>> is widely-used for Java service daemons, supports retries, >> installation >>>> as >>>>> init scripts that can be chkconfig'd, etc.

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Nick Pentreath
One option that is 3rd party that works nicely for the Hadoop project and its related projects is http://search-hadoop.com - managed by sematext. Perhaps we can plead with Otis to add Spark lists to search-spark.com, or the existing site? Just throwing it out there as a potential solution to a

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
esome if you could use either the hostname or the FQDN or the > IP address in the Spark URL and not have Akka barf at you. > I've been telling myself I'd look into these at some point but just haven't > gotten around to them myself yet. Some day! I would prioritize

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-12-19 Thread Nick Pentreath
still needs a bit of clean up work, and I need to add the concept of "wrapper functions" to deserialize classes that MsgPack can't handle out of the box. N On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath wrote: > Wow Josh, that looks great. I

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Or, if you're extremely ambitious, work on implementing Spark Streaming in Python. On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia wrote: > Hi Matt, > If you want to get started looking at Spark, I recommend the following > resources: > - Our issue tracker at http://sp

Re: Intellij IDEA build issues

2013-12-16 Thread Nick Pentreath
? ie you no longer need to run gen-idea. > > > On Sat, Dec 7, 2013 at 4:15 AM, Nick Pentreath >wrote: > > > Hi Spark Devs, > > > > Hoping someone can help me out. No matter what I do, I cannot get > Intellij > > to build Spark from source. I am using IDEA

Re: Scala 2.10 Merge

2013-12-14 Thread Nick Pentreath
Whoohoo! Great job everyone especially Prashant! On Sat, Dec 14, 2013 at 10:59 AM, Patrick Wendell wrote: > Alright I just merged this in - so Spark is officially "Scala 2.10" > from here forward. > For reference I cut a new branch called scala-2.9 with the commi

Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)

2013-12-11 Thread Nick Pentreath
- Successfully built via sbt/sbt assembly/assembly on Mac OS X, as well as on a dev Ubuntu EC2 box - Successfully tested via sbt/sbt test locally - Successfully built and tested using mvn package locally - I've tested my own Spark jobs (built against 0.8.0-incubating) on this RC a

Intellij IDEA build issues

2013-12-07 Thread Nick Pentreath
Hi Spark Devs, Hoping someone can help me out. No matter what I do, I cannot get Intellij to build Spark from source. I am using IDEA 13. I run sbt gen-idea and everything seems to work fine. When I try to build using IDEA, everything compiles but I get the error below. Have any of you come acr

PySpark - Dill serialization

2013-12-05 Thread Nick Pentreath
Hi devs I came across Dill ( http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python serialization. Was wondering if it may be a replacement to the cloudpickle stuff (and remove that piece of code that needs to be maintained within PySpark)? Josh have you looked into Dill? Any th

PySpark / scikit-learn integration sprint at Cloudera - Strata Conference Friday 14th Feb 2014

2013-12-02 Thread Nick Pentreath
Hi Spark Devs An idea developed recently out of a scikit-learn mailing list discussion ( http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5HGKYH9Myp7imrJ-nU%3DpJgeGqcCn3JC0m4MmGWZi35Hw%40mail.gmail.com&forum_name=scikit-learn-general) to have a coding sprint around Strata in Feb, fo

Re: [Scikit-learn-general] Spark-backed implementations of scikit-learn estimators

2013-11-26 Thread Nick Pentreath
CC'ing Spark Dev list I have been thinking about this for quite a while and would really love to see this happen. Most of my pipeline ends up in Scala/Spark these days - which I love, but it is partly because I am reliant on custom Hadoop input formats that are just way easier to use from Scala/J

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-11-08 Thread Nick Pentreath
serializers based on each stage's > > input and output formats ( > > > https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42 > > ). > > > > At some point, I'd like to port my custom serializers

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-10-30 Thread Nick Pentreath
> the need for a delimiter by creating a PythonRDD from the newHadoopFile > > JavaPairRDD and adding a new method to writeAsPickle ( > > > > > https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L224 > >

Julia bindings

2013-10-24 Thread Nick Pentreath
Hi Spark Devs If you could pick one language binding to add to Spark what would it be? Probably Clojure or JRuby if JVM is of interest. I'm quite excited about Julia as a language for scientific computing ( http://julialang.org). The Julia community have been very focused on things like interop w

[PySpark]: reading arbitrary Hadoop InputFormats

2013-10-24 Thread Nick Pentreath
Hi Spark Devs I was wondering what appetite there may be to add the ability for PySpark users to create RDDs from (somewhat) arbitrary Hadoop InputFormats. In my data pipeline for example, I'm currently just using Scala (partly because I love it but also because I am heavily reliant on quite cust

Re: Propose to Re-organize the scripts and configurations

2013-09-16 Thread Nick Pentreath
There was another discussion on the old dev list about this: https://groups.google.com/forum/#!msg/spark-developers/GL2_DwAeh5s/9rwQ3iDa2t4J I tend to agree with having configuration sitting in JSON (or properties files) and using the Typesafe Config library which can parse both. Something I've u

Re: MLI dependency exception

2013-09-11 Thread Nick Pentreath
Is MLI available? Where is the repo located? On Tue, Sep 10, 2013 at 10:45 PM, Gowtham N wrote: > It worked. > I was using old master for spark, which I forked many days ago. > On Tue, Sep 10, 2013 at 1:25 PM, Shivaram Venkataraman < > shiva...@eecs.berkeley

Re: Adding support for implicit feedback to ALS

2013-09-09 Thread Nick Pentreath
for sufficiently large /skewed datasets. I > guess I am interested in GraphX release to replace reliance on Bagel. > 5. if the task reformulation is accepted, there are further > optimizations that could be applied to blocking -- but this > implementation gets the gist of it what i did in

Adding support for implicit feedback to ALS

2013-09-08 Thread Nick Pentreath
Hi I know everyone's pretty busy with getting 0.8.0 out, but as and when folks have time it would be great to get your feedback on this PR adding support for the 'implicit feedback' model variant to ALS: https://github.com/apache/incubator-spark/pull/4 In particular any potential efficiency impro

Apache account

2013-09-05 Thread Nick Pentreath
Hi I submitted my license agreement and account name request a while back, but still haven't received any correspondence. Just wondering what I need to do in order to follow this up? Thanks Nick

Scikit-learn API paper

2013-08-16 Thread Nick Pentreath
Quite interesting, and timely given current thinking around MLlib and MLI http://orbi.ulg.ac.be/bitstream/2268/154357/1/paper.pdf I do really like the way they have approached their API - and so far MLlib seems to be following a (roughly) similar approach. Interesting in particular they obviousl

Re: Machine Learning on Spark [long rambling discussion email]

2013-07-25 Thread Nick Pentreath
used in practice, and it >> would be great to add them to the MLI library (and perhaps also MLlib). >> >> -Ameet >> >> >> On Thu, Jul 25, 2013 at 6:44 AM, Nick Pentreath >> wrote: >> >>> Hi >>> >>> Ok, that all makes sense

Re: Machine Learning on Spark [long rambling discussion email]

2013-07-25 Thread Nick Pentreath
ibuting > to it. MLI is a private repository right now, but we'll make it public > soon though, and Evan Sparks or I will let you know when we do so. > > Thanks again for getting in touch with us! > > -Ameet > > > On Wed, Jul 24, 2013 at 11:47 AM, Rey

Machine Learning on Spark [long rambling discussion email]

2013-07-24 Thread Nick Pentreath
Hi dev team (Apologies for a long email!) Firstly great news about the inclusion of MLlib into the Spark project! I've been working on a concept and some code for a machine learning library on Spark, and so of course there is a lot of overlap between MLlib and what I've been doing. I wanted to