Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-13 Thread Debasish Das
Sorry, I just saw Graham's email after sending my previous email about this bug... I have been seeing this same issue on our ALS runs last week, but I thought it was due to my hacky way of running the MLlib 1.1 snapshot on core 1.0... What's the status of this PR? Will this fix be back-ported to 1.0.1 as we

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
@Ignacio, happy to share, here's a link to a library we've been developing (https://github.com/freeman-lab/thunder). As just a couple of examples, we have pipelines that use Fourier transforms and other signal processing from scipy, and others that do massively parallel model fitting via Scikit lea

Kryo serialization issues

2014-08-13 Thread Debasish Das
Hi, is there a JIRA for this bug? I have seen it multiple times during our ALS runs now... some runs don't show it, while others fail with the error described at https://github.com/GrahamDennis/spark-kryo-serialisation/blob/master/README.md One way to circumvent this is to not use Kryo, but then I am no

proposal for pluggable block transfer interface

2014-08-13 Thread Reynold Xin
Hi devs, I posted a design doc proposing an interface for pluggable block transfer (used in shuffle, broadcast, block replication, etc). This is expected to be done in the 1.2 time frame. It should make our code base cleaner, and enable us to provide alternative implementations of block transfers (e.

acquire and give back resources dynamically

2014-08-13 Thread 牛兆捷
Dear all: Can Spark acquire resources from YARN and give them back dynamically? -- *Regards,* *Zhaojie*

Re: Need info on Spark's Communication/Networking layer...

2014-08-13 Thread Rajiv Abraham
Hi Aniket, Perhaps this video will help: https://www.youtube.com/watch?v=HG2Yd-3r4-M&list=PLTPXxbhUt-YWGNTaDj6HSjnHMxiTD1HCR&index=1 You can see other up-to-date videos and slides here: http://spark-summit.org/2014/training Best regards, Rajiv 2014-08-13 19:36 GMT-04:00 aniketadnaik : > Hi,

Need info on Spark's Communication/Networking layer...

2014-08-13 Thread aniketadnaik
Hi, I am new to Spark and want to explore Spark's master-worker/cluster manager communication architecture. Any documents or code pointers would be helpful to start with. Thanks! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Need-info-on-Spa

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu wrote: > On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas > wrote: >> Yep, I thought it was a bogus comparison. >> >> I should rephrase my question as it was poorly phrased: on average, how >> much faster is Spark v. PySpark (I didn't really mean Scala

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-13 Thread Graham Dennis
I now have a complete pull request for this issue that I'd like to get reviewed and committed. The PR is available here: https://github.com/apache/spark/pull/1890 and includes a test case for the issue I described. I've also submitted a related PR (https://github.com/apache/spark/pull/1827) that

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas wrote: > Yep, I thought it was a bogus comparison. > > I should rephrase my question as it was poorly phrased: on average, how > much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? > I've only used Spark and don't have a chance

Re: Added support for :cp to the Spark Shell

2014-08-13 Thread Reynold Xin
I haven't read the code yet, but if it is what I think it is, this is SUPER, UBER, HUGELY useful. On a related note, I asked about this on the Scala dev list but never got a satisfactory answer: https://groups.google.com/forum/#!msg/scala-internals/_cZ1pK7q6cU/xyBQA0DdcYwJ On Wed, Aug 13, 20

Added support for :cp to the Spark Shell

2014-08-13 Thread Robert C Senkbeil
I've created a new pull request, which can be found at https://github.com/apache/spark/pull/1929. Since Spark is using Scala 2.10.3 and there is a known issue with Scala 2.10.x not supporting the :cp command (https://issues.scala-lang.org/browse/SI-6502), the Spark shell does not have the ability

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Shivaram Venkataraman
Yeah, I worked on DistributedR while I was an intern at HP Labs, but it has evolved a lot since then. I don't think it's a direct comparison, as DistributedR is a pure R implementation in a distributed setting while SparkR is a wrapper around the Scala / Java implementations in Spark. That said, it w

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin wrote: > Actually I believe the same person started both projects. > > The Distributed R project

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
Actually, I believe the same person started both projects. The Distributed R project from HP was started by Shivaram Venkataraman when he was there. He has since moved to Berkeley AMPLab to pursue a PhD, and SparkR is his latest project. On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas < nicholas.c

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Ignacio Zendejas
Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark and don't have a chance to test this at the moment so if anybody has these numbers or gener

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Nicholas Chammas
On a related note, I recently heard about Distributed R, which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale. It would be interesting to see some kind of comparison between that and MLlib (and perhaps also Spar

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations, or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems at first glance that some of the implementatio

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt if y

A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Ignacio Zendejas
Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all? As with any and all benchmarks, I'm sure there are caveats, but

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-13 Thread Yu Ishikawa
Hi all, I am also interested in specifying a common framework. I am trying to implement hierarchical k-means and a hierarchical clustering method, like single-link, with LSH. https://issues.apache.org/jira/browse/SPARK-2966 If you have designed the standardized clustering algorithms API, plea