Spark provides a "lower-level" ML library called MLlib. MLI / MLBase is built on top of this and includes some high-level abstractions similar in nature to distributed matrices / dataframes. But it's still pretty new and rough at this point (https://github.com/amplab/MLI).

MLlib already provides (https://github.com/apache/incubator-spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib):

- regression / classification (log loss, SVM, squared loss with L1 / L2 regularization) via SGD
- soon will have decision trees / random forests
- clustering (K-Means)
- recommendations (Alternating Least Squares)
- SVD
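
To make the "regression / classification via SGD" item a bit more concrete, here is a rough single-machine sketch in plain Python of logistic regression (log loss) with L2 regularization trained by SGD. This is not MLlib's API or its implementation: the function names, the decaying step size, and the default parameters are all my own assumptions, purely for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_logistic(points, num_iters=200, step=0.5, reg=0.01, seed=0):
    """Fit weights by SGD on log loss plus an L2 penalty.

    points: list of (label, features) with label in {0, 1}.
    All parameter names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0][1])
    for t in range(1, num_iters + 1):
        label, x = rng.choice(points)      # one random example per step
        err = sigmoid(dot(w, x)) - label   # d(log loss) / d(margin)
        lr = step / math.sqrt(t)           # assumed decaying step size
        # Gradient of the loss term plus gradient of (reg / 2) * ||w||^2.
        w = [wi - lr * (err * xi + reg * wi) for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    return 1 if dot(w, x) > 0 else 0


if __name__ == "__main__":
    # First feature is a constant bias term; label is 1 when the second is positive.
    data = [(1, [1.0, 1.0]), (1, [1.0, 2.0]), (1, [1.0, 3.0]),
            (0, [1.0, -1.0]), (0, [1.0, -2.0]), (0, [1.0, -3.0])]
    weights = sgd_logistic(data)
    print(predict(weights, [1.0, 2.5]), predict(weights, [1.0, -2.5]))
```

Swapping the `err` line for a hinge-loss or squared-loss gradient gives the other two variants in the list; the distributed versions differ mainly in how the gradient is computed across partitions.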
In terms of implementations, IMO:

- MLlib's ALS is superior to Mahout's because it is block-distributed and so more efficient. Dmitriy has been working on some things that indicate a GraphX implementation may be even more efficient.
- A big downside on the Spark side currently is the lack of sparse support (hopefully coming soon in https://github.com/apache/incubator-spark/pull/575).
- K-means is probably a wash in terms of algorithm, though Spark will probably be faster due to caching, and maybe slightly better due to its init algorithm.
- Mahout has Streaming K-Means, which is neat and would be a cool addition to MLlib.
- Mahout has co-occurrence based recommender stuff which, though I've not used it, seems very good in practice and again should not be too crazy to port in principle.
- I think Mahout has a few more linear algebra implementations, though MLlib includes SVD.
- Mahout has the various integration layers, for recommender stuff in particular.
- Mahout has more in terms of featurizing (text, analyzers, hashing etc), though MLI provides some of this.
- Mahout has more in terms of analysis of model performance (like various evaluation metrics).
- Mahout has more in terms of things like analysis/summarizers and Ted's new t-digest (though with some monoid-ification this can be applied in Spark fairly trivially).

It would be really cool to see if a Spark backend for Mahout could be developed (I know Dmitriy has looked at this in respect of the DistributedMatrix stuff), or at least parts ported over to Spark. A very big potential pain point is if Spark doesn't adopt mahout-math (which seems to be the case at the moment, though it is undecided). Still, notwithstanding this, I feel a lot of stuff from Mahout can be adapted to Spark without necessarily needing a total overhaul.

My (admittedly heavily biased) view is that Spark is a superior platform overall for ML.
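
As an aside on the monoid-ification point in the list above, here is a minimal Python sketch of what I mean. It uses a much simpler summary than t-digest (count / mean / variance), and every name in it is hypothetical. The idea is only that a summarizer with an identity element and an associative merge can be computed per partition and then combined, which is the shape that Spark operations like `reduce` or `aggregate` on an RDD expect.

```python
from functools import reduce

# A summary is a tuple (count, mean, m2), where m2 is the sum of squared
# deviations from the mean. ZERO is the identity element of the monoid.
ZERO = (0, 0.0, 0.0)


def of(x):
    """Summary of a single value."""
    return (1, float(x), 0.0)


def merge(a, b):
    """Associative combine of two summaries (pairwise mean/variance update)."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    if n == 0:
        return ZERO
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return (n, mean, m2)


def summarize(values):
    """Fold a sequence of values into one summary."""
    return reduce(merge, (of(v) for v in values), ZERO)


def variance(summary):
    """Population variance from a summary."""
    n, _, m2 = summary
    return m2 / n if n else 0.0


if __name__ == "__main__":
    # Pretend each inner list is one partition of a distributed dataset.
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    partials = [summarize(p) for p in partitions]   # per-partition work
    total = reduce(merge, partials, ZERO)           # combine step
    print(total[0], total[1], variance(total))
```

A real port of something like t-digest would need its own merge operation with the same properties; this sketch only shows the pattern, not that algorithm.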
If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of experience of Mahout devs) I think a lot can be achieved!

N

On Tue, Feb 18, 2014 at 11:17 PM, Mohit Singh <[email protected]> wrote:

> In general, if you are interested in machine learning.. think there is
> already a machine learning specific initiative on spark called Mlbase (
> http://www.mlbase.org/)
> and graphx (http://amplab.github.io/graphx/) for graphlab style ml.
>
> On Tue, Feb 18, 2014 at 1:14 PM, Harshit Bapna <[email protected]> wrote:
>
> > I am very eager to know the same from the community.
> > Thanks for bringing it up.
> >
> > --Harshit
> >
> > On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao <[email protected]> wrote:
> >
> > > Just wonder what is the future of Mahout. We are seeing new stuff from
> > > 0xdata and skytree. And spark is also design for in-memory iterative
> > > analysis. What about mahout? Will mahout run on top of spark in future?
> > >
> > > Thanks,
> > > Ying Liao
> >
> > --
> > --Harshit
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates
