Re: Mahout on Spark?

Nick Pentreath Tue, 18 Feb 2014 13:59:23 -0800

Spark provides a "lower-level" ML library called MLlib. MLI / MLBase is
built on top of this and includes some high-level abstractions similar in
nature to distributed matrices / dataframes. But it's still pretty new and
rough at this point (https://github.com/amplab/MLI).
MLlib already provides (
https://github.com/apache/incubator-spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
):
- regression / classification (logistic loss, hinge loss for SVM, squared
loss, with L1 / L2 regularization) via SGD
- soon will have decision trees / random forests
- clustering (K-Means)
- recommendations (Alternating Least Squares)
- SVD
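
To make the "via SGD" item concrete, here is a minimal single-machine sketch
of stochastic gradient descent on L2-regularized log loss. It is plain Python
and not MLlib's API: the names sgd_logistic and predict, the step size, the
regularization strength and the toy data are all assumptions made for
illustration, and MLlib's actual distributed implementation is not shown.

```python
import math
import random


def sgd_logistic(points, labels, step=0.5, reg=0.01, epochs=200, seed=0):
    """Plain SGD on L2-regularized log loss; labels must be 0 or 1."""
    rng = random.Random(seed)
    weights = [0.0] * len(points[0])
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            margin = sum(w * x for w, x in zip(weights, points[i]))
            predicted = 1.0 / (1.0 + math.exp(-margin))
            error = predicted - labels[i]
            # Gradient of the log loss plus the L2 penalty term reg * w.
            weights = [
                w - step * (error * x + reg * w)
                for w, x in zip(weights, points[i])
            ]
    return weights


def predict(weights, point):
    """Classify as 1 when the linear margin is positive, else 0."""
    margin = sum(w * x for w, x in zip(weights, point))
    return 1 if margin > 0 else 0


# Toy data (hypothetical): first feature is a constant bias term.
points = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
weights = sgd_logistic(points, labels)
```

Swapping the gradient line for a hinge-loss or squared-loss gradient gives the
other two variants in the list; an L1 penalty would need a different update
than the simple reg * w term used here.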


In terms of implementations, IMO:
- MLlib's ALS is superior to Mahout's because it is block-distributed and so
more efficient. Dmitriy has been working on some things that indicate a
GraphX implementation may be even more efficient.
- A big downside of MLlib currently is the lack of sparse support (hopefully
coming soon in https://github.com/apache/incubator-spark/pull/575)
- K-means is probably a wash in terms of the algorithm, though Spark will
probably be faster due to caching, and its results may be slightly better
due to its init algorithm
- Mahout has the Streaming K-Means which is neat, and would be a cool
addition to MLlib
- Mahout has co-occurrence based recommender stuff which, though I've not
used it, seems very good in practice and again should not be too crazy to
port in principle
- I think Mahout has a few more linear algebra implementations, though
MLlib includes SVD
- Mahout has the various integration layers for recommender stuff in
particular
- Mahout has more in terms of featurizing (text, analyzers, hashing etc) -
though MLI provides some of this
- Mahout has more in terms of analysis of model performance (like various
evaluation metrics)
- Mahout has more in terms of things like analysis/summarizers and Ted's
new t-digest (though with some monoid-ification this can be applied in
Spark fairly trivially)
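
As a rough illustration of what "monoid-ification" means in the last point: a
summary becomes easy to use in Spark once it has an empty value and an
associative merge, because per-partition summaries can then be combined in
any grouping. The sketch below is plain Python and uses a simple
count / mean / variance summary as a stand-in; it is not t-digest and not
Mahout or Spark code, and the names Summary and summarize are hypothetical.

```python
from functools import reduce


class Summary:
    """Mergeable count / mean / variance summary (a monoid under merge)."""

    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count = count  # number of values seen
        self.mean = mean    # running mean
        self.m2 = m2        # sum of squared deviations from the mean

    @staticmethod
    def of(value):
        """Summary of a single value."""
        return Summary(1, float(value), 0.0)

    def merge(self, other):
        """Associative combine; Summary() is the identity element."""
        if self.count == 0:
            return other
        if other.count == 0:
            return self
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return Summary(count, mean, m2)

    def variance(self):
        """Population variance of everything merged so far."""
        return self.m2 / self.count if self.count else 0.0


def summarize(values):
    """Fold a sequence of values into one Summary, starting from the identity."""
    return reduce(Summary.merge, (Summary.of(v) for v in values), Summary())


# Hypothetical "partitions": summarize each one locally, then merge the results.
partitions = [[1, 2, 3], [4, 5], [6]]
total = reduce(Summary.merge, (summarize(p) for p in partitions), Summary())
whole = summarize([v for p in partitions for v in p])
```

Because merge is associative and has an identity, the same two-level fold
(within a partition, then across partitions) is the shape a distributed
reduce or aggregate step needs.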

It would be really cool to see whether a Spark backend for Mahout could be
developed (I know Dmitriy has looked at this with respect to the
DistributedMatrix stuff), or at least whether parts could be ported over to
Spark. A very big potential pain point is if Spark doesn't adopt mahout-math
(which seems to be the case at the moment, though it is still undecided).
Notwithstanding this, I feel a lot of stuff from Mahout can be adapted to
Spark without necessarily needing a total overhaul.

My (admittedly heavily biased) view is that Spark is a superior platform
overall for ML. If the two communities can work together to leverage the
strengths of Spark and the large amount of good stuff in Mahout (as well as
the fantastic depth of experience of the Mahout devs), I think a lot can be
achieved!

N


On Tue, Feb 18, 2014 at 11:17 PM, Mohit Singh <[email protected]> wrote:

> In general, if you are interested in machine learning, I think there is
> already a machine-learning-specific initiative on Spark called MLbase (
> http://www.mlbase.org/)
> and GraphX (http://amplab.github.io/graphx/) for GraphLab-style ML.
>
>
>
>
>
> On Tue, Feb 18, 2014 at 1:14 PM, Harshit Bapna <[email protected]> wrote:
>
> > I am very eager to know the same from the community.
> > Thanks for bringing it up.
> >
> > --Harshit
> >
> >
> > On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao <[email protected]> wrote:
> >
> > > Just wondering what the future of Mahout is. We are seeing new stuff
> > > from 0xdata and Skytree. And Spark is also designed for in-memory
> > > iterative analysis. What about Mahout? Will Mahout run on top of Spark
> > > in future?
> > >
> > > Thanks,
> > > Ying Liao
> > >
> >
> >
> >
> > --
> > --Harshit
> >
>
>
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates
>
