For the record, this is all a false dilemma (at least w.r.t. Spark vs. the Mahout Spark bindings).
The Spark bindings have never been conceived as one vs. the other. The Mahout Scala bindings are an add-on on top of Spark that just happens to rely on some things in mahout-math. With Spark one gets some major things, namely RDDs, MLlib, Spark SQL, and GraphX. Guess what: with the Spark bindings one still gets all of those wonderful things, plus the bindings and the bindings shell. Most of the added value in the Spark bindings is the R-like notation for algebra and the distributed algebraic optimizer. Of course there are also all those wonderful distributed decompositions and PCA things, naive Bayes, and I think some of the co-occurrence stuff too. (The implicit ALS work for Spark was sadly never committed; it is available on a PR branch only.) Internally, my company has built several times more methodology code on the Spark bindings than the Spark bindings have on their own.

The Spark bindings are also 100% Scala. The only thing that is non-Scala (at runtime) is the in-memory Colt-derived matrix model, which is adapted to the R-like DSL by the Scala bindings. Oh well, can't have it all.

Bottom line: for the most part, I feel you are building a straw-man argument here. You are presenting the problem as a constrained choice with inevitable loss, whereas there has never been a loss of choice. Even just for the sake of the algebraic decompositions and optimizations, I feel there is significant added value. (Again, this is only relevant to the bindings stuff, not the 0.9 MR stuff, all of which is now deprecated.)

The only two problems I see are: (1) Mahout takes in too many legacy dependencies that are hard to sort through if one is using it strictly in Spark-based apps; there are too many things to sort through and throw away in that tree. I actually use an opt-in approach (that is, I remove all transitive dependencies by default and add them back one by one only where there is an actual runtime dependency). This is something that could, and should, be improved incrementally.
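To make the "R-like notation plus optimizer" point concrete, here is a minimal sketch of what algebra on the Spark bindings looks like. This is illustrative only, not runnable standalone: it assumes a Mahout 0.10+ snapshot on the classpath plus a live Spark context, and the HDFS path and the rank k=40 are made-up example values.

```
// Sketch only -- assumes mahout-spark + mahout-math on the classpath
// and a reachable Spark master; the path and k below are hypothetical.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.decompositions._
import org.apache.mahout.sparkbindings._

implicit val ctx: DistributedContext =
  mahoutSparkContext(masterUrl = "local", appName = "dsl-sketch")

// A distributed row matrix (DRM) backed by an RDD under the hood.
val drmA = drmDfsRead(path = "hdfs://nameNode/tmp/A.drm")

// R-like algebra: the expression builds a logical plan, and the
// algebraic optimizer rewrites A' %*% A into a single self-squaring
// physical operator rather than a literal transpose-then-multiply.
val drmAtA = drmA.t %*% drmA

// The distributed decompositions ride on the same DSL, e.g. thin SVD:
val (drmU, drmV, s) = dssvd(drmA, k = 40)
```

The point of the sketch is that nothing here hides Spark: `drmA` still wraps an RDD one can drop down to, so MLlib, Spark SQL, and GraphX remain fully available alongside the DSL.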
(2) The second design problem is that Mahout may be a bit of a problem to use alongside other on-top-of-Spark systems, because it takes over some things in Spark (e.g., it requires everything to work with Kryo serialization). But this is more a limitation of Spark itself.

Speaking of the "survival" and "popularity" concerns, which are very valid in themselves, I think the major problem with Mahout is none of these alleged "vs." things. Strictly IMO, it is that, being an ML project, unlike all those other wonderful things it is not widely backed by any major university or academic community. It never has been, and at this point it would seem it never will be. As such, unlike with some other projects, there is no perpetual source of ambitious researchers to contribute. And the original founders long since posted their last significant contribution.

On Wed, Oct 22, 2014 at 9:20 AM, Mahesh Balija <[email protected]> wrote:

> Hi Team,
>
> Thanks for your replies. Even if you consider the strong implementations
> of recommendations and SVD in Mahout, I would still say that even Spark
> 1.1.0 has support for collaborative filtering (alternating least squares
> (ALS)) and, under dimensionality reduction, SVD and PCA. With fast-paced
> contributions, I believe Spark may NOT be far from having new and stable
> algorithms added to it (like ANN, HMM, etc., and support for scientific
> libraries).
>
> Ted, even though the Mahout (1.0) development code base supports Scala
> and Spark bindings externally, Spark has built-in support for Scala (as
> it is developed in Scala). And NumPy is a Python-based scientific library
> which needs to be used for the Python-based MLlib support in Spark. The
> benefit is that Python is also supported in Spark for Python users.
>
> A major uniqueness of Mahout is that, as Mahout is inherited from Lucene,
> it has built-in support for text processing.
> Of course I do NOT believe it is a strong point, as I assume that
> developers who know Lucene can easily use it with Spark through the Java
> interface.
>
> Mahout has currently stopped support for Hadoop (i.e., for further
> libraries); on the other hand, Spark can easily re-use data present in
> Hadoop/HBase (maybe NOT the MapReduce functionality, as Spark has its
> own computation layer).
>
> *As a long-time user of Mahout I strongly support Mahout (despite its
> poor visualization capabilities). At the same time, I am trying to
> understand: if Spark continues to evolve its MLlib package, with its
> support for in-memory computation, rich scientific libraries through
> Scala, and support for languages like Java/Scala/Python, will the
> survival of Mahout be questionable?*
>
> Best!
> Mahesh Balija.
>
>
> On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <[email protected]> wrote:
>
> > I know we lost the maintainer for fpgrowth somewhere along the line,
> > but it's definitely something I'd love to see carried forward, too.
> >
> > Sent from my iPhone
> >
> > > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <[email protected]> wrote:
> > >
> > > Sing it, brother! I miss FP-Growth as well. Once the Scala bindings
> > > are in, I'm hoping to work up some time series methods.
> > >
> > >> On Oct 21, 2014, at 8:00 PM, Lee S <[email protected]> wrote:
> > >>
> > >> As a developer facing the library choice between Mahout and MLlib,
> > >> I have some thoughts below.
> > >> Mahout has no decision tree algorithm, while MLlib has the
> > >> components for constructing one, such as Gini index and information
> > >> gain. I also think Mahout could add algorithms for frequent pattern
> > >> mining, which is very important in feature selection and statistical
> > >> analysis; MLlib has no frequent pattern mining algorithms.
> > >> P.S. Why was the fpgrowth algorithm removed in version 0.9?
> > >>
> > >> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <[email protected]>:
> > >>
> > >>> Actually, Spark is available in Python also, so users of Spark
> > >>> have an upper hand over traditional users of Mahout. This applies
> > >>> to all the libraries of Python (including NumPy).
> > >>>
> > >>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <[email protected]> wrote:
> > >>>
> > >>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <[email protected]> wrote:
> > >>>>
> > >>>>> I am trying to differentiate between Mahout and Spark. Here is
> > >>>>> a small list:
> > >>>>>
> > >>>>> Feature                    Mahout   Spark
> > >>>>> Clustering                 Y        Y
> > >>>>> Classification             Y        Y
> > >>>>> Regression                 Y        Y
> > >>>>> Dimensionality Reduction   Y        Y
> > >>>>> Java                       Y        Y
> > >>>>> Scala                      N        Y
> > >>>>> Python                     N        Y
> > >>>>> NumPy                      N        Y
> > >>>>> Hadoop                     Y        Y
> > >>>>> Text Mining                Y        N
> > >>>>> Scala/Spark Bindings       Y        N/A
> > >>>>> Scalability                Y        Y
> > >>>>
> > >>>> Mahout doesn't actually have strong features for clustering,
> > >>>> classification, and regression. Mahout is very strong in
> > >>>> recommendations (which you don't mention) and dimensionality
> > >>>> reduction.
> > >>>>
> > >>>> Mahout does support Scala in the development version.
> > >>>>
> > >>>> What do you mean by support for NumPy?
