For the record, this is all a false dilemma (at least w.r.t. Spark vs. the Mahout Spark bindings).
The Spark bindings have never been conceived as one vs. the other. The Mahout Scala bindings are an add-on on top of Spark that just happens to rely on some things in mahout-math. With Spark one gets some major things, namely RDDs, MLlib, Spark SQL, and GraphX. Guess what: with the Spark bindings one still gets all of those wonderful things, plus the bindings and the bindings shell. Most of the added value in the Spark bindings is the R-like notation for algebra and the distributed algebraic optimizer. Of course there are also all those wonderful distributed decompositions and PCA things, naive Bayes, and I think some of the co-occurrence stuff too. (The implicit ALS work for Spark was sadly never committed; it is available on a PR branch only.) Internally, my company has built several times more methodology code on the Spark bindings than the Spark bindings have on their own.

The Spark bindings are also 100% Scala. The only thing that is non-Scala (at runtime) is the in-memory Colt-derived matrix model, which is adapted to the R-like DSL by the Scala bindings. Oh well, can't have it all.

Bottom line: for the most part, I feel you are building a straw-man argument here. You are presenting the problem as a constrained choice with inevitable loss, whereas there has never been a loss of choice. Even just for the sake of the algebraic decompositions and optimizations, I feel there is significant added value. (Again, this is only relevant to the bindings stuff, not the 0.9 MR stuff, all of which is now deprecated.)

The only two problems I see are: (1) Mahout takes in too many legacy dependencies that are hard to sort through if one is using it strictly in Spark-based apps; there are too many things to sort through and throw away in that tree. I actually use an opt-in approach (that is, I remove all transitive dependencies by default and add them back one by one only where there is an actual runtime dependency). This is something that could, and should, be improved incrementally.
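To make the "R-like notation plus optimizer" point concrete, here is a minimal sketch of what algebra on the Spark bindings looks like. This is illustrative only, not runnable standalone: it assumes a Mahout 0.10+ snapshot on the classpath plus a live Spark context, and the HDFS path and the rank k=40 are made-up example values.

```
// Sketch only -- assumes mahout-spark + mahout-math on the classpath
// and a reachable Spark master; the path and k below are hypothetical.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.decompositions._
import org.apache.mahout.sparkbindings._

implicit val ctx: DistributedContext =
  mahoutSparkContext(masterUrl = "local", appName = "dsl-sketch")

// A distributed row matrix (DRM) backed by an RDD under the hood.
val drmA = drmDfsRead(path = "hdfs://nameNode/tmp/A.drm")

// R-like algebra: the expression builds a logical plan, and the
// algebraic optimizer rewrites A' %*% A into a single self-squaring
// physical operator rather than a literal transpose-then-multiply.
val drmAtA = drmA.t %*% drmA

// The distributed decompositions ride on the same DSL, e.g. thin SVD:
val (drmU, drmV, s) = dssvd(drmA, k = 40)
```

The point of the sketch is that nothing here hides Spark: `drmA` still wraps an RDD one can drop down to, so MLlib, Spark SQL, and GraphX remain fully available alongside the DSL.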
(2) The second design problem is that Mahout may be a bit of a problem to use alongside other on-top-of-Spark systems, because it takes over some things in Spark (e.g., it requires everything to work with Kryo serialization). But this is more a limitation of Spark itself.

Speaking of the "survival" and "popularity" concerns, which are very valid in themselves, I think the major problem with Mahout is none of these alleged "vs." things. Strictly IMO, it is that, being an ML project, unlike all those other wonderful things it is not widely backed by any major university or academic community. It never has been, and at this point it would seem it never will be. As such, unlike with some other projects, there is no perpetual source of ambitious researchers to contribute. And the original founders long since posted their last significant contribution.

On Wed, Oct 22, 2014 at 9:20 AM, Mahesh Balija <[email protected]> wrote:

> Hi Team,
>
> Thanks for your replies. Even if you consider the strong implementations
> of recommendations and SVD in Mahout, I would still say that even Spark
> 1.1.0 has support for collaborative filtering (alternating least squares
> (ALS)) and, under dimensionality reduction, SVD and PCA. With fast-paced
> contributions, I believe Spark may NOT be far from having new and stable
> algorithms added to it (like ANN, HMM, etc., and support for scientific
> libraries).
>
> Ted, even though the Mahout (1.0) development code base supports Scala
> and Spark bindings externally, Spark has built-in support for Scala (as
> it is developed in Scala). And NumPy is a Python-based scientific library
> which needs to be used for the Python-based MLlib support in Spark. The
> benefit is that Python is also supported in Spark for Python users.
>
> A major uniqueness of Mahout is that, as Mahout is inherited from Lucene,
> it has built-in support for text processing.
> Of course I do NOT believe it is a strong point, as I assume that
> developers who know Lucene can easily use it with Spark through the Java
> interface.
>
> Mahout has currently stopped support for Hadoop (i.e., for further
> libraries); on the other hand, Spark can easily re-use data present in
> Hadoop/HBase (maybe NOT the MapReduce functionality, as Spark has its
> own computation layer).
>
> *As a long-time user of Mahout I strongly support Mahout (despite its
> poor visualization capabilities). At the same time, I am trying to
> understand: if Spark continues to evolve its MLlib package, with its
> support for in-memory computation, rich scientific libraries through
> Scala, and support for languages like Java/Scala/Python, will the
> survival of Mahout be questionable?*
>
> Best!
> Mahesh Balija.
>
>
> On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <[email protected]> wrote:
>
> > I know we lost the maintainer for fpgrowth somewhere along the line,
> > but it's definitely something I'd love to see carried forward, too.
> >
> > Sent from my iPhone
> >
> > > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <[email protected]> wrote:
> > >
> > > Sing it, brother! I miss FP-Growth as well. Once the Scala bindings
> > > are in, I'm hoping to work up some time series methods.
> > >
> > >> On Oct 21, 2014, at 8:00 PM, Lee S <[email protected]> wrote:
> > >>
> > >> As a developer facing the library choice between Mahout and MLlib,
> > >> I have some thoughts below.
> > >> Mahout has no decision tree algorithm, while MLlib has the
> > >> components for constructing one, such as Gini index and information
> > >> gain. I also think Mahout could add algorithms for frequent pattern
> > >> mining, which is very important in feature selection and statistical
> > >> analysis; MLlib has no frequent pattern mining algorithms.
> > >> P.S. Why was the fpgrowth algorithm removed in version 0.9?
> > >>
> > >> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <[email protected]>:
> > >>
> > >>> Actually, Spark is available in Python also, so users of Spark
> > >>> have an upper hand over traditional users of Mahout. This applies
> > >>> to all the libraries of Python (including NumPy).
> > >>>
> > >>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <[email protected]> wrote:
> > >>>
> > >>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <[email protected]> wrote:
> > >>>>
> > >>>>> I am trying to differentiate between Mahout and Spark. Here is
> > >>>>> a small list:
> > >>>>>
> > >>>>> Feature                    Mahout   Spark
> > >>>>> Clustering                 Y        Y
> > >>>>> Classification             Y        Y
> > >>>>> Regression                 Y        Y
> > >>>>> Dimensionality Reduction   Y        Y
> > >>>>> Java                       Y        Y
> > >>>>> Scala                      N        Y
> > >>>>> Python                     N        Y
> > >>>>> NumPy                      N        Y
> > >>>>> Hadoop                     Y        Y
> > >>>>> Text Mining                Y        N
> > >>>>> Scala/Spark Bindings       Y        N/A
> > >>>>> Scalability                Y        Y
> > >>>>
> > >>>> Mahout doesn't actually have strong features for clustering,
> > >>>> classification, and regression. Mahout is very strong in
> > >>>> recommendations (which you don't mention) and dimensionality
> > >>>> reduction.
> > >>>>
> > >>>> Mahout does support Scala in the development version.
> > >>>>
> > >>>> What do you mean by support for NumPy?
