Hi Dmitriy,

My apologies if I have conveyed my questions incorrectly. Also, my intention is definitely NOT to argue.
I have experience with Mahout, and I am also working on some content to simplify Mahout, which is why I needed these clarifications. I am also evaluating both frameworks, and just wanted some input from the active contributors.

Best!
Mahesh Balija.

On Wed, Oct 22, 2014 at 6:57 PM, Dmitriy Lyubimov <[email protected]> wrote:

> For the record, this is all a false dilemma (at least w.r.t. Spark vs. Mahout Spark bindings).
>
> The Spark bindings were never conceived as one vs. the other.
>
> The Mahout Scala bindings are an on-top add-on to Spark that just happens to rely on some things in mahout-math.
>
> With Spark one gets some major things: RDDs, MLlib, Spark SQL, and GraphX.
>
> Guess what, with the Spark bindings one still gets all of those wonderful things, plus the bindings and the bindings shell.
>
> Most of the added value in the Spark bindings is R-like notation for the algebra and a distributed algebraic optimizer. Of course there are also all those wonderful distributed decompositions and PCA things, naive Bayes, and I think some of the co-occurrence stuff too. (The implicit ALS work for Spark was never committed, sadly; it is available on a PR branch only.) Internally, my company has built several times more methodology code on the Spark bindings than the bindings have on their own.
>
> The Spark bindings are also 100% Scala. The only thing that is non-Scala (at runtime) is the in-memory Colt-derived matrix model, which is adapted to the R-like DSL with the Scala bindings. Oh well, can't have it all.
>
> Bottom line: for the most part I feel you are building a straw-man argument here. You are presenting the problem as a constrained choice with inevitable loss, whereas there has never been a loss of choice. Even just for the sake of the algebraic decompositions and optimizations, I feel there is significant added value. (Of course, again, this is only relevant to the bindings stuff, not the 0.9 MR stuff, all of which is now deprecated.)
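The "distributed algebraic optimizer" Dmitriy mentions can be illustrated with a classic rewrite: an expression like `A.t %*% A` need not materialize the transpose of a distributed row matrix; the Gram matrix can be accumulated in a single pass over the rows. The following is a hypothetical pure-Python sketch of that idea, not Mahout code; both function names are made up for illustration:

```python
# Sketch of the kind of rewrite a distributed algebraic optimizer performs.
# Naive plan: materialize A^T (an expensive shuffle on a distributed row
# matrix), then multiply. Optimized plan: compute the same Gram matrix
# A^T A in ONE pass over the rows, as a sum of row outer products.

def naive_ata(a):
    """Transpose-then-multiply: the unoptimized plan."""
    n, m = len(a), len(a[0])
    at = [[a[i][j] for i in range(n)] for j in range(m)]   # materialize A^T
    return [[sum(at[i][k] * a[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]

def one_pass_ata(rows):
    """Optimized plan: accumulate the sum of row outer products in one scan."""
    m = len(rows[0])
    acc = [[0.0] * m for _ in range(m)]
    for r in rows:                      # a single pass over the rows of A
        for i in range(m):
            for j in range(m):
                acc[i][j] += r[i] * r[j]
    return acc

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(naive_ata(A))     # both plans produce the same Gram matrix A^T A
print(one_pass_ata(A))
```

In a distributed setting the one-pass plan matters because each worker can accumulate outer products over its own row partition and the partial results are simply summed.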
> The only two problems I see are: (1) Mahout pulls in too many legacy dependencies that are hard to sort through if one is using it strictly in Spark-based apps. There are too many things to sort through and throw away in that tree. I actually use an opt-in approach (that is, I remove all transitive dependencies by default and only add them one by one if there is an actual runtime dependency). This is something that could, and should, be improved incrementally.
>
> The second design problem is that Mahout may be a bit of a problem to use alongside other on-top-of-Spark systems, because it takes over some things in Spark (e.g. it requires things to work with Kryo). But this is more of a limitation of Spark itself.
>
> But speaking of the "survival" and "popularity" concerns, which are very valid in themselves, I think the major problem with Mahout is none of these alleged "vs." things. Strictly IMO, it is that, being an ML project, unlike all those other wonderful things, it is not widely backed by any major university or academic community. It never has been, and at this point it would seem it never will be. As such, unlike with some other projects, there is no perpetual source of ambitious researchers to contribute. And the original founders long since posted their last significant contribution.
>
> On Wed, Oct 22, 2014 at 9:20 AM, Mahesh Balija <[email protected]> wrote:
>
> > Hi Team,
> >
> > Thanks for your replies. Even if you consider the strong implementations of recommendations and SVD in Mahout, I would still say that even Spark 1.1.0 has support for collaborative filtering (alternating least squares (ALS)) and, under dimensionality reduction, SVD and PCA. With fast-paced contributions, I believe Spark may NOT be far away from having new and stable algorithms added to it (like ANN, HMM, etc., and support for scientific libraries).
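The SVD and PCA routines mentioned above (in both Mahout and MLlib) typically rest on iterative matrix-vector methods. As a hedged illustration of the underlying idea, and not the API of either library, here is power iteration in pure Python: repeatedly multiplying a vector by a symmetric matrix (e.g. a covariance matrix) converges to its dominant eigenvector, whose eigenvalue is the variance of the first principal component:

```python
# Pure-Python power iteration: the basic building block behind many
# PCA/SVD implementations (this is an illustration, not MLlib or Mahout code).
import math

def power_iteration(m, iters=100):
    n = len(m)
    v = [1.0] * n                       # arbitrary nonzero start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]       # renormalize each step
    # Rayleigh quotient v^T M v estimates the dominant eigenvalue
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(v[i] * mv[i] for i in range(n))
    return lam, v

# Symmetric matrix with eigenvalues 3 and 1
M = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(M)
print(round(lam, 6))   # ≈ 3.0, the leading eigenvalue
```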
> > Ted, even though the Mahout (1.0) development code base supports Scala and Spark bindings externally, Spark has built-in support for Scala (as it is developed in Scala). And NumPy is a Python-based scientific library which needs to be used to support the Python-based MLlib in Spark. The benefit is that Python is also supported in Spark for Python users.
> >
> > A major unique point of Mahout is that, as Mahout grew out of Lucene, it has built-in support for text processing. Of course I do NOT believe it is a strong point, as I assume that developers who know Lucene can easily use it with Spark through the Java interface.
> >
> > Mahout has currently stopped support for Hadoop (i.e., for further libraries); on the other hand, Spark can easily re-use data present in Hadoop/HBase (maybe NOT the MapReduce functionality, as Spark has its own computation layer).
> >
> > *As a long-time user of Mahout I strongly support Mahout (despite its poor visualization capabilities). At the same time, I am trying to understand: if Spark continues to evolve its MLlib package, with support for in-memory computation, rich scientific libraries through Scala, and support for languages like Java/Scala/Python, will the survival of Mahout be questionable?*
> >
> > Best!
> > Mahesh Balija.
> >
> > On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <[email protected]> wrote:
> >
> > > I know we lost the maintainer for fpgrowth somewhere along the line, but it's definitely something I'd love to see carried forward, too.
> > >
> > > Sent from my iPhone
> > >
> > > > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <[email protected]> wrote:
> > > >
> > > > Sing it, brother! I miss FP Growth as well. Once the Scala bindings are in, I'm hoping to work up some time series methods.
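For readers who never used the FP-Growth implementation being missed here: frequent pattern mining finds sets of items that co-occur in at least a minimum number of transactions. The sketch below is NOT the removed Mahout code; it is a deliberately brute-force pure-Python counter showing the problem itself. FP-Growth reaches the same answer far more efficiently by building a compact prefix tree (the FP-tree) instead of enumerating candidates:

```python
# Brute-force frequent itemset mining (illustration only, not FPGrowth).
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return all itemsets appearing in at least min_support transactions."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= set(t))
            if support >= min_support:
                result[cand] = support
                found = True
        if not found:   # Apriori property: no larger itemset can be frequent
            break
    return result

baskets = [{"beer", "diapers"}, {"beer", "bread"}, {"beer", "diapers", "milk"}]
print(frequent_itemsets(baskets, min_support=2))
```

On the toy baskets above, `beer`, `diapers`, and the pair `{beer, diapers}` meet the support threshold of 2; everything else is pruned.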
> > > >
> > > >> On Oct 21, 2014, at 8:00 PM, Lee S <[email protected]> wrote:
> > > >>
> > > >> As a developer facing the library choice between Mahout and MLlib, I have some thoughts below.
> > > >> Mahout has no decision tree algorithm, but MLlib has the components for constructing one, such as the Gini index and information gain. I also think Mahout could add algorithms for frequent pattern mining, which is very important in feature selection and statistical analysis. MLlib has no frequent pattern mining algorithms.
> > > >> P.S. Why was the fpgrowth algorithm removed in version 0.9?
> > > >>
> > > >> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <[email protected]>:
> > > >>
> > > >>> Actually, Spark is available in Python as well, so users of Spark have an upper hand over traditional users of Mahout. This applies to all the Python libraries (including NumPy).
> > > >>>
> > > >>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <[email protected]> wrote:
> > > >>>
> > > >>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <[email protected]> wrote:
> > > >>>>
> > > >>>>> I am trying to differentiate between Mahout and Spark; here is a small list:
> > > >>>>>
> > > >>>>> Feature                    Mahout   Spark
> > > >>>>> Clustering                 Y        Y
> > > >>>>> Classification             Y        Y
> > > >>>>> Regression                 Y        Y
> > > >>>>> Dimensionality Reduction   Y        Y
> > > >>>>> Java                       Y        Y
> > > >>>>> Scala                      N        Y
> > > >>>>> Python                     N        Y
> > > >>>>> Numpy                      N        Y
> > > >>>>> Hadoop                     Y        Y
> > > >>>>> Text Mining                Y        N
> > > >>>>> Scala/Spark Bindings       Y        N/A
> > > >>>>> Scalability                Y        Y
> > > >>>>
> > > >>>> Mahout doesn't actually have strong features for clustering, classification, and regression. Mahout is very strong in recommendations (which you don't mention) and dimensionality reduction.
> > > >>>>
> > > >>>> Mahout does support Scala in the development version.
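The "components of constructing a decision tree" Lee mentions, the Gini index and information gain, are short enough to state exactly. A pure-Python sketch (an illustration of the standard formulas, not the MLlib API):

```python
# Split-quality measures used in decision tree induction.
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

labels = ["spam", "spam", "ham", "ham"]
print(gini(labels))                      # 0.5 for a 50/50 split
print(information_gain(labels, [["spam", "spam"], ["ham", "ham"]]))  # 1.0 bit
```

A tree learner evaluates each candidate split with one of these measures and keeps the split that most reduces impurity (or yields the highest gain).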
> > > >>>>
> > > >>>> What do you mean by support for Numpy?
