From my point of view, Mahout as a whole has shifted from what it was in 2009-2012. At the time, Mahout (and Mahout in Action is a great testimony of that era) was a collection of building blocks, full of relatively high-level mathematical concepts but usable by what I'd call (myself included) wannabe data scientists. With an approach akin to "data science for hackers", it was possible to build a crude but working ML tool, such as a recommender. My memory is hazy, but I think my first experiments with Taste date from 2008. At the time, I had no intimate mathematical knowledge of what the code written by Ted, Sean & many others did. I fed order histories into it and got back recommendations that "made sense". I then did the same with kNN & more. Over the years, I got better at understanding the mathematical concepts, thanks in particular to scientists who took the time to explain to me, a tech/data guy, the mathematics behind the blocks I had been using blindly.
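
To make the "feed order histories in, get sensible recommendations out" workflow concrete, here is a minimal Python sketch. It is not Taste's API or its algorithms: it assumes plain item co-occurrence counting as a stand-in, and the `ORDERS` data, `build_cooccurrence` and `recommend` are hypothetical names made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(order_histories):
    """Count how often each pair of items appears in the same user's history."""
    cooc = defaultdict(lambda: defaultdict(int))
    for items in order_histories.values():
        # De-duplicate per user so repeat purchases do not inflate the counts.
        for a, b in combinations(sorted(set(items)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def recommend(user, order_histories, cooc, top_n=3):
    """Score items the user has not ordered by co-occurrence with their items."""
    seen = set(order_histories[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in cooc[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken by item name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical toy data (an assumption for illustration): user -> items ordered.
ORDERS = {
    "alice": ["tent", "stove", "lantern"],
    "bob": ["tent", "stove"],
    "carol": ["tent", "lantern", "boots"],
    "dave": ["stove"],
}

COOC = build_cooccurrence(ORDERS)
print(recommend("dave", ORDERS, COOC))
```

On this toy data, a user who has only ordered a stove is recommended the tent first, then the lantern, because the stove was bought alongside a tent by two customers and alongside a lantern by one.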
I'd say that "Mahout today" is a distributed mathematical toolbox. Nothing wrong with that, absolutely nothing. It has its purposes, but I feel it's no longer aimed at "tech people wanting to have a go at machine learning". When I look at my company's code repository, even though I'm less and less involved in day-to-day design decisions, I see that its "lively" components are indeed using things like Tensorflow & dl4j.

My scientific credentials are obviously far less impressive than those of Ted, whom I have had the pleasure of meeting a few times, along with quite a few MapR employees, but I reach exactly the same conclusion coming from a tech/functional background: for recommendation, don't bother reinventing the wheel or using "fancy" ALS stuff (been there, done that, saw no impressive gain in practical use cases). Buy an off-the-shelf solution (disclaimer: I sell one ;-) ) or build it from Mahout Taste and do some data wrangling with a search engine (but if you're in a hurry, definitely go and talk to vendors; a few caveats apply :-) ). For everything else ML-related, have a go at the Tensorflow implementations related to your use case; you will find books that are as didactic as Mahout in Action was 6 years ago.

All in all: congrats to the Mahout team, past and current contributors. You did a damn good job and got me into this field, for which I am very grateful!

2017-01-31 18:30 GMT+01:00 Ted Dunning <ted.dunn...@gmail.com>:

> From my perspective, the state of the art of machine learning is with
> systems like Tensorflow and dl4j. If you can deal with the limits of a
> non-clustered GPU system, then Theano and Caffe are very useful. Keras
> papers over the difference between different back-ends nicely.
>
> Tensorflow and Theano can do a lot of mathematical and linear (tensor,
> actually) algebra work nicely, especially if there is an optimization
> problem lurking.
>
> NVidia also has a very strong commercial offering that supports their GPU
> clustering well.
>
> Spark ML lags this state of the art very far behind, but is still useful
> for simpler situations.
>
> For recommendations, the situation is very different. Almost all
> applications are most easily and often most accurately solved using an
> indicator-based approach and the go-to implementation of this is Mahout.
>
> There is a lot of noise in the world about factorization-based
> recommendation using ALS and such, but the noise is not warranted.
> Deploying a recommender in a search engine is just better.
>
> I have not personally used Samsara much, but the idea of a strong optimizer
> over the top of a nice syntax for linear algebra is a good one.
>
> On Tue, Jan 31, 2017 at 9:21 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > My perspective comes from the data side. I work in recommenders and that
> > means log analysis for huge amounts of data. Even a small shop doing this
> > will immediately run out of capacity in Python or R on a single node.
> > MLlib is a set of prepackaged algorithms that will work (mostly) with big
> > data. Mahout Samsara is the only general linear algebra tool I know of that
> > will natively let you interactively run R-like code on any size cluster,
> > then polish it for production all without changing tools, or language.
> >
> > Going from analytics to recommenders means a jump in data size of several
> > orders of magnitude and this is just one example.
> >
> >
> > On Jan 31, 2017, at 6:50 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> > wrote:
> >
> > Hello Isabel and Florent,
> >
> > I'm currently working on a side-by-side demo of R / Python / SparkML(Mllib)
> > / Mahout, but in very broad strokes here is how I would compare them:
> >
> > R- Most statistical functionality. Most flexibility. Implement your own
> > algorithms- mathematically expressive language. Worst performance- handles
> > only "small" data sets. Language is 'math centric'.
> > Easy to extend / create new algos
> >
> > Python (sklearn/scikit) - Some mathematical / statistical functionality,
> > more focused on machine learning. Machine learning library very
> > sophisticated though. Much better performance than R, still only single
> > node. "small to medium" data sets. Language is 'programmer centric'.
> > Somewhat difficult to extend / create new algos
> >
> > SparkML / Mllib - Very Limited Mathematical functionality (usually collects
> > to driver to do anything of substance). Machine learning rudimentary
> > compared to sklearn, but still non-trivial one of the best available.
> > Exceeding performance, well suited to "big" data sets. Language is
> > 'programmer centric'. Very difficult to extend / create new algos.
> >
> > (FlinkML - Fits in same spot as SparkML, but significantly less developed)
> >
> > Mahout - Good mathematical functionality. Good performance relative to
> > underlying engine (possibly superior with MAHOUT-1885). Language is 'math
> > centric'. Well suited to "medium and big" data sets. Fairly easy to extend
> > / create new algos (MAHOUT-1856)
> >
> > I hope that provides a high level comparison.
> >
> > Re use cases- the tool to use depends on the job at hand.
> > Highly advanced mathematical model, small dataset or sampling from full
> > dataset OK -> Use R
> > Machine learning on small to medium data set or sampling from full dataset
> > OK -> Use Python / sklearn
> > Less sophisticated machine learning on Large dataset -> SparkML
> > Custom mathematical/statistical model on medium to large data -> Mahout
> >
> > ^^ All of this is just my opinion.
> >
> > Re: integration-
> >
> > We're working on that too. Recently MAHOUT-1896 added convenience methods
> > for interacting with MLLib type RDDs, and DataFrames
> > https://issues.apache.org/jira/browse/MAHOUT-1896
> >
> > (No support yet for SparkML type dataframes, or spitting DRMs back out into
> > RDDs/DataFrames).
> >
> > Finally Docs: There has been some talk for some time of migrating the
> > website from CMS to Jekyll and it's something I strongly support. The CMS
> > makes it difficult to keep up with documentation, and Jekyll would open up
> > documentation / website maintenance to contributors.
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things." -Virgil*
> >
> >
> > On Tue, Jan 31, 2017 at 5:31 AM, Florent Empis <florent.em...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I am in the same spot as Isabel.
> > > Used to use/understand most of the «old» standalone mahout, now doing some
> > > data transformation with spark, but I am not sure where Samsara fits in the
> > > ecosystem.
> > > We also do quite a bit of computation in R.
> > > Basically we are willing to learn and support the project by for instance
> > > buying the books Rob mentioned, but a short doc with the outline Isabel
> > > describes would be great!
> > >
> > > Many thanks,
> > >
> > > Florent
> > >
> > >
> > > On 31 Jan 2017 at 12:01, "Isabel Drost-Fromm" <isa...@apache.org> wrote:
> > >
> > >
> > > Hi,
> > >
> > > On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
> > >> and we're thinking about just how many pre-built algorithms we
> > >> should include in the library versus working on performance behind the
> > >> scenes.
> > >
> > > To pick this question up: I've been watching Mahout from a distance for
> > > quite some time. So from what limited background I have of Samsara I really
> > > like its approach to be able to run on more than one execution engine.
> > >
> > > To give some advice to downstream users in the field - what would be your
> > > advice for people tasked with concrete use cases (stuff like fraud detection,
> > > anomaly detection, learning search ranking functions, building a recommender
> > > system)? Is that something that can still be done with Mahout? What would it
> > > take to get from raw data to finished system? Is there something we can do to
> > > help users get that accomplished? Is there even interest from users in such a
> > > use case based perspective? If so, would there be interest among the Mahout
> > > committers to help users publicly create docs/examples/modules to support
> > > these use cases?
> > >
> > >
> > > Isabel