Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Florent Empis Tue, 31 Jan 2017 12:55:02 -0800

From my point of view, Mahout as a whole has shifted from what it was in
2009-2012.
At the time, Mahout (and Mahout in Action is a great testimony of that era)
was a collection of building blocks, full of relatively high-level
mathematical concepts but usable by what I'd call (myself included)
wanna-be data scientists.
With an approach akin to "data science for hackers", it was possible to
build a crude but working ML tool, such as a recommender. My memory is
hazy, but I think my first experiments with Taste date from 2008. I had,
at the time, no intimate mathematical knowledge of what the code written by
Ted, Sean & many others did. I fed order histories into it and got back
recommendations that "made sense". I then did the same with kNN & more.
Over the years, I got better at understanding the mathematical concepts,
thanks in particular to scientists who took the time to explain to me,
a tech/data guy, the mathematical concepts behind the blocks I
blindly used.


I'd say that "mahout today" is a distributed mathematical toolbox. Nothing
wrong with that, absolutely nothing. It has its purposes but I feel it's no
longer aimed at "tech people wanting to have a go at machine learning".

When I take a look at my company code repository, even though I'm less and
less involved in day to day design decisions, I see that its "lively"
components are indeed using stuff like Tensorflow & dl4j.

My scientific credentials are obviously way less impressive than Ted's,
whom I had the pleasure to meet a few times, as well as quite a few MapR
employees, but I reach exactly the same conclusion coming from a
tech/functional background: for recommendation, don't bother reinventing
the wheel or using "fancy" ALS stuff (been there, done that, saw no
impressive gain in practical use cases): buy an off-the-shelf solution
(disclaimer: I sell one ;-) ) or build it from Mahout Taste and do some
data wrangling with a search engine (but if you're in a hurry, definitely
go and talk to vendors, a few caveats apply :-) ). For everything else ML
related, have a go at TensorFlow implementations related to your use case;
you will find books which are as didactic as Mahout in Action was 6 years
ago.
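
To make the "order histories in, recommendations out" idea concrete, here is
a minimal sketch in plain Python. It is not Taste's API and not Mahout's
actual algorithm: it scores items by raw co-occurrence counts, whereas
Mahout's indicator approach filters co-occurrences with a log-likelihood
ratio test. The function names and the sample orders are hypothetical and
only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(orders):
    """Count how often each pair of items appears in the same order."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in orders:
        # De-duplicate items within an order so each pair counts once per order.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(history, counts, top_n=3):
    """Rank items not yet in the history by summed co-occurrence with it."""
    seen = set(history)
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts.get(item, {}).items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical order histories, invented for this example.
orders = [
    ["tent", "stove", "lantern"],
    ["tent", "stove"],
    ["tent", "sleeping_bag"],
    ["stove", "lantern"],
]
counts = cooccurrence_counts(orders)
print(recommend(["tent"], counts))
```

In a search-engine deployment, the co-occurring items for each product would
typically be stored as a field on that product's document, and a user's
recent history would be used as the query against that field.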

All in all: congrats to the Mahout team, past and current contributors, you
did a damn good job and got me into this field, for which I am very
grateful!

2017-01-31 18:30 GMT+01:00 Ted Dunning <ted.dunn...@gmail.com>:

> From my perspective, the state of the art of machine learning is with
> systems like Tensorflow and dl4j. If you can deal with the limits of a
> non-clustered GPU system, then Theano and Caffe are very useful. Keras
> papers over the difference between different back-ends nicely.
>
> Tensorflow and Theano can do a lot of mathematical and linear (tensor,
> actually) algebra work nicely, especially if there is an optimization
> problem lurking.
>
> NVidia also has a very strong commercial offering that supports their GPU
> clustering well.
>
> Spark ML lags very far behind this state of the art, but is still useful
> for simpler situations.
>
> For recommendations, the situation is very different.  Almost all
> applications are most easily and often most accurately solved using an
> indicator-based approach and the go-to implementation of this is Mahout.
>
> There is a lot of noise in the world about factorization-based
> recommendation using ALS and such, but the noise is not warranted.
> Deploying a recommender in a search engine is just better.
>
> I have not personally used Samsara much, but the idea of a strong optimizer
> over the top of a nice syntax for linear algebra is a good one.
>
> On Tue, Jan 31, 2017 at 9:21 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > My perspective comes from the data side. I work in recommenders and that
> > means log analysis for huge amounts of data. Even a small shop doing this
> > will immediately run out of capacity in Python or R on a single node.
> > MLlib is a set of prepackaged algorithms that will work (mostly) with big
> > data. Mahout Samsara is the only general linear algebra tool I know of that
> > will natively let you interactively run R-like code on any size cluster,
> > then polish it for production, all without changing tools or language.
> >
> > Going from analytics to recommenders means a jump in data size of several
> > orders of magnitude and this is just one example.
> >
> >
> > On Jan 31, 2017, at 6:50 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> > wrote:
> >
> > Hello Isabel and Florent,
> >
> > I'm currently working on a side-by-side demo of R / Python / SparkML (MLlib)
> > / Mahout, but in very broad strokes here is how I would compare them:
> >
> > R - Most statistical functionality. Most flexibility. Implement your own
> > algorithms; mathematically expressive language. Worst performance:
> > handles only "small" data sets. Language is 'math centric'. Easy to extend /
> > create new algos.
> >
> > Python (sklearn/scikit) - Some mathematical / statistical functionality,
> > more focused on machine learning. Machine learning library very
> > sophisticated though. Much better performance than R, still only single
> > node. "Small to medium" data sets. Language is 'programmer centric'.
> > Somewhat difficult to extend / create new algos.
> >
> > SparkML / MLlib - Very limited mathematical functionality (usually collects
> > to driver to do anything of substance). Machine learning rudimentary
> > compared to sklearn, but still non-trivial and one of the best available.
> > Excellent performance, well suited to "big" data sets. Language is
> > 'programmer centric'. Very difficult to extend / create new algos.
> >
> > (FlinkML - Fits in same spot as SparkML, but significantly less developed)
> >
> > Mahout - Good mathematical functionality. Good performance relative to
> > underlying engine (possibly superior with MAHOUT-1885). Language is 'math
> > centric'. Well suited to "medium and big" data sets. Fairly easy to extend
> > / create new algos (MAHOUT-1856).
> >
> > I hope that provides a high level comparison.
> >
> > Re use cases - the tool to use depends on the job at hand.
> > Highly advanced mathematical model, small dataset or sampling from full
> > dataset OK -> Use R
> > Machine learning on small to medium data set or sampling from full dataset
> > OK -> Use Python / sklearn
> > Less sophisticated machine learning on large dataset -> SparkML
> > Custom mathematical/statistical model on medium to large data -> Mahout
> > ^^ All of this is just my opinion.
> >
> > Re: integration-
> >
> > We're working on that too. Recently MAHOUT-1896 added convenience methods
> > for interacting with MLlib-type RDDs and DataFrames:
> > https://issues.apache.org/jira/browse/MAHOUT-1896
> >
> > (No support yet for SparkML-type DataFrames, or spitting DRMs back out into
> > RDDs/DataFrames).
> >
> > Finally, docs: there has been some talk for some time of migrating the
> > website from CMS to Jekyll, and it's something I strongly support. The CMS
> > makes it difficult to keep up with documentation, and Jekyll would open up
> > documentation / website maintenance to contributors.
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Tue, Jan 31, 2017 at 5:31 AM, Florent Empis <florent.em...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I am in the same spot as Isabel.
> > > I used to use/understand most of the «old» standalone Mahout and now do some
> > > data transformation with Spark, but I am not sure where Samsara fits in the
> > > ecosystem.
> > > We also do quite a bit of computation in R.
> > > Basically we are willing to learn and support the project by, for instance,
> > > buying the books Rob mentioned, but a short doc with the outline Isabel
> > > describes would be great!
> > >
> > > Many thanks,
> > >
> > > Florent
> > >
> > >
> > > On 31 Jan 2017 12:01, "Isabel Drost-Fromm" <isa...@apache.org> wrote:
> > >
> > >
> > > Hi,
> > >
> > > On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
> > >> and we're thinking about just how many pre-built algorithms we
> > >> should include in the library versus working on performance behind the
> > >> scenes.
> > >
> > > To pick this question up: I've been watching Mahout from a distance for quite
> > > some time. So from what limited background I have of Samsara, I really like
> > > its approach of being able to run on more than one execution engine.
> > >
> > > To give some advice to downstream users in the field - what would be your
> > > advice for people tasked with concrete use cases (stuff like fraud detection,
> > > anomaly detection, learning search ranking functions, building a recommender
> > > system)? Is that something that can still be done with Mahout? What would it
> > > take to get from raw data to finished system? Is there something we can do to
> > > help users get that accomplished? Is there even interest from users in such a
> > > use-case-based perspective? If so, would there be interest among the Mahout
> > > committers to help users publicly create docs/examples/modules to support
> > > these use cases?
> > >
> > >
> > > Isabel
> > >
> >
> >
>
