Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

Keith Aumiller Tue, 31 Jan 2017 12:43:49 -0800

I was just watching it. ;)

https://trevorgrant.org/


Thanks Trevor!

On Tue, Jan 31, 2017 at 3:41 PM, scott cote <scottcc...@gmail.com> wrote:

> Trevor gave a great presentation at our user group.  It was live streamed
> on Periscope.  Trevor - maybe you could share the url?  I don’t have it
> handy at the moment.
>
> Scott
> > On Jan 31, 2017, at 8:50 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
> >
> > Hello Isabel and Florent,
> >
> > I'm currently working on a side-by-side demo of R / Python /
> > SparkML (MLlib) / Mahout, but in very broad strokes here is how I would
> > compare them:
> >
> > R - Most statistical functionality.  Most flexibility.  Implement your own
> > algorithms in a mathematically expressive language.  Worst performance:
> > handles only "small" data sets.  Language is 'math centric'. Easy to
> > extend / create new algos.
> >
> > Python (sklearn/scikit) - Some mathematical / statistical functionality,
> > more focused on machine learning. Machine learning library very
> > sophisticated though.  Much better performance than R, still only single
> > node. "small to medium" data sets. Language is 'programmer centric'.
> > Somewhat difficult to extend / create new algos
> >
> > SparkML / MLlib - Very limited mathematical functionality (usually
> > collects to driver to do anything of substance).  Machine learning is
> > rudimentary compared to sklearn, but still non-trivial and one of the best
> > available.  Excellent performance, well suited to "big" data sets.
> > Language is 'programmer centric'. Very difficult to extend / create new
> > algos.
> >
> > (FlinkML - Fits in the same spot as SparkML, but significantly less
> > developed)
> >
> > Mahout - Good mathematical functionality.  Good performance relative to
> > underlying engine (possibly superior with MAHOUT-1885).  Language is
> > 'math centric'.  Well suited to "medium and big" data sets. Fairly easy to
> > extend / create new algos (MAHOUT-1856).
> >
> > I hope that provides a high level comparison.
> >
> > Re use cases: the tool to use depends on the job at hand.
> > Highly advanced mathematical model, small dataset or sampling from full
> > dataset OK -> Use R
> > Machine learning on small to medium dataset or sampling from full
> > dataset OK -> Use Python / sklearn
> > Less sophisticated machine learning on large dataset -> SparkML
> > Custom mathematical/statistical model on medium to large data -> Mahout
> >
> > ^^ All of this is just my opinion.
> >
> > Re: integration-
> >
> > We're working on that too.  Recently MAHOUT-1896 added convenience
> > methods for interacting with MLlib-type RDDs and DataFrames:
> > https://issues.apache.org/jira/browse/MAHOUT-1896
> >
> > (No support yet for SparkML-type DataFrames, or spitting DRMs back out
> > into RDDs/DataFrames).
> >
> > Finally, docs: there has been some talk for some time of migrating the
> > website from CMS to Jekyll, and it's something I strongly support.  The
> > CMS makes it difficult to keep up with documentation, and Jekyll would
> > open up documentation / website maintenance to contributors.
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Tue, Jan 31, 2017 at 5:31 AM, Florent Empis <florent.em...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I am in the same spot as Isabel.
> >> I used to use/understand most of the «old» standalone Mahout, and now do
> >> some data transformation with Spark, but I am not sure where Samsara fits
> >> in the ecosystem.
> >> We also do quite a bit of computation in R.
> >> Basically we are willing to learn and support the project by, for
> >> instance, buying the books Rob mentioned, but a short doc with the outline
> >> Isabel describes would be great!
> >>
> >> Many thanks,
> >>
> >> Florent
> >>
> >>
> >> On 31 Jan 2017 12:01, "Isabel Drost-Fromm" <isa...@apache.org> wrote:
> >>
> >>
> >> Hi,
> >>
> >> On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
> >>> and we're thinking about just how many pre-built algorithms we
> >>> should include in the library versus working on performance behind the
> >>> scenes.
> >>
> >> To pick this question up: I've been watching Mahout from a distance for
> >> quite some time. So from what limited background I have of Samsara, I
> >> really like its approach of being able to run on more than one execution
> >> engine.
> >>
> >> To give some advice to downstream users in the field: what would be your
> >> advice for people tasked with concrete use cases (stuff like fraud
> >> detection, anomaly detection, learning search ranking functions, building
> >> a recommender system)? Is that something that can still be done with
> >> Mahout? What would it take to get from raw data to finished system? Is
> >> there something we can do to help users get that accomplished? Is there
> >> even interest from users in such a use-case-based perspective? If so,
> >> would there be interest among the Mahout committers to help users
> >> publicly create docs/examples/modules to support these use cases?
> >>
> >>
> >> Isabel
> >>
>
>


-- 
Thanks,

Keith Aumiller
MBA - IT Professional
Lafayette Hill PA
314-369-0811
