Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Pat Ferrel Tue, 31 Jan 2017 09:22:16 -0800

My perspective comes from the data side. I work in recommenders, and that means 
log analysis over huge amounts of data. Even a small shop doing this will 
immediately run out of capacity in Python or R on a single node. MLlib is a 
set of prepackaged algorithms that will work (mostly) with big data. Mahout 
Samsara is the only general linear algebra tool I know of that will natively 
let you interactively run R-like code on any size cluster, then polish it for 
production, all without changing tools or language.

Going from analytics to recommenders means a jump in data size of several 
orders of magnitude and this is just one example.

On Jan 31, 2017, at 6:50 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Hello Isabel and Florent,

I'm currently working on a side-by-side demo of R / Python / SparkML (MLlib)
/ Mahout, but in very broad strokes here is how I would compare them:

R - Most statistical functionality.  Most flexibility.  Implement your own
algorithms in a mathematically expressive language.  Worst performance; handles
only "small" data sets.  Language is 'math centric'.  Easy to extend /
create new algos.

Python (sklearn/scikit) - Some mathematical / statistical functionality,
more focused on machine learning. Machine learning library very
sophisticated though.  Much better performance than R, still only single
node. "small to medium" data sets. Language is 'programmer centric'.
Somewhat difficult to extend / create new algos

SparkML / MLlib - Very limited mathematical functionality (usually collects
to the driver to do anything of substance).  Machine learning is rudimentary
compared to sklearn, but still non-trivial and one of the best available.
Excellent performance, well suited to "big" data sets. Language is
'programmer centric'. Very difficult to extend / create new algos.

(FlinkML - Fits in same spot as SparkML, but significantly less developed)

Mahout - Good mathematical functionality.  Good performance relative to
underlying engine (possibly superior with MAHOUT-1885).  Language is 'math
centric'.  Well suited to "medium and big" data sets. Fairly easy to extend
/ create new algos (MAHOUT-1856)

I hope that provides a high level comparison.

Re: use cases - the tool to use depends on the job at hand.

Highly advanced mathematical model; small dataset, or sampling from the
  full dataset is OK -> Use R
Machine learning on a small to medium dataset, or sampling from the
  full dataset is OK -> Use Python / sklearn
Less sophisticated machine learning on a large dataset -> Use SparkML
Custom mathematical/statistical model on medium to large data -> Use Mahout

^^ All of this is just my opinion.
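
As a rough illustration only, the rules of thumb above can be written down as
a small Python function. The function name, the size labels ("small",
"medium", "large") and the need labels ("math", "ml") are hypothetical and
chosen for this sketch; it also assumes that the "highly advanced" and
"custom" mathematical cases can be treated as one category. It encodes the
opinions stated above, nothing more.

```python
def suggest_tool(data_size: str, need: str, sampling_ok: bool = False) -> str:
    """Return the tool suggested by the rules of thumb in this email.

    data_size:   "small", "medium" or "large" (hypothetical labels)
    need:        "math" for a custom/advanced mathematical or statistical
                 model, "ml" for off-the-shelf machine learning
                 (hypothetical labels)
    sampling_ok: True if working on a sample of the full dataset is acceptable
    """
    if data_size not in ("small", "medium", "large"):
        raise ValueError(f"unknown data_size: {data_size!r}")
    if need not in ("math", "ml"):
        raise ValueError(f"unknown need: {need!r}")

    if need == "math":
        # Small data (or an acceptable sample) -> R; otherwise -> Mahout.
        if data_size == "small" or sampling_ok:
            return "R"
        return "Mahout"

    # need == "ml": small to medium data (or an acceptable sample) stays on a
    # single node with sklearn; otherwise -> SparkML.
    if data_size in ("small", "medium") or sampling_ok:
        return "Python / sklearn"
    return "SparkML"


if __name__ == "__main__":
    for size in ("small", "medium", "large"):
        for need in ("math", "ml"):
            print(f"{size:6} {need:4} -> {suggest_tool(size, need)}")
```

Real projects rarely fit so neatly into one row, so treat the output as a
starting point for discussion rather than a recommendation.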

Re: integration-

We're working on that too.  Recently MAHOUT-1896 added convenience methods
for interacting with MLlib-type RDDs and DataFrames:
https://issues.apache.org/jira/browse/MAHOUT-1896

(No support yet for SparkML type dataframes, or spitting DRMs back out into
RDDs/DataFrames).

Finally, docs: There has been some talk for some time of migrating the
website from CMS to Jekyll, and it's something I strongly support.  The CMS
makes it difficult to keep up with documentation, and Jekyll would open up
documentation / website maintenance to contributors.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*

On Tue, Jan 31, 2017 at 5:31 AM, Florent Empis <florent.em...@gmail.com>
wrote:

> Hi,
> 
> I am in the same spot as Isabel.
> Used to use/understand most of the «old» standalone mahout, now doing some
> data transformation with spark, but I am not sure where Samsara fits in the
> ecosystem.
> We also do quite a bit of computation in R.
> Basically we are willing to learn and support the project by, for instance,
> buying the books Rob mentioned, but a short doc with the outline Isabel
> describes would be great!
> 
> Many thanks,
> 
> Florent
> 
> 
> On 31 Jan 2017 12:01, "Isabel Drost-Fromm" <isa...@apache.org> wrote:
> 
> 
> Hi,
> 
> On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
>> and we're thinking about just how many pre-built algorithms we
>> should include in the library versus working on performance behind the
>> scenes.
> 
> To pick this question up: I've been watching Mahout from a distance for
> quite some time. From what limited background I have of Samsara, I really
> like its approach of being able to run on more than one execution engine.
> 
> To give some advice to downstream users in the field: what would be your
> advice for people tasked with concrete use cases (stuff like fraud
> detection, anomaly detection, learning search ranking functions, building a
> recommender system)? Is that something that can still be done with Mahout?
> What would it take to get from raw data to a finished system? Is there
> something we can do to help users get that accomplished? Is there even
> interest from users in such a use-case-based perspective? If so, would there
> be interest among the Mahout committers to help users publicly create
> docs/examples/modules to support these use cases?
> 
> 
> Isabel
>
