Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

scott cote Tue, 31 Jan 2017 12:42:07 -0800

Trevor gave a great presentation at our user group.  It was live streamed on 
Periscope.  Trevor, maybe you could share the URL?  I don’t have it handy at 
the moment.


Scott
> On Jan 31, 2017, at 8:50 AM, Trevor Grant <[email protected]> wrote:
> 
> Hello Isabel and Florent,
> 
> I'm currently working on a side-by-side demo of R / Python / SparkML (MLlib)
> / Mahout, but in very broad strokes here is how I would compare them:
> 
> R - Most statistical functionality.  Most flexibility.  Mathematically
> expressive language in which you can implement your own algorithms.  Worst
> performance; handles only "small" data sets.  Language is 'math centric'.
> Easy to extend / create new algos.
> 
> Python (sklearn/scikit) - Some mathematical / statistical functionality,
> more focused on machine learning. The machine learning library is very
> sophisticated, though.  Much better performance than R, but still only
> single node. "Small to medium" data sets. Language is 'programmer centric'.
> Somewhat difficult to extend / create new algos.
> 
> SparkML / MLlib - Very limited mathematical functionality (usually collects
> to the driver to do anything of substance).  Machine learning is rudimentary
> compared to sklearn, but still non-trivial and one of the best available.
> Excellent performance, well suited to "big" data sets. Language is
> 'programmer centric'. Very difficult to extend / create new algos.
> 
> (FlinkML - Fits in same spot as SparkML, but significantly less developed)
> 
> Mahout - Good mathematical functionality.  Good performance relative to
> underlying engine (possibly superior with MAHOUT-1885).  Language is 'math
> centric'.  Well suited to "medium and big" data sets. Fairly easy to extend
> / create new algos (MAHOUT-1856)
> 
> I hope that provides a high level comparison.
> 
> Re: use cases - the tool to use depends on the job at hand.
> Highly advanced mathematical model, small dataset or sampling from full
> dataset OK -> Use R
> Machine learning on small to medium data set or sampling from full dataset
> OK -> Use Python / sklearn
> Less sophisticated machine learning on large dataset -> SparkML
> Custom mathematical/statistical model on medium to large data -> Mahout
> 
> ^^ All of this is just my opinion.
> 
> Re: integration-
> 
> We're working on that too.  Recently MAHOUT-1896 added convenience methods
> for interacting with MLlib-type RDDs and DataFrames:
> https://issues.apache.org/jira/browse/MAHOUT-1896
> 
> (No support yet for SparkML type dataframes, or spitting DRMs back out into
> RDDs/DataFrames).
> 
> Finally, docs: There has been some talk for some time of migrating the
> website from CMS to Jekyll, and it's something I strongly support.  The CMS
> makes it difficult to keep up with documentation, and Jekyll would open up
> documentation / website maintenance to contributors.
> 
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
> 
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> 
> 
> On Tue, Jan 31, 2017 at 5:31 AM, Florent Empis <[email protected]>
> wrote:
> 
>> Hi,
>> 
>> I am in the same spot as Isabel.
>> I used to use and understand most of the «old» standalone Mahout, and I now
>> do some data transformation with Spark, but I am not sure where Samsara
>> fits in the ecosystem.
>> We also do quite a bit of computation in R.
>> Basically, we are willing to learn and to support the project by, for
>> instance, buying the books Rob mentioned, but a short doc with the outline
>> Isabel describes would be great!
>> 
>> Many thanks,
>> 
>> Florent
>> 
>> 
>> On Jan 31, 2017 12:01, "Isabel Drost-Fromm" <[email protected]> wrote:
>> 
>> 
>> Hi,
>> 
>> On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
>>> and we're thinking about just how many pre-built algorithms we
>>> should include in the library versus working on performance behind the
>>> scenes.
>> 
>> To pick this question up: I've been watching Mahout from a distance for
>> quite some time. From what limited background I have of Samsara, I really
>> like its approach of being able to run on more than one execution engine.
>> 
>> To give some advice to downstream users in the field: what would be your
>> advice for people tasked with concrete use cases (stuff like fraud
>> detection, anomaly detection, learning search ranking functions, building
>> a recommender system)? Is that something that can still be done with
>> Mahout? What would it take to get from raw data to finished system? Is
>> there something we can do to help users get that accomplished? Is there
>> even interest from users in such a use-case-based perspective? If so,
>> would there be interest among the Mahout committers to help users publicly
>> create docs/examples/modules to support these use cases?
>> 
>> 
>> Isabel
>>
