Trevor gave a great presentation at our user group. It was live-streamed on Periscope. Trevor - maybe you could share the URL? I don’t have it handy at the moment.
Scott

> On Jan 31, 2017, at 8:50 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>
> Hello Isabel and Florent,
>
> I'm currently working on a side-by-side demo of R / Python / SparkML(Mllib)
> / Mahout, but in very broad strokes here is how I would compare them:
>
> R- Most statistical functionality. Most flexibility. Implement your own
> algorithms- mathematically expressive language. Worst performance- handles
> only "small" data sets. Language is 'math centric'. Easy to extend /
> create new algos
>
> Python (sklearn/scikit) - Some mathematical / statistical functionality,
> more focused on machine learning. Machine learning library very
> sophisticated though. Much better performance than R, still only single
> node. "small to medium" data sets. Language is 'programmer centric'.
> Somewhat difficult to extend / create new algos
>
> SparkML / Mllib - Very Limited Mathematical functionality (usually collects
> to driver to do anything of substance). Machine learning rudimentary
> compared to sklearn, but still non-trivial one of the best available.
> Exceeding performance, well suited to "big" data sets. Language is
> 'programmer centric'. Very difficult to extend / create new algos.
>
> (FlinkML - Fits in same spot as SparkML, but significantly less developed)
>
> Mahout - Good mathematical functionality. Good performance relative to
> underlying engine (possibly superior with MAHOUT-1885). Language is 'math
> centric'. Well suited to "medium and big" data sets. Fairly easy to extend
> / create new algos (MAHOUT-1856)
>
> I hope that provides a high level comparison.
>
> Re use cases- the tool to use depends on the job at hand.
> Highly advanced mathematical model, small dataset or sampling from full
> dataset OK -> Use R
> Machine learning on small to medium data set or sampling from full dataset
> OK -> Use Python / sklearn
> Less sophisticated machine learning on Large dataset -> SparkML
> Custom mathematical/statistical model on medium to large data -> Mahout
>
> ^^ All of this is just my opinion.
>
> Re: integration-
>
> We're working on that too. Recently MAHOUT-1896 added convenience methods
> for interacting with MLLib type RDDs, and DataFrames
> https://issues.apache.org/jira/browse/MAHOUT-1896
>
> (No support yet for SparkML type dataframes, or spitting DRMs back out into
> RDDs/DataFrames).
>
> Finally Docs: There has been some talk for some time of migrating the
> website from CMS to Jekyll and it's something I strongly support. The CMS
> makes it difficult to keep up with documentation, and Jekyll would open up
> documentation / website maintenance to contributors.
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>
>
> On Tue, Jan 31, 2017 at 5:31 AM, Florent Empis <florent.em...@gmail.com> wrote:
>
>> Hi,
>>
>> I am in the same spot as Isabel.
>> Used to use/understand most of the «old» standalone Mahout, now doing some
>> data transformation with Spark, but I am not sure where Samsara fits in the
>> ecosystem.
>> We also do quite a bit of computation in R.
>> Basically we are willing to learn and support the project by, for instance,
>> buying the books Rob mentioned, but a short doc with the outline Isabel
>> describes would be great!
>>
>> Many thanks,
>>
>> Florent
>>
>>
>> On Jan 31, 2017 12:01, "Isabel Drost-Fromm" <isa...@apache.org> wrote:
>>
>>
>> Hi,
>>
>> On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
>>> and we're thinking about just how many pre-built algorithms we
>>> should include in the library versus working on performance behind the
>>> scenes.
>>
>> To pick this question up: I've been watching Mahout from a distance for
>> quite some time. So from what limited background I have of Samsara I really
>> like its approach of being able to run on more than one execution engine.
>>
>> To give some advice to downstream users in the field - what would be your
>> advice for people tasked with concrete use cases (stuff like fraud
>> detection, anomaly detection, learning search ranking functions, building a
>> recommender system)? Is that something that can still be done with Mahout?
>> What would it take to get from raw data to finished system? Is there
>> something we can do to help users get that accomplished? Is there even
>> interest from users in such a use case based perspective? If so, would
>> there be interest among the Mahout committers to help users publicly create
>> docs/examples/modules to support these use cases?
>>
>>
>> Isabel
>>