Mahout has changed a lot in the past couple of years, becoming more focused on
serving the needs of data workers and scientists who need to experiment
with large matrix math problems. To that end we've broadened the execution
engines that perform the distribution of computation to include Spark and
Flink, and we're thinking about just how many pre-built algorithms we
should include in the library versus working on performance behind the
scenes.

There is also a new R/MATLAB-like declarative language that allows for
interactive sessions at scale; see the "Mahout-Samsara" tab in the
navigation on the home page, http://mahout.apache.org.

This book was written by two of the major contributors to the new
declarative language and is worth taking a look at:

Thanks for your interest; we'll be happy to help you as you proceed if you
have any other questions.

On Fri, Sep 16, 2016 at 5:03 PM, Reth RM <reth.ik...@gmail.com> wrote:
> I am trying to learn the key differences between Mahout ML and Spark ML, and
> then the Mahout-Spark integration, specifically for clustering algorithms. I
> learned through forums and blog posts that one of the major differences is
> that Mahout runs as a batch process while Spark is backed by streaming APIs.
> But I do see Mahout-Spark integration as well. So I'm slightly confused and
> would like to know the major differences that should be considered (looked
> into).
> I'm working on a new research project that requires clustering of
> documents (50M webpages for now), and the focus is only on using clustering
> algorithms and the LSH implementation. Right now, I have started
> experimenting with mahout-kmean (standalone, not the streaming-kmean) and have
> also looked into LSH, which is again available in both frameworks, so the
> above questions arise at this point.
> Looking forward to hearing thoughts and insights from all users here.
> Thank you.