I am trying to learn the key differences between Mahout ML and Spark ML, and
then the Mahout-Spark integration, specifically for clustering algorithms. I
learned from forums and blog posts that one of the major differences is that
Mahout runs as a batch process while Spark is backed by streaming APIs. But I
do see a Mahout-Spark integration as well. So I'm slightly confused and would
like to know which major differences I should look into.
I'm working on a new research project that requires clustering of
documents (50M webpages for now), and the focus is only on clustering
algorithms and the LSH implementation. Right now, I have started
experimenting with Mahout k-means (standalone, not streaming k-means) and
have also looked into LSH, which is again available in both frameworks, so
the above questions are arising at this point.
Looking forward to hearing thoughts and insights from all users here.