Mahout used MapReduce and launched one MR job per iteration. It performed as predicted. My question is more about why Spark was so slow. I would try MEMORY_AND_DISK_SER.
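For reference, a minimal sketch of that change, not the actual benchmark (which is in the attachment): the input path is a hypothetical HDFS location, and k = 10 / 40 iterations are illustrative values. The point is simply to persist the input RDD with StorageLevel.MEMORY_AND_DISK_SER before handing it to KMeans.train:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

object KMeansSerPersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-ser-persist"))

    // Hypothetical input: one point per line, space-separated doubles.
    val points = sc.textFile("hdfs:///tmp/kmeans-input")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      // Store points serialized: smaller heap footprint than MEMORY_AND_DISK,
      // at the cost of deserializing them on every iteration.
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Illustrative values: k = 10 clusters, 40 iterations.
    val model = KMeans.train(points, 10, 40)
    println(s"Cost: ${model.computeCost(points)}")

    points.unpersist()
    sc.stop()
  }
}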
2014-03-25 17:58 GMT+04:00 Suneel Marthi <suneel_mar...@yahoo.com>:

> Mahout does have a kmeans which can be executed in mapreduce and iterative
> modes.
>
> Sent from my iPhone
>
> On Mar 25, 2014, at 9:25 AM, Prashant Sharma <scrapco...@gmail.com> wrote:
>
> I think Mahout uses FuzzyKmeans, which is a different algorithm and it is
> not iterative.
>
> Prashant Sharma
>
>
> On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>
>> Hi, I'm running a benchmark which compares Mahout and SparkML. So far I have
>> the following results for k-means:
>>
>> Number of iterations = 10, number of elements = 10000000, mahout time = 602, spark time = 138
>> Number of iterations = 40, number of elements = 10000000, mahout time = 1917, spark time = 330
>> Number of iterations = 70, number of elements = 10000000, mahout time = 3203, spark time = 388
>> Number of iterations = 10, number of elements = 100000000, mahout time = 1235, spark time = 2226
>> Number of iterations = 40, number of elements = 100000000, mahout time = 2755, spark time = 6388
>> Number of iterations = 70, number of elements = 100000000, mahout time = 4107, spark time = 10967
>> Number of iterations = 10, number of elements = 1000000000, mahout time = 7070, spark time = 25268
>>
>> Times are in seconds. It runs on a YARN cluster with about 40 machines.
>> The elements to cluster are randomly generated. When I changed the persistence
>> level from MEMORY_ONLY to MEMORY_AND_DISK, Spark started to work faster on the
>> larger data sets.
>>
>> What am I missing?
>>
>> See my benchmarking code in the attachment.
>>
>>
>> --
>> Sincerely yours
>> Egor Pakhomov
>> Scala Developer, Yandex
>>
>

--
Sincerely yours
Egor Pakhomov
Scala Developer, Yandex