MLlib KMeans on large dataset issues
Hi Sparkers,

I am trying to run MLlib KMeans on a large dataset (50+ GB) with a large K, but I've encountered the following issues:

- The Spark driver runs out of memory and dies because collect gets called as part of KMeans, which loads all the data back into the driver's memory.
- At the end, a LocalKMeans class runs KMeansPlusPlus on the Spark driver. Why isn't this distributed? It spends a long time here, and it has the same problem as point 1: it requires loading the data onto the driver. While LocalKMeans is running on the driver I also see lots of:
  15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus initialization ran out of distinct points for centers. Using duplicate point for center k = 222
- Has the above behaviour been like this in previous releases? I remember running KMeans before without too many problems.

Looking forward to hearing you point out my stupidity, or to work-arounds that could make Spark KMeans work well on large datasets.

Regards,
Sam Stoelinga
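For reference, the warning quoted above comes from the k-means++ seeding step: once every candidate point coincides with an already-chosen center, the only option left is to reuse a point. The sketch below is a minimal pure-Python illustration of that situation, not the MLlib implementation; the function name and the tiny dataset are made up for the example.

```python
import random

def kmeans_pp_seed(points, k, seed=0):
    """Toy k-means++ seeding on 1-D points.

    If there are fewer distinct points than k, every remaining
    center is a duplicate of an existing point -- the situation
    MLlib's LocalKMeans warns about.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        if total == 0:
            # No distinct points left: reuse an existing point as a center.
            centers.append(rng.choice(points))
            continue
        # Sample the next center proportionally to d2 (the "++" rule).
        r = rng.uniform(0, total)
        acc = 0.0
        for p, d in zip(points, d2):
            acc += d
            if acc >= r:
                centers.append(p)
                break
    return centers

# Only 3 distinct values but k = 5: at least two centers must be duplicates.
centers = kmeans_pp_seed([1.0, 1.0, 2.0, 3.0], k=5)
print(len(centers), len(set(centers)))  # 5 centers, at most 3 distinct
```

So a large K combined with data that has many repeated (or near-duplicate) rows makes this warning much more likely.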
Re: MLlib KMeans on large dataset issues
How are you passing the feature vectors to KMeans? Are they in 2-D space or a 1-D array? Did you try using Streaming KMeans? Could you paste your code here?

On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote:
[quoted text hidden]
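On the Streaming KMeans suggestion: it sidesteps the big one-shot fit by updating each center incrementally per mini-batch with a decay factor. A hedged one-dimensional sketch of that update rule (an illustration of the idea, not the MLlib code; the function name is made up):

```python
def update_center(center, n, batch_points, decay=1.0):
    """One StreamingKMeans-style update for a single cluster center.

    center: current center (a single float here for simplicity)
    n:      effective count of points already assigned to this cluster
    decay:  1.0 = remember all history, 0.0 = forget the past each batch
    Returns the new (center, count). This mirrors the weighted-average
    update rule described for streaming k-means, reduced to 1-D.
    """
    m = len(batch_points)
    if m == 0:
        return center, n * decay
    batch_mean = sum(batch_points) / m
    new_n = n * decay + m
    # Weighted average of the old center and the new batch mean.
    new_center = (center * n * decay + batch_mean * m) / new_n
    return new_center, new_n

c, n = update_center(0.0, 0, [2.0, 4.0])  # first batch: center = batch mean
print(c, n)  # 3.0 2
c, n = update_center(c, n, [6.0, 6.0])    # second batch pulls it upward
print(c, n)  # 4.5 4
```

With decay < 1.0 older batches are down-weighted, which is what lets the model track drifting data.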
Re: MLlib KMeans on large dataset issues
I'm mostly using the example code, see here: http://paste.openstack.org/show/211966/

The data has 799305 dimensions and is separated by spaces. Please note that the issues I'm seeing are, imo, in the Scala implementation, as they also happen when using the Python wrappers.

On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele gangele...@gmail.com wrote:
[quoted text hidden]
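The standard Spark KMeans example parses each text line with `line.split(' ')` mapped through `float` before wrapping it in a dense vector. A plain-Python sketch of that step (lists stand in for MLlib vectors here so it runs without a cluster; the sample line is made up):

```python
def parse_line(line):
    """Turn one space-separated text line into a dense feature vector.

    Mirrors the parsing step of the usual Spark KMeans example:
    line.split(' ') -> float -> Vectors.dense, with a plain list
    standing in for the MLlib vector type.
    """
    return [float(x) for x in line.strip().split(' ')]

sample = "0.0 1.5 2.25"
vec = parse_line(sample)
print(len(vec), vec)  # 3 [0.0, 1.5, 2.25]
```

With this layout, `len(vec)` is the dimensionality, so one line per sample should yield the same width on every line; 799305-wide rows would be a red flag worth double-checking.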
Re: MLlib KMeans on large dataset issues
Guys, great feedback by pointing out my stupidity :D Rows and columns got intermixed, hence the weird results I was seeing. Ignore my previous issues; I will reformat my data first.

On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga sammiest...@gmail.com wrote:
[quoted text hidden]
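That row/column mix-up is easy to guard against: KMeans expects one row per sample and one column per dimension, so a "dimension count" close to the number of lines in the file is a strong hint the data is transposed. A small sanity-check sketch (the helper name and the toy data are made up; `zip(*rows)` transposes a rectangular list of lists):

```python
def check_orientation(rows, expected_dims):
    """Assert every parsed row has the expected width.

    KMeans wants one vector per *sample*; if the row width looks
    like your sample count instead of your feature count, the
    data is probably transposed.
    """
    widths = {len(r) for r in rows}
    assert widths == {expected_dims}, f"got row widths {widths}"
    return rows

# Data accidentally written one *dimension* per line (3 samples, 2 dims):
by_dimension = [[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]]
samples = [list(col) for col in zip(*by_dimension)]  # transpose back
check_orientation(samples, expected_dims=2)
print(samples)  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```

Running the check on the raw, untransposed rows would fail immediately, which is cheaper than discovering the problem via strange cluster centers.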