MLlib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Hi Sparkers,

I am trying to run MLlib KMeans on a large dataset (50+ GB of data) with a
large K, but I've encountered the following issues:


   - The Spark driver runs out of memory and dies because collect gets
   called as part of KMeans, which loads all the data back into the
   driver's memory.
   - At the end there is a LocalKMeans class which runs KMeansPlusPlus on
   the Spark driver. Why isn't this distributed? It spends a long time
   here, and it has the same problem as point 1: it requires loading the
   data onto the driver (a possible workaround is sketched below this
   list). While LocalKMeans is running on the driver I'm also seeing lots
   of:
   15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
   initialization ran out of distinct points for centers. Using duplicate
   point for center k = 222
   - Has the above behaviour been present in previous releases? I
   remember running KMeans before without too many problems.
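
For reference, here is a minimal sketch of roughly how I'm invoking it
(the input path, K, and iteration count are placeholder values; it
assumes spark-shell, where sc is in scope). Setting the initialization
mode to "random" skips the k-means|| step, and with it the
LocalKMeans/KMeansPlusPlus pass on the driver, which may avoid the
driver OOM at the cost of a potentially worse starting point:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse space-separated features into dense vectors and cache them,
    // since KMeans makes multiple passes over the data.
    val data = sc.textFile("hdfs:///path/to/features.txt") // placeholder path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // "random" initialization avoids the k-means|| phase that ends with
    // a KMeansPlusPlus run on the driver.
    val model = new KMeans()
      .setK(500)                            // illustrative large K
      .setMaxIterations(20)                 // illustrative
      .setInitializationMode(KMeans.RANDOM)
      .run(data)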

Looking forward to hearing you point out my stupidity or provide work-arounds
that could make Spark KMeans work well on large datasets.

Regards,
Sam Stoelinga


Re: MLlib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vectors to KMeans?
Are they in a 2-D space or a 1-D array?

Did you try using Streaming KMeans?

Would you be able to paste the code here?
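
For example, something like this minimal Streaming KMeans sketch (the
stream path, K, batch interval, and dimensionality are placeholders; it
assumes org.apache.spark.mllib.clustering.StreamingKMeans and a
spark-shell sc). The centers are updated incrementally per mini-batch,
so the full dataset never has to be held on the driver:

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10)) // placeholder interval

    // Each incoming line is a space-separated feature vector.
    val training = ssc.textFileStream("hdfs:///path/to/stream") // placeholder
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    val model = new StreamingKMeans()
      .setK(500)                  // illustrative
      .setDecayFactor(1.0)
      .setRandomCenters(100, 0.0) // dim must match the real feature size

    model.trainOn(training)
    ssc.start()
    ssc.awaitTermination()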


Re: MLlib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
I'm mostly using example code, see here:
http://paste.openstack.org/show/211966/
The data has 799305 dimensions and is separated by spaces.

Please note that the issues I'm seeing are, in my opinion, caused by the
Scala implementation, as they also occur when using the Python wrappers.
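
As a quick sanity check on the parsed input (a hypothetical snippet; it
assumes data is the RDD[Vector] produced by parsing the file), one can
verify that the per-row dimensionality and the row count both match
what's expected before running KMeans:

    // Hypothetical sanity check: confirm dimensionality and row count
    // match expectations before running KMeans.
    val sizes = data.map(_.size).distinct().collect()
    println(s"Distinct vector sizes: ${sizes.mkString(", ")}")
    println(s"Number of rows: ${data.count()}")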



Re: MLlib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Guys, great feedback pointing out my stupidity :D

Rows and columns got intermixed, hence the weird results I was seeing.
Please ignore my previous issues; I will reformat my data first.
