Re: K-means with large K

2014-04-28 Thread Dean Wampler
You might also investigate other clustering algorithms, such as canopy
clustering and nearest-neighbor methods. Some of them are less accurate but
more computationally efficient; they are often used to compute approximate
clusters first, followed by k-means (or a variant thereof) for greater accuracy.
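The approximate-clusters-then-k-means idea can be sketched with a greedy canopy pass. This is a minimal pure-Python illustration, not anything from MLlib; the function name and the loose/tight thresholds t1/t2 are mine:

```python
import random

def canopy_clusters(points, t1, t2, seed=0):
    """Greedy canopy clustering: a cheap approximate pre-clustering step.
    t1 > t2 are the loose/tight distance thresholds; returns a list of
    (center, members) canopies."""
    assert t1 > t2
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # pick an arbitrary remaining point as the next canopy center
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        survivors = []
        for p in remaining:
            dist = sum((a - b) ** 2 for a, b in zip(center, p)) ** 0.5
            if dist < t1:
                members.append(p)    # within the loose threshold: joins this canopy
            if dist >= t2:
                survivors.append(p)  # outside the tight threshold: may seed a later canopy
        remaining = survivors
        canopies.append((center, members))
    return canopies
```

The canopy centers (or a separate k-means run restricted to each canopy) then seed the expensive exact clustering step.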

dean


On Mon, Apr 28, 2014 at 11:41 AM, Buttler, David  wrote:



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: K-means with large K

2014-04-28 Thread Matei Zaharia
Try turning on the Kryo serializer as described at 
http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions 
in the driver program’s log before this happens?
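For reference, enabling Kryo in a PySpark driver of that era looks roughly like this. This is a sketch only: it assumes a PySpark installation, and the app name is illustrative, not from the thread; `spark.serializer` is the property documented in the tuning guide linked above.

```python
# Sketch: assumes pyspark is installed; "kmeans-large-k" is an illustrative name.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kmeans-large-k")
        # switch from Java serialization to Kryo, per the Spark tuning guide
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)
```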

Matei

On Apr 28, 2014, at 9:19 AM, Buttler, David  wrote:




RE: K-means with large K

2014-04-28 Thread Buttler, David
One thing I have used this for was to create codebooks for SIFT features in 
images.  It is a common, though fairly naïve, method for converting 
high-dimensional features into simple word-like features.  Thus, if you have 200 
SIFT features for an image, you can reduce them to 200 ‘words’ that can be 
directly compared across your entire image set.  The drawback is that some 
parts of the feature space are usually much denser than others, so distinctive 
features can be lost.  You can try to minimize that by increasing K, but there 
are diminishing returns.  If you measure the quality of your clusters in such 
situations, you will find that the quality levels off between 1000 and 4000 
clusters (at least it did for my SIFT feature set; YMMV on other data sets).
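As a concrete illustration of the codebook step, here is a minimal pure-Python sketch (not tied to any SIFT library; both helper names are mine) that maps each descriptor to its nearest codebook center and builds the resulting word histogram:

```python
def quantize(descriptors, codebook):
    """Assign each descriptor the index of its nearest codebook center
    (squared Euclidean distance), i.e. its visual 'word'."""
    words = []
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        words.append(dists.index(min(dists)))
    return words

def word_histogram(words, k):
    """Bag-of-visual-words vector: how often each of the k words occurs."""
    hist = [0] * k
    for w in words:
        hist[w] += 1
    return hist
```

Images of different sizes thus all become fixed-length k-dimensional vectors that can be compared directly.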

Dave



Re: K-means with large K

2014-04-28 Thread Chester Chen
David,
  Just curious: what kinds of use cases demand such a large K?

Chester

Sent from my iPhone

On Apr 28, 2014, at 9:19 AM, "Buttler, David"  wrote:



K-means with large K

2014-04-28 Thread Buttler, David
Hi,
I am trying to run the K-means code in MLlib, and it works very nicely with 
small K (less than 1000).  However, when I try a larger K (I am looking for 
2000-4000 clusters), the code seems to get partway through (perhaps just the 
initialization step) and then freezes: the compute nodes stop doing any CPU / 
network / IO, and nothing happens for hours.  I had done something similar back 
in the days of Spark 0.6, and I didn't have any trouble going up to 4000 
clusters with similar data.
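For context on why large K is so much heavier: each k-means iteration (the part MLlib distributes) scales linearly with K. A minimal pure-Python sketch of one Lloyd step, purely illustrative and not the MLlib implementation:

```python
def lloyd_step(points, centers):
    """One Lloyd iteration: assign each point to its nearest center, then
    recompute each center as the mean of its assigned points.
    Cost is O(n * k * d), so runtime grows linearly with K."""
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    # keep the old center for any cluster that received no points
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
            for j in range(k)]
```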

This happens with both a standalone cluster, and in local multi-core mode (with 
the node given 200GB of heap), but eventually completes in local single-core 
mode.

Data statistics:
Rows: 166248
Columns: 108

This is a test run before trying it out on much larger data.

Any ideas on what might be the cause of this?

Thanks,
Dave