Re: K-means with large K
You might also investigate other clustering algorithms, such as canopy clustering and nearest neighbors. Some of them are less accurate, but more computationally efficient. Often they are used to compute approximate clusters, followed by k-means (or a variant thereof) for greater accuracy.

dean

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
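For illustration, a minimal single-machine sketch of canopy clustering used as a cheap pre-clustering step before k-means. The thresholds t1/t2, the euclidean() helper, and all names below are assumptions made for the example, not anything discussed in this thread:

import scala.collection.mutable

// Canopy clustering: cheap, approximate grouping controlled by two distance
// thresholds, t1 (loose) > t2 (tight). The resulting canopy centers can seed
// a subsequent k-means run for better accuracy.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def canopies(points: Seq[Array[Double]], t1: Double, t2: Double): Seq[(Array[Double], Seq[Array[Double]])] = {
  require(t1 > t2, "the loose threshold t1 must exceed the tight threshold t2")
  val remaining = mutable.ListBuffer(points: _*)
  val result = mutable.ListBuffer.empty[(Array[Double], Seq[Array[Double]])]
  while (remaining.nonEmpty) {
    val center = remaining.head                                     // an arbitrary remaining point becomes a center
    val members = points.filter(p => euclidean(center, p) <= t1)    // loose canopy membership
    result += ((center, members))
    // Points tightly covered by this canopy can no longer become centers themselves.
    val covered = remaining.filter(p => euclidean(center, p) <= t2).toList
    remaining --= covered
  }
  result.toList
}

The canopy centers (or one representative per canopy) can then be used as initial centroids for k-means, trading some accuracy for a much cheaper start than a full initialization with very large K.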
Re: K-means with large K
Try turning on the Kryo serializer, as described at http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions in the driver program’s log before this happens?

Matei
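For example, a minimal sketch of switching to Kryo when building the SparkContext, per the tuning guide; the application name is illustrative, and registering your own classes through spark.kryo.registrator is optional:

import org.apache.spark.{SparkConf, SparkContext}

// Replace the default Java serialization with Kryo, as the tuning guide suggests.
val conf = new SparkConf()
  .setAppName("kmeans-large-k")   // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)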
RE: K-means with large K
One thing I have used this for was to create codebooks for SIFT features in images. It is a common, though fairly naïve, method for converting high-dimensional features into simple word-like features. Thus, if you have 200 SIFT features for an image, you can reduce that to 200 ‘words’ that can be directly compared across your entire image set.

The drawback is that usually some parts of the feature space are much more dense than other parts, and distinctive features could be lost. You can try to minimize that by increasing K, but there are diminishing returns. If you measure the quality of your clusters in such situations, you will find that the quality levels off between 1000 and 4000 clusters (at least it did for my SIFT feature set; YMMV on other data sets).

Dave
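For illustration, a rough sketch of the codebook workflow described above, using MLlib k-means to quantize SIFT descriptors into visual ‘words’. The descriptor RDD, dimensionality, and function names are assumptions for the example, and the RDD[Vector] signature is the Spark 1.0+ API (earlier releases take RDD[Array[Double]]):

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// descriptors: one SIFT descriptor (e.g. 128 dimensions) per row, pooled across the image set.
def buildCodebook(descriptors: RDD[Vector], k: Int): KMeansModel =
  KMeans.train(descriptors, k, 20)   // k = codebook size; quality tends to level off as K grows

// Map one image's descriptors to cluster indices ("words"), so images can be compared
// as bags of visual words regardless of how many descriptors each image produced.
def toVisualWords(codebook: KMeansModel, imageDescriptors: Seq[Vector]): Seq[Int] =
  imageDescriptors.map(d => codebook.predict(d))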
Re: K-means with large K
David,

Just curious to know what kind of use cases demand such large k clusters?

Chester

Sent from my iPhone
K-means with large K
Hi,

I am trying to run the K-means code in mllib, and it works very nicely with small K (less than 1000). However, when I try for a larger K (I am looking for 2000-4000 clusters), it seems like the code gets part way through (perhaps just the initialization step) and freezes. The compute nodes stop doing any CPU / network / IO and nothing happens for hours. I had done something similar back in the days of Spark 0.6, and I didn't have any trouble going up to 4000 clusters with similar data.

This happens with both a standalone cluster, and in local multi-core mode (with the node given 200GB of heap), but eventually completes in local single-core mode.

Data statistics:
Rows: 166248
Columns: 108

This is a test run before trying it out on much larger data.

Any ideas on what might be the cause of this?

Thanks,
Dave
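For reference, a minimal sketch of the kind of run described above, using the Spark 1.0+ MLlib API (earlier releases take RDD[Array[Double]] instead of RDD[Vector]); the input path, file format, and parameter values are illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local[8]", "kmeans-large-k")   // or a standalone master URL

// One dense 108-dimensional row per line (illustrative path and CSV format).
val data = sc.textFile("hdfs:///path/to/features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()   // k-means makes many passes over the data

// K in the 2000-4000 range, as in the report; the default initialization is k-means||,
// which is the step that appears to stall here.
val model = KMeans.train(data, 2000, 20)
println(model.clusterCenters.length)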