That's really the big challenge with k-means (and probably with any of the other clustering algorithms too) for text clustering: the centroids tend to become dense, and memory consumption skyrockets. I wonder if the centroid calculation could be made smarter by setting an underflow limit and forcing close-to-zero terms to be exactly zero? I guess the challenge would be selecting this limit dynamically. Or perhaps implementing an approximating vector which only retains its n most significant terms? Thin ice here...
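A rough sketch of what I mean, in Python rather than Mahout's Java (the function name, the `n_terms` cap, and the `epsilon` underflow limit are all my own illustrative choices, not anything Mahout provides):

```python
import heapq

def sparsify_centroid(centroid, n_terms=1000, epsilon=1e-6):
    """Approximate a dense centroid by keeping only its n
    largest-magnitude terms; anything below an absolute
    underflow limit is dropped outright.

    centroid: dict mapping term index -> weight (a sparse vector).
    """
    # First pass: drop near-zero terms (the underflow limit).
    filtered = {i: w for i, w in centroid.items() if abs(w) > epsilon}
    # Second pass: retain only the n most significant terms.
    top = heapq.nlargest(n_terms, filtered.items(), key=lambda kv: abs(kv[1]))
    return dict(top)

c = {0: 0.9, 1: 1e-9, 2: -0.5, 3: 0.01, 4: 0.3}
print(sparsify_centroid(c, n_terms=2))  # keeps only indices 0 and 2
```

The lossiness is the catch, of course: dropped terms shift the centroid slightly on every iteration, so convergence behavior would need checking.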
-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Friday, February 04, 2011 7:54 AM
To: [email protected]
Subject: Re: Memory Issue with KMeans clustering

5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will
tend to become dense)

I recommend you decrease your input dimensionality to 10^5 - 10^6. This
could decrease your memory needs to 4GB at the low end.

What kind of input do you have?

On Fri, Feb 4, 2011 at 7:50 AM, james q <[email protected]> wrote:

> I think the job had 5000 - 6000 clusters. The input (sparse) vectors had
> a dimension of 6838856.
>
> -- james
>
> On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <[email protected]> wrote:
>
> > How many clusters?
> >
> > How large is the dimension of your input data?
> >
> > On Thu, Feb 3, 2011 at 9:05 PM, james q <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > New user to mahout and hadoop here. Isabel Drost suggested to a
> > > colleague I should post to the mahout user list, as I am having some
> > > general difficulties with memory consumption and KMeans clustering.
> > >
> > > So a general question first and foremost: what determines how much
> > > memory a map task consumes during a KMeans clustering job? Increasing
> > > the number of map tasks by adjusting dfs.block.size and
> > > mapred.max.split.size doesn't seem to make each map task consume less
> > > memory, or at least not a very noticeable amount. I figured that with
> > > more map tasks, each individual map task evaluates fewer input keys
> > > and hence there would be less memory consumption. Is there any way to
> > > predict the memory usage of map tasks in KMeans?
> > >
> > > The cluster I am running consists of 10 machines, each with 8 cores
> > > and 68G of ram. I've configured the cluster to have each machine run
> > > at most 7 map or reduce tasks. I set the map and reduce tasks to have
> > > virtually no limit on memory consumption ... so with 7 processes
> > > each, at around 9 - 10G per process, the machines will crap out. I
> > > can reduce the number of map tasks per machine, but something tells
> > > me that that level of memory consumption is wrong.
> > >
> > > If any more information is needed to help debug this, please let me
> > > know! Thanks!
> > >
> > > -- james
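Ted's back-of-the-envelope figure is easy to reproduce: one 8-byte double per dimension per cluster, once the centroids go dense. A quick check (the helper function is mine, just for the arithmetic):

```python
def centroid_memory_bytes(n_clusters, dims, bytes_per_entry=8):
    """Lower bound on memory for dense centroids: one double per
    dimension per cluster (ignores any per-object overhead)."""
    return n_clusters * dims * bytes_per_entry

# The thread's numbers: 5000 clusters x 6838856 dims x 8 bytes.
print(round(centroid_memory_bytes(5000, 6838856) / 1e9, 1))   # 273.6 (GB)

# After reducing dimensionality to 10^5, as Ted suggests:
print(round(centroid_memory_bytes(5000, 100_000) / 1e9, 1))   # 4.0 (GB)
```

This also explains why splitting the input into more map tasks doesn't help: every mapper holds the full set of centroids regardless of how few input vectors it sees.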
