Hello - I'm a Mahout newbie as well.
In the case outlined below, does that mean each node of a Hadoop cluster would need to hold the centroid information fully in memory for k-means, or is this spread over the cluster in some way? If each node has to hold the centroid information fully in memory, are there any other data structures which need to be fully in memory on each node, and if so, what are they proportional to (again, specifically for k-means)? I.e., is anything memory-resident related to the number of documents?

If the centroid information (proportional to the number of features and clusters) needs to be fully in memory on all Hadoop nodes, but nothing related to the number of documents does, then k-means would be scalable in the number of documents (just add more Hadoop nodes to increase document throughput) but *not* scalable in the number of clusters/features, since the algorithm requires a full copy of this information on each node. Is this accurate?

-- marc

> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will
> tend to become dense).
>
> I recommend you decrease your input dimensionality to 10^5 - 10^6. This
> could decrease your memory needs to 4GB at the low end.
>
> What kind of input do you have?
>
> On Fri, Feb 4, 2011 at 7:50 AM, james q <[email protected]> wrote:
>
> > I think the job had 5000 - 6000 clusters. The input (sparse) vectors
> > had a dimension of 6838856.
> >
> > -- james
> >
> > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <[email protected]> wrote:
> >
> > > How many clusters?
> > >
> > > How large is the dimension of your input data?
> > >
> > > On Thu, Feb 3, 2011 at 9:05 PM, james q <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > New user to Mahout and Hadoop here. Isabel Drost suggested to a
> > > > colleague that I should post to the mahout user list, as I am having
> > > > some general difficulties with memory consumption and KMeans
> > > > clustering.
> > > > So a general question first and foremost: what determines how much
> > > > memory a map task consumes during a KMeans clustering job? Increasing
> > > > the number of map tasks by adjusting dfs.block.size and
> > > > mapred.max.split.size doesn't seem to make each map task consume less
> > > > memory, or at least not by a very noticeable amount. I figured that if
> > > > there are more map tasks, each individual map task evaluates fewer
> > > > input keys and hence would consume less memory. Is there any way to
> > > > predict the memory usage of map tasks in KMeans?
> > > >
> > > > The cluster I am running consists of 10 machines, each with 8 cores
> > > > and 68G of RAM. I've configured the cluster so that each machine runs
> > > > at most 7 map or reduce tasks. I set the map and reduce tasks to have
> > > > virtually no limit on memory consumption ... so with 7 processes each,
> > > > at around 9 - 10G per process, the machines will crap out. I can
> > > > reduce the number of map tasks per machine, but something tells me
> > > > that that level of memory consumption is wrong.
> > > >
> > > > If any more information is needed to help debug this, please let me
> > > > know! Thanks!
> > > >
> > > > -- james
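For what it's worth, Ted's 273GB figure is just clusters x dimensionality x 8 bytes per double. A back-of-the-envelope sketch of that arithmetic (assuming dense double-precision centroids, which Ted notes they tend to become):

```python
def centroid_memory_bytes(num_clusters, num_features, bytes_per_value=8):
    """Rough memory needed to hold all k-means centroids as dense doubles."""
    return num_clusters * num_features * bytes_per_value

# james's job: 5000 clusters over 6838856-dimensional vectors
print(centroid_memory_bytes(5000, 6838856) / 1e9)  # ~273 GB

# After reducing dimensionality to 10^5, as Ted suggests:
print(centroid_memory_bytes(5000, 10**5) / 1e9)    # 4 GB
```

This also shows why splitting the input into more map tasks doesn't help: the centroid array is a fixed cost per task, independent of how many input vectors each task sees.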

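Regarding marc's scalability question: as I understand it, Mahout's k-means job follows the standard MapReduce formulation, where every mapper holds the full set of centroids in memory and streams input vectors one at a time. So per-node memory scales with clusters x features, but not with the number of documents. A minimal sketch of that pattern in plain Python (not Mahout's actual API; `kmeans_map` and `kmeans_reduce` are hypothetical names):

```python
def kmeans_map(vectors, centroids):
    """One map task: all centroids resident in memory; vectors streamed.

    Emits (cluster_index, (vector, count)) for each vector's closest
    centroid. Memory is O(clusters * dims), independent of input size."""
    for v in vectors:  # streamed, so only one vector is resident at a time
        best = min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(v, centroids[i])))
        yield best, (v, 1)

def kmeans_reduce(pairs, num_clusters, dim):
    """Reduce: sum assigned vectors per cluster, then recompute centroids."""
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for i, (v, n) in pairs:
        counts[i] += n
        for d in range(dim):
            sums[i][d] += v[d]
    # Empty clusters keep their zero sums; everything else is averaged.
    return [[s / counts[i] for s in sums[i]] if counts[i] else sums[i]
            for i in range(num_clusters)]
```

One iteration on a toy dataset: `kmeans_reduce(kmeans_map([[1.0], [2.0], [9.0], [11.0]], [[0.0], [10.0]]), 2, 1)` pulls the two centroids toward the data. If this is indeed the structure of the job, then yes - it scales in documents but not in clusters/features, exactly as marc suspects.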