Nearest point to the centroid instead of average of points

On Tue, Feb 8, 2011 at 12:35 AM, Robin Anil <[email protected]> wrote:

> We can probably find the nearest point to the centroid, instead of
> averaging the points out. This way the centroid vector won't grow big.
> What do you think about that, Ted, Jeff?
>
> On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <[email protected]> wrote:
>
>> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which
>> will tend to become dense).
>>
>> I recommend you decrease your input dimensionality to 10^5 - 10^6.
>> This could decrease your memory needs to 4GB at the low end.
>>
>> What kind of input do you have?
>>
>> On Fri, Feb 4, 2011 at 7:50 AM, james q <[email protected]>
>> wrote:
>>
>>> I think the job had 5000 - 6000 clusters. The input (sparse) vectors
>>> had a dimension of 6838856.
>>>
>>> -- james
>>>
>>> On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <[email protected]>
>>> wrote:
>>>
>>>> How many clusters?
>>>>
>>>> How large is the dimension of your input data?
>>>>
>>>> On Thu, Feb 3, 2011 at 9:05 PM, james q <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> New user to Mahout and Hadoop here. Isabel Drost suggested to a
>>>>> colleague that I should post to the Mahout user list, as I am
>>>>> having some general difficulties with memory consumption and
>>>>> KMeans clustering.
>>>>>
>>>>> So a general question first and foremost: what determines how much
>>>>> memory a map task consumes during a KMeans clustering job?
>>>>> Increasing the number of map tasks by adjusting dfs.block.size and
>>>>> mapred.max.split.size doesn't seem to make each map task consume
>>>>> less memory, or at least not by a very noticeable amount. I
>>>>> figured that if there are more map tasks, each individual map task
>>>>> evaluates fewer input keys and hence there would be less memory
>>>>> consumption. Is there any way to predict the memory usage of map
>>>>> tasks in KMeans?
>>>>>
>>>>> The cluster I am running consists of 10 machines, each with 8
>>>>> cores and 68G of RAM. I've configured the cluster so that each
>>>>> machine, at maximum, runs 7 map or reduce tasks. I set the map and
>>>>> reduce tasks to have virtually no limit on memory consumption ...
>>>>> so with 7 processes each, at around 9 - 10G per process, the
>>>>> machines will crap out. I can reduce the number of map tasks per
>>>>> machine, but something tells me that that level of memory
>>>>> consumption is wrong.
>>>>>
>>>>> If any more information is needed to help debug this, please let
>>>>> me know! Thanks!
>>>>>
>>>>> -- james
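Ted's back-of-envelope estimate in the thread (clusters x dimensions x 8 bytes per double, once averaging has densified the centroids) can be reproduced with a short sketch. The function name and the decimal-GB convention here are my own for illustration; this is not Mahout code:

```python
def centroid_memory_bytes(num_clusters: int, num_dims: int,
                          bytes_per_entry: int = 8) -> int:
    """Worst-case centroid storage: once the running averages densify,
    each centroid holds one 8-byte double per input dimension."""
    return num_clusters * num_dims * bytes_per_entry

# The numbers from the thread: 5000 clusters over dimension 6838856.
print(centroid_memory_bytes(5000, 6838856) / 1e9)  # ~273.6 GB

# Ted's suggested low end: reduce dimensionality to 10^5.
print(centroid_memory_bytes(5000, 10**5) / 1e9)    # 4.0 GB
```

Since every map task holds a full copy of the centroids, this figure is per task, which is consistent with the 9 - 10G-per-process behavior james reports only if fewer, smaller centroids are kept.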

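Robin's suggestion at the top of the thread, keeping the cluster member nearest to the mean (a medoid-style update) rather than the mean itself, can be sketched as follows. This is a plain-Python illustration with sparse vectors represented as index-to-value dicts; the helper names are hypothetical and do not correspond to any Mahout API:

```python
def sparse_add(acc, vec):
    """Accumulate sparse vector `vec` (dict of index -> value) into `acc`."""
    for i, v in vec.items():
        acc[i] = acc.get(i, 0.0) + v
    return acc

def sparse_dist2(a, b):
    """Squared Euclidean distance between two sparse dicts."""
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in set(a) | set(b))

def medoid_update(points):
    """Return the cluster member nearest to the cluster mean.

    The mean unions the nonzeros of every member and so tends to
    densify; returning a member instead keeps the stored representative
    exactly as sparse as one input point.
    """
    mean = {}
    for p in points:
        sparse_add(mean, p)
    n = len(points)
    mean = {i: v / n for i, v in mean.items()}
    return min(points, key=lambda p: sparse_dist2(p, mean))

cluster = [{0: 1.0}, {0: 1.0, 5: 0.2}, {7: 3.0}]
rep = medoid_update(cluster)
# rep is one of the original sparse points, not a dense average
```

Note the trade-off: the dense mean is still formed transiently during each update, but only one cluster at a time, while the representatives that are stored and shipped to every map task stay as sparse as the input points, which is where the 273GB figure in the thread comes from.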