On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <[email protected]> wrote:
> The problem is that the centroids are the average of many documents. This
> means that the number of non-zero elements in each centroid vector
> increases as the number of documents increases.

If we approximate each centroid by the input point nearest to it, then,
given a lot of input data, the centroids become real points (part of the
input dataset) instead of imaginary ones (averages). Some loss is incurred
here. Hashed encoding would be an easier solution, though the same or
similar loss is incurred there as well, due to collisions.

> On Mon, Feb 7, 2011 at 11:05 AM, Robin Anil <[email protected]> wrote:
>
> > We can prolly find the nearest centroid, instead of averaging it out.
> > This way the centroid vector won't grow big? What do you think about
> > that Ted, Jeff?
> >
> > On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <[email protected]> wrote:
> >
> > > 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which
> > > will tend to become dense).
> > >
> > > I recommend you decrease your input dimensionality to 10^5 - 10^6.
> > > This could decrease your memory needs to 4GB at the low end.
> > >
> > > What kind of input do you have?
> > >
> > > On Fri, Feb 4, 2011 at 7:50 AM, james q <[email protected]> wrote:
> > >
> > > > I think the job had 5000 - 6000 clusters. The input (sparse)
> > > > vectors had a dimension of 6838856.
> > > >
> > > > -- james
> > > >
> > > > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > > How many clusters?
> > > > >
> > > > > How large is the dimension of your input data?
> > > > >
> > > > > On Thu, Feb 3, 2011 at 9:05 PM, james q <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > New user to Mahout and Hadoop here. Isabel Drost suggested to
> > > > > > a colleague I should post to the mahout user list, as I am
> > > > > > having some general difficulties with memory consumption and
> > > > > > KMeans clustering.
> > > > > >
> > > > > > So a general question first and foremost: what determines how
> > > > > > much memory a map task consumes during a KMeans clustering
> > > > > > job? Increasing the number of map tasks by adjusting
> > > > > > dfs.block.size and mapred.max.split.size doesn't seem to make
> > > > > > the map task consume less memory, or at least not a very
> > > > > > noticeable amount. I figured that if there are more map tasks,
> > > > > > each individual map task evaluates fewer input keys and hence
> > > > > > there would be less memory consumption. Is there any way to
> > > > > > predict the memory usage of map tasks in KMeans?
> > > > > >
> > > > > > The cluster I am running consists of 10 machines, each with 8
> > > > > > cores and 68G of RAM. I've configured the cluster to have each
> > > > > > machine run, at maximum, 7 map or reduce tasks. I set the map
> > > > > > and reduce tasks to have virtually no limit on memory
> > > > > > consumption ... so with 7 processes each, at around 9 - 10G
> > > > > > per process, the machines will crap out. I can reduce the
> > > > > > number of map tasks per machine, but something tells me that
> > > > > > that level of memory consumption is wrong.
> > > > > >
> > > > > > If any more information is needed to help debug this, please
> > > > > > let me know! Thanks!
> > > > > >
> > > > > > -- james
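To make Ted's arithmetic concrete: dense centroids cost one double (8
bytes) per dimension per cluster. This is just a back-of-envelope sketch
(illustrative Python, not Mahout code; the function name is my own):

```python
def centroid_memory_bytes(num_clusters, dimensionality, bytes_per_element=8):
    """Worst-case memory for k-means centroids once they become dense:
    one 8-byte double per dimension, per cluster."""
    return num_clusters * dimensionality * bytes_per_element

# The figures from the thread:
full = centroid_memory_bytes(5000, 6838856)   # original dimensionality
low = centroid_memory_bytes(5000, 10**5)      # after reducing to 10^5
print(f"{full / 1e9:.1f} GB")   # ~273.6 GB, matching Ted's 273GB estimate
print(f"{low / 1e9:.1f} GB")    # 4.0 GB, the "4GB at the low end"
```

This is why shrinking the input dimensionality helps so much: centroid
memory scales linearly with it, independent of how sparse the input
vectors are.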

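For readers unfamiliar with the "hashed encoding" mentioned above, the
idea is to hash each feature into a fixed number of buckets, so the
vector dimensionality is capped regardless of vocabulary size. A minimal
sketch of the general technique (not Mahout's encoder; names are my own):

```python
import hashlib
from collections import defaultdict

def hashed_encode(tokens, dim=2**18):
    """Encode tokens into a sparse vector of fixed dimensionality `dim`.
    Distinct tokens that hash to the same bucket collide, which is the
    'loss' discussed in the thread."""
    vec = defaultdict(float)
    for tok in tokens:
        # Stable hash of the token, reduced modulo the target dimension.
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    return dict(vec)

v = hashed_encode(["memory", "kmeans", "memory"])
```

With dim capped at, say, 2^18 ~ 260k buckets, the centroid memory bound
from Ted's formula drops accordingly, at the cost of occasional feature
collisions.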