On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <[email protected]> wrote:
> The problem is that the centroids are the average of many documents. This
> means that the number of non-zero elements in each centroid vector
> increases as the number of documents increases.

If we approximate each centroid by the input point nearest to it, then,
given a lot of input data, the centroids become real points (part of the
input dataset) instead of imaginary ones (averages). Some loss is incurred
here. Hashed encoding would be an easier solution, though the same or
similar loss is incurred there as well, due to collisions.

> On Mon, Feb 7, 2011 at 11:05 AM, Robin Anil <[email protected]> wrote:
>
> > We can prolly find the nearest centroid, instead of averaging it out.
> > This way the centroid vector won't grow big? What do you think about
> > that Ted, Jeff?
> >
> > On Fri, Feb 4, 2011 at 9:23 PM, Ted Dunning <[email protected]> wrote:
> >
> > > 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which
> > > will tend to become dense).
> > >
> > > I recommend you decrease your input dimensionality to 10^5 - 10^6.
> > > This could decrease your memory needs to 4GB at the low end.
> > >
> > > What kind of input do you have?
> > >
> > > On Fri, Feb 4, 2011 at 7:50 AM, james q <[email protected]> wrote:
> > >
> > > > I think the job had 5000 - 6000 clusters. The input (sparse)
> > > > vectors had a dimension of 6838856.
> > > >
> > > > -- james
> > > >
> > > > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > > How many clusters?
> > > > >
> > > > > How large is the dimension of your input data?
> > > > >
> > > > > On Thu, Feb 3, 2011 at 9:05 PM, james q <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > New user to Mahout and Hadoop here. Isabel Drost suggested to
> > > > > > a colleague I should post to the mahout user list, as I am
> > > > > > having some general difficulties with memory consumption and
> > > > > > KMeans clustering.
> > > > > >
> > > > > > So a general question first and foremost: what determines how
> > > > > > much memory a map task consumes during a KMeans clustering
> > > > > > job? Increasing the number of map tasks by adjusting
> > > > > > dfs.block.size and mapred.max.split.size doesn't seem to make
> > > > > > the map task consume less memory, or at least not a very
> > > > > > noticeable amount. I figured that if there are more map tasks,
> > > > > > each individual map task evaluates fewer input keys and hence
> > > > > > there would be less memory consumption. Is there any way to
> > > > > > predict the memory usage of map tasks in KMeans?
> > > > > >
> > > > > > The cluster I am running consists of 10 machines, each with 8
> > > > > > cores and 68G of RAM. I've configured the cluster to have each
> > > > > > machine run, at maximum, 7 map or reduce tasks. I set the map
> > > > > > and reduce tasks to have virtually no limit on memory
> > > > > > consumption ... so with 7 processes each, at around 9 - 10G
> > > > > > per process, the machines will crap out. I can reduce the
> > > > > > number of map tasks per machine, but something tells me that
> > > > > > that level of memory consumption is wrong.
> > > > > >
> > > > > > If any more information is needed to help debug this, please
> > > > > > let me know! Thanks!
> > > > > >
> > > > > > -- james
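To make Ted's arithmetic concrete: dense centroids cost one double (8
bytes) per dimension per cluster. This is just a back-of-envelope sketch
(illustrative Python, not Mahout code; the function name is my own):

```python
def centroid_memory_bytes(num_clusters, dimensionality, bytes_per_element=8):
    """Worst-case memory for k-means centroids once they become dense:
    one 8-byte double per dimension, per cluster."""
    return num_clusters * dimensionality * bytes_per_element

# The figures from the thread:
full = centroid_memory_bytes(5000, 6838856)   # original dimensionality
low = centroid_memory_bytes(5000, 10**5)      # after reducing to 10^5
print(f"{full / 1e9:.1f} GB")   # ~273.6 GB, matching Ted's 273GB estimate
print(f"{low / 1e9:.1f} GB")    # 4.0 GB, the "4GB at the low end"
```

This is why shrinking the input dimensionality helps so much: centroid
memory scales linearly with it, independent of how sparse the input
vectors are.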

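For readers unfamiliar with the "hashed encoding" mentioned above, the
idea is to hash each feature into a fixed number of buckets, so the
vector dimensionality is capped regardless of vocabulary size. A minimal
sketch of the general technique (not Mahout's encoder; names are my own):

```python
import hashlib
from collections import defaultdict

def hashed_encode(tokens, dim=2**18):
    """Encode tokens into a sparse vector of fixed dimensionality `dim`.
    Distinct tokens that hash to the same bucket collide, which is the
    'loss' discussed in the thread."""
    vec = defaultdict(float)
    for tok in tokens:
        # Stable hash of the token, reduced modulo the target dimension.
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    return dict(vec)

v = hashed_encode(["memory", "kmeans", "memory"])
```

With dim capped at, say, 2^18 ~ 260k buckets, the centroid memory bound
from Ted's formula drops accordingly, at the cost of occasional feature
collisions.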