Hello - I'm a Mahout newbie as well.
In the case outlined below, does that mean each node of a Hadoop cluster would need to hold the centroid information fully in memory for k-means, or is this spread over the cluster in some way? If each node has to hold the centroid information fully in memory, are there any other data structures which need to be fully in memory on each node, and if so, what are they proportional to (again, specifically for k-means)? I.e., is anything memory-resident related to the number of documents?

If the centroid information (proportional to the number of features and clusters) needs to be fully in memory on all Hadoop nodes, but nothing related to the number of documents does, then k-means would be scalable in the number of documents (just add more Hadoop nodes to increase document throughput) but *not* scalable in the number of clusters/features, since the algorithm requires a full copy of this information on each node. Is this accurate?

-- marc

> 5000 x 6838856 x 8 = 273GB of memory just for the centroids (which will
> tend to become dense).
>
> I recommend you decrease your input dimensionality to 10^5 - 10^6. This
> could decrease your memory needs to 4GB at the low end.
>
> What kind of input do you have?
>
> On Fri, Feb 4, 2011 at 7:50 AM, james q <[email protected]> wrote:
>
> > I think the job had 5000 - 6000 clusters. The input (sparse) vectors
> > had a dimension of 6838856.
> >
> > -- james
> >
> > On Fri, Feb 4, 2011 at 1:55 AM, Ted Dunning <[email protected]> wrote:
> >
> > > How many clusters?
> > >
> > > How large is the dimension of your input data?
> > >
> > > On Thu, Feb 3, 2011 at 9:05 PM, james q <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > New user to Mahout and Hadoop here. Isabel Drost suggested to a
> > > > colleague that I should post to the mahout user list, as I am having
> > > > some general difficulties with memory consumption and KMeans
> > > > clustering.
> > > > So a general question first and foremost: what determines how much
> > > > memory a map task consumes during a KMeans clustering job? Increasing
> > > > the number of map tasks by adjusting dfs.block.size and
> > > > mapred.max.split.size doesn't seem to make each map task consume less
> > > > memory, or at least not by a very noticeable amount. I figured that if
> > > > there are more map tasks, each individual map task evaluates fewer
> > > > input keys and hence would consume less memory. Is there any way to
> > > > predict the memory usage of map tasks in KMeans?
> > > >
> > > > The cluster I am running consists of 10 machines, each with 8 cores
> > > > and 68G of RAM. I've configured the cluster so that each machine runs
> > > > at most 7 map or reduce tasks. I set the map and reduce tasks to have
> > > > virtually no limit on memory consumption ... so with 7 processes each,
> > > > at around 9 - 10G per process, the machines will crap out. I can
> > > > reduce the number of map tasks per machine, but something tells me
> > > > that that level of memory consumption is wrong.
> > > >
> > > > If any more information is needed to help debug this, please let me
> > > > know! Thanks!
> > > >
> > > > -- james
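For what it's worth, Ted's 273GB figure is just clusters x dimensionality x 8 bytes per double. A back-of-the-envelope sketch of that arithmetic (assuming dense double-precision centroids, which Ted notes they tend to become):

```python
def centroid_memory_bytes(num_clusters, num_features, bytes_per_value=8):
    """Rough memory needed to hold all k-means centroids as dense doubles."""
    return num_clusters * num_features * bytes_per_value

# james's job: 5000 clusters over 6838856-dimensional vectors
print(centroid_memory_bytes(5000, 6838856) / 1e9)  # ~273 GB

# After reducing dimensionality to 10^5, as Ted suggests:
print(centroid_memory_bytes(5000, 10**5) / 1e9)    # 4 GB
```

This also shows why splitting the input into more map tasks doesn't help: the centroid array is a fixed cost per task, independent of how many input vectors each task sees.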

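Regarding marc's scalability question: as I understand it, Mahout's k-means job follows the standard MapReduce formulation, where every mapper holds the full set of centroids in memory and streams input vectors one at a time. So per-node memory scales with clusters x features, but not with the number of documents. A minimal sketch of that pattern in plain Python (not Mahout's actual API; `kmeans_map` and `kmeans_reduce` are hypothetical names):

```python
def kmeans_map(vectors, centroids):
    """One map task: all centroids resident in memory; vectors streamed.

    Emits (cluster_index, (vector, count)) for each vector's closest
    centroid. Memory is O(clusters * dims), independent of input size."""
    for v in vectors:  # streamed, so only one vector is resident at a time
        best = min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(v, centroids[i])))
        yield best, (v, 1)

def kmeans_reduce(pairs, num_clusters, dim):
    """Reduce: sum assigned vectors per cluster, then recompute centroids."""
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for i, (v, n) in pairs:
        counts[i] += n
        for d in range(dim):
            sums[i][d] += v[d]
    # Empty clusters keep their zero sums; everything else is averaged.
    return [[s / counts[i] for s in sums[i]] if counts[i] else sums[i]
            for i in range(num_clusters)]
```

One iteration on a toy dataset: `kmeans_reduce(kmeans_map([[1.0], [2.0], [9.0], [11.0]], [[0.0], [10.0]]), 2, 1)` pulls the two centroids toward the data. If this is indeed the structure of the job, then yes - it scales in documents but not in clusters/features, exactly as marc suspects.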