Looking at ClusterDumper's readPoints, it reads all of the points into main 
memory so that it can output the points for each cluster in the report. 
This clearly won't scale, but does anybody really want a huge dataset written 
to their console? If you want to sort your points by cluster membership, I'd 
suggest writing a simple job that reads the WeightedVectorWritables from 
clusteredPoints and writes them out to a folder for each clusterId.
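A rough sketch of such a job (untested; it assumes the usual (IntWritable 
clusterId, WeightedVectorWritable point) layout of the clusteredPoints 
SequenceFiles, and the class and path names are just placeholders) would 
stream one point at a time and append it to a per-cluster file:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class SplitPointsByCluster {
  public static void main(String[] args) throws Exception {
    Path clusteredPoints = new Path(args[0]);   // e.g. <output>/clusteredPoints
    Path outDir = new Path(args[1]);            // one file per clusterId ends up here

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<Integer, SequenceFile.Writer> writers = new HashMap<Integer, SequenceFile.Writer>();

    // Stream every (clusterId, point) pair; only one record is in memory at a time.
    for (FileStatus part : fs.globStatus(new Path(clusteredPoints, "part-*"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      IntWritable clusterId = new IntWritable();
      WeightedVectorWritable point = new WeightedVectorWritable();
      while (reader.next(clusterId, point)) {
        SequenceFile.Writer writer = writers.get(clusterId.get());
        if (writer == null) {
          writer = SequenceFile.createWriter(fs, conf,
              new Path(outDir, "cluster-" + clusterId.get()),
              IntWritable.class, WeightedVectorWritable.class);
          writers.put(clusterId.get(), writer);
        }
        writer.append(clusterId, point);
      }
      reader.close();
    }
    for (SequenceFile.Writer writer : writers.values()) {
      writer.close();
    }
  }
}

You'd then only dump the folders you actually care about. Note this keeps one 
open writer per cluster, which is fine for a modest k but would need batching 
for thousands of clusters.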

Alternatively, the ClusterDumper could be rewritten to perform multiple passes 
over the clusteredPoints, one for each cluster.
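A single pass of that kind could look roughly like this (same caveats and 
layout assumptions as the sketch above; getVector()/asFormatString() are used 
here just to have something to print). ClusterDumper would repeat it once per 
cluster id instead of caching everything up front:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class PrintOneCluster {
  public static void main(String[] args) throws Exception {
    Path clusteredPoints = new Path(args[0]);       // e.g. <output>/clusteredPoints
    int targetCluster = Integer.parseInt(args[1]);  // the cluster to dump on this pass

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus part : fs.globStatus(new Path(clusteredPoints, "part-*"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      IntWritable clusterId = new IntWritable();
      WeightedVectorWritable point = new WeightedVectorWritable();
      while (reader.next(clusterId, point)) {
        // Print only the points belonging to the cluster for this pass.
        if (clusterId.get() == targetCluster) {
          System.out.println(point.getVector().asFormatString());
        }
      }
      reader.close();
    }
  }
}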

-----Original Message-----
From: Jeffrey [mailto:[email protected]] 
Sent: Friday, August 12, 2011 8:19 AM
To: [email protected]
Subject: Re: Clusterdumper OOM

I am having the exact same problem when trying to dump results for fkmeans :) 
I'm thinking of doing the degree-of-membership calculation manually for each 
point if I still can't find a workaround :/



>________________________________
>From: "Sengupta, Sohini IN BLR SISL" <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Friday, August 12, 2011 10:37 PM
>Subject: Clusterdumper OOM
>
>Hi,
>I get an out-of-memory error every time I try to include "-pointsDir" in the 
>list of parameters to Clusterdumper. Is there any other way to read the 
>points belonging to the clusters without increasing the heap size? Any 
>suggestions? I have already tried increasing JAVA_HEAP_MAX and 
>MAHOUT_HEAPSIZE in bin/mahout, but it is not helping.
>
>Thanks and regards,
>Sohini
