You are beginning to exit the realm of reasonable applicability for normal k-means algorithms here.
How much data do you have?

On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
> Hi All,
>
> When I run KMeans clustering on a cluster, I notice that when I have
> "large" values for k (i.e. approx. >1000) I get loads of Hadoop write
> errors:
>
> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>
> This continues indefinitely, and lots of part-0xxxxx files of around
> 30 KB each are produced.
>
> If I reduce the value of k it runs fine. Furthermore, if I run it in
> local mode with high values of k it also runs fine.
>
> The command I am using is as follows:
>
> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> --clusters tmp -dm
> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
> 1.0 -x 20 -cl -k 10000
>
> I am running Mahout 0.7.
>
> Are there some performance parameters I need to tune for Mahout when
> dealing with large volumes of data?
>
> Thanks,
> Colum
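For context on why large k hurts: every Lloyd iteration computes a distance from each of the n input points to each of the k centroids, so the work per pass (and the centroid state every mapper must hold) grows linearly with k. A minimal pure-Python sketch of plain Lloyd's k-means illustrates that per-iteration structure (an illustration only, not Mahout's MapReduce implementation; the toy data and k=2 here are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm. Each iteration does an n x k distance
    computation, so cost per pass scales linearly with k."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean
        # distance (the same measure used in the command above).
        clusters = [[] for _ in range(k)]
        for p in points:
            _, idx = min(
                (sum((a - b) ** 2 for a, b in zip(p, c)), i)
                for i, c in enumerate(centroids)
            )
            clusters[idx].append(p)
        # Update step: move each non-empty centroid to its cluster mean.
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroids[i] = tuple(
                    sum(m[j] for m in members) / len(members)
                    for j in range(dim)
                )
    return centroids

# Two well-separated blobs; k=2 recovers their means.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
cents = kmeans(pts, 2)
```

With k = 10000 the assignment step alone is 10000 distance computations per point per iteration, which is why a k that runs comfortably at a few hundred can overwhelm the same cluster at ten thousand.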
