Hi All,

When I run KMeans clustering on a cluster, I notice that with large values of k (roughly k > 1000) I get a lot of Hadoop write errors:
INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel

This continues indefinitely, and many part-0xxxxx files of around 30 KB each are produced. If I reduce the value of k, it runs fine. Furthermore, if I run it in local mode with high values of k, it also runs fine.

The command I am using is as follows:

mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults --clusters tmp -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -x 20 -cl -k 10000

I am running Mahout 0.7. Are there performance parameters I need to tune in Mahout when dealing with large volumes of data?

Thanks,
Colum
