KMeans Throwing Hadoop write errors for large values of K

Colum Foley Fri, 08 Mar 2013 06:46:57 -0800

Hi All,

When I run KMeans clustering on a cluster, I notice that with
"large" values of k (i.e. roughly k > 1000) I get many Hadoop write
errors:


 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting
for channel to be ready for read. ch : java.nio.channels.SocketChannel

This continues indefinitely, and many part-0xxxxx files of around
30 KB each are produced.

If I reduce the value of k, it runs fine. Furthermore, if I run it in
local mode with high values of k, it also runs fine.

The command I am using is as follows:

mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
--clusters tmp -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
1.0 -x 20 -cl -k 10000

I am running Mahout 0.7.

Are there some performance parameters I need to tune for Mahout when
dealing with large volumes of data?

Thanks,
Colum
