You are beginning to exit the realm of reasonable applicability for normal
k-means algorithms here.

How much data do you have?

On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:

> Hi All,
>
> When I run KMeans clustering on a cluster, I notice that when I have
> "large" values for k (i.e., approx. >1000) I get loads of Hadoop write
> errors:
>
>  INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>
> This continues indefinitely, and lots of part-0xxxxx files of around
> 30 KB each are produced.
>
> If I reduce the value for k it runs fine. Furthermore, if I run it in
> local mode with high values of k, it also runs fine.
>
> The command I am using is as follows:
>
> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> --clusters tmp -dm
> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
> 1.0 -x 20 -cl -k 10000
>
> I am running mahout 0.7.
>
> Are there some performance parameters I need to tune for mahout when
> dealing with large volumes of data?
>
> Thanks,
> Colum
>
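
[Editorial note, not part of the original thread: the 69000 ms timeout in
the quoted stack trace is the HDFS client/datanode socket timeout. One
commonly suggested workaround for SocketTimeoutException during block
writes is raising those timeouts in hdfs-site.xml. This is a hedged
sketch, not a fix confirmed in this thread; the property names below are
the Hadoop 1.x-era ones and should be verified against your Hadoop
version.]

```xml
<!-- hdfs-site.xml, on clients and datanodes.
     Assumed Hadoop 1.x-era property names; values are in milliseconds.
     Defaults are typically 60000 (read) and 480000 (write). -->
<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>180000</value>
</property>
```

Note that this only papers over the symptom; with k = 10000 the underlying
load on the cluster from k-means itself remains the real issue.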
