I have approximately 20 million items, and a feature vector approximately 30 million
in length, very sparse.
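
As a rough back-of-envelope (my own arithmetic, using the figures above): even when the input vectors are sparse, k-means centroids are averages over many points and tend to come out dense, so centroid storage alone is already enormous at this scale, assuming double-precision dense centroids:

```python
# Back-of-envelope only; assumes dense double-precision centroids.
dims = 30_000_000            # feature vector length
k = 10_000                   # requested number of clusters
bytes_per_double = 8

per_centroid_bytes = dims * bytes_per_double   # one dense centroid
total_bytes = k * per_centroid_bytes           # all centroids together

print(per_centroid_bytes)    # 240000000      (~240 MB per centroid)
print(total_bytes)           # 2400000000000  (~2.4 TB for all centroids)
```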

Would you have any suggestions for other clustering algorithms I should look at?

Thanks,
Colum 

On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:

> You are beginning to exit the realm of reasonable applicability for normal
> k-means algorithms here.
> 
> How much data do you have?
> 
> On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
> 
>> Hi All,
>> 
>> When I run KMeans clustering on a cluster, I notice that when I have
>> "large" values for k (i.e. approx. >1000) I get loads of Hadoop write
>> errors:
>> 
>> INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>> 
>> This continues indefinitely, and lots of part-0xxxxx files of around
>> 30 KB each are produced.
>> 
>> If I reduce the value of k it runs fine. Furthermore, if I run it in
>> local mode with high values of k it also runs fine.
>> 
>> The command I am using is as follows:
>> 
>> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
>> --clusters tmp -dm
>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
>> 1.0 -x 20 -cl -k 10000
>> 
>> I am running mahout 0.7.
>> 
>> Are there some performance parameters I need to tune for mahout when
>> dealing with large volumes of data?
>> 
>> Thanks,
>> Colum
>> 
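
The 69000 ms in the quoted SocketTimeoutException is the HDFS client's socket timeout expiring while the write pipeline is stalled, which fits a job where many reducers each write many small cluster files at once. One hedged thing to try, assuming Mahout 0.7 forwards generic Hadoop `-D` options through ToolRunner and a Hadoop 1.x cluster where `dfs.socket.timeout` / `dfs.datanode.socket.write.timeout` are the relevant property names (they differ in later Hadoop versions):

```shell
# Sketch only: raise the HDFS client read/write timeouts before retrying.
# Property names are an assumption and vary by Hadoop version.
mahout kmeans \
  -Ddfs.socket.timeout=180000 \
  -Ddfs.datanode.socket.write.timeout=180000 \
  -i FeatureVectorsMahoutFormat -o ClusterResults \
  --clusters tmp \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 1.0 -x 20 -cl -k 10000
```

Whether this helps depends on why the pipeline stalls; raising timeouts papers over cluster load rather than removing it.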
