Why not change the clusterID from int to long?
I have a dataset of about 30 billion rows. When I use createCanopyFromVectors in
mean shift, the clusterID is not big enough: an int can only represent values up
to 2,147,483,647, far fewer than 30 billion rows.
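
To make the numbers concrete, here is a tiny standalone check (plain Java
arithmetic, not Mahout code; the 30 billion figure is just my data size):

public class IdRangeCheck {
  public static void main(String[] args) {
    long rows = 30000000000L;                        // roughly my row count
    System.out.println(Integer.MAX_VALUE);           // 2147483647, the ceiling for an int clusterID
    System.out.println(rows > Integer.MAX_VALUE);    // true: an int cannot give every row a unique ID
    System.out.println(Long.MAX_VALUE);              // 9223372036854775807, plenty of headroom with long
  }
}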
Second, in the MeanShiftCanopyCreatorMapper class, the setup() method computes

  nextCanopyId = ((1 << 31) / 50000) * (Integer.parseInt(parts[4])%50000);

which gives each map task only about (2^31) / 50000 = 42,949 canopy IDs. That is
not big enough: the default Hadoop block size is 64 MB, and a single split can
easily contain more rows than that.
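
Here is the arithmetic as a standalone sketch (the 50000 divisor comes from the
quoted setup() line; the 1 KB average row size is only my own assumption for
illustration):

public class CanopyIdBudget {
  public static void main(String[] args) {
    long idsPerMapper = (1L << 31) / 50000;          // 42949 canopy IDs reserved per map task
    System.out.println(idsPerMapper);

    long blockSize = 64L * 1024 * 1024;              // default HDFS block size, 64 MB
    long assumedRowBytes = 1024;                     // hypothetical 1 KB per row
    System.out.println(blockSize / assumedRowBytes); // 65536 rows in one split, already more than the ID budget
  }
}

So even with fairly large rows, one mapper can run out of IDs before it finishes
its split.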

 
