Why not change the clusterID from int to long? I have a dataset of about 30 billion rows, and when I used createCanopyFromVectors in mean shift, the cluster ID (an int) was not big enough.

Second, in the MeanShiftCanopyCreatorMapper class, the setup() method computes nextCanopyId = ((1 << 31) / 50000) * (Integer.parseInt(parts[4]) % 50000); which leaves each map task only about 43,000 IDs. That is not big enough either: Hadoop's default block size is 64 MB, and a single block can easily contain more than 50,000 rows.
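To make the ID-space limit concrete, here is a minimal sketch (my own illustration, not Mahout's actual code) comparing the per-mapper ID range under the quoted int formula with a long-based variant. The class name, the 1 KB-per-row figure, and everything except the 50,000 divisor from the quoted line are assumptions.

// Sketch only (assumed names, not Mahout code): how many canopy IDs each mapper
// gets under the quoted int formula versus a long-based variant.
public class CanopyIdSpaceSketch {

  private static final int MAPPER_SLOTS = 50000; // divisor from the quoted formula

  public static void main(String[] args) {
    // In the quoted expression, 1 << 31 overflows to Integer.MIN_VALUE in Java,
    // so the per-mapper stride is actually negative; its magnitude is what matters here.
    int strideAsWritten = (1 << 31) / MAPPER_SLOTS;      // -42949
    int usableIntIds = Integer.MAX_VALUE / MAPPER_SLOTS; // 42949 IDs per mapper

    // With a long ID, the same partitioning scheme gives each mapper a huge range.
    long usableLongIds = Long.MAX_VALUE / MAPPER_SLOTS;  // ~1.8e14 IDs per mapper

    System.out.println("stride as written (int, overflowed): " + strideAsWritten);
    System.out.println("usable int IDs per mapper:           " + usableIntIds);
    System.out.println("usable long IDs per mapper:          " + usableLongIds);

    // Assuming rows of roughly 1 KB, a 64 MB block holds ~65,536 rows, which
    // already exceeds the ~43,000 int IDs available to a single mapper.
    System.out.println("rows in a 64 MB block at 1 KB/row:   " + (64L * 1024 * 1024 / 1024));
  }
}

Either way, the usable range per mapper is on the order of 43,000 IDs, so any split with more canopies than that will spill into another mapper's range; a long-based ID would remove that ceiling.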
