Thanks for the quick response! I will do some testing tomorrow with various numbers of clusters and create a JIRA once I have those results. I might be able to contribute a patch for this if I have the time.
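In the meantime, here's a rough, Hadoop-free sketch of the partitioner logic I have in mind for suggestion (b) below. All class and method names here are mine, not Mahout's; the idea is just to build a dense cluster-id to reducer-index map up front so every non-empty cluster gets its own reducer:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of suggestion (b): give every non-empty cluster its own reducer
// by mapping cluster IDs to a dense index 0..k-1 before the job runs.
// In a real patch this logic would live in a subclass of Hadoop's
// org.apache.hadoop.mapreduce.Partitioner; the class and method names
// here are hypothetical, and the logic is shown without Hadoop
// dependencies so it's easy to test.
public class ClusterIdPartitioner {

    private final Map<Integer, Integer> clusterToPartition = new HashMap<>();

    // clusterIds: the non-empty cluster IDs found after k-means
    // (e.g. 3740844, 3741044, ...); insertion order defines the index.
    public ClusterIdPartitioner(int[] clusterIds) {
        for (int id : clusterIds) {
            clusterToPartition.putIfAbsent(id, clusterToPartition.size());
        }
    }

    // Mirrors Partitioner.getPartition(key, value, numPartitions).
    // Each distinct cluster ID gets a distinct reducer, so no reducer
    // goes empty and no two clusters share a part-r-* file.
    public int getPartition(int clusterId, int numPartitions) {
        return clusterToPartition.get(clusterId) % numPartitions;
    }

    public static void main(String[] args) {
        int[] ids = {3740844, 3741044, 3741140, 3741161, 3741235};
        ClusterIdPartitioner p = new ClusterIdPartitioner(ids);
        for (int id : ids) {
            System.out.println("VL-" + id + " -> reducer " + p.getPartition(id, ids.length));
        }
    }
}
```

An actual patch would also have to set the reducer count to the number of non-empty clusters and make the id-to-index map available to every partitioner instance, but the mapping above is the core of it.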
On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan <[email protected]> wrote:

> "So if that's correct, is that what's happening to me? Half of my
> clusters are being sent to the overlapping reducers? That seems like a
> big issue, making clusterpp pretty much useless for my purposes. I
> can't have documents randomly being sent to the wrong cluster's
> directory, especially not 50+% of them."
>
> This might be correct. I think this can occur when the number of clusters
> is large; testing was not done with that many clusters.
> Can you help a bit by testing some scenarios?
>
> a) Try reducing the number of clusters to 100 and then 50. The goal is to
> find the breaking point (the number of clusters) past which clusters start
> colliding on the same reducers. If we find it, we can be sure the problem
> lies in the partitioner.
> b) If you want, try a different partitioner. The idea is to create as many
> reducer tasks as there are (non-empty) clusters, so that the vectors of
> each cluster land in a separate file and can later be moved to their
> respective directories (named after the cluster id).
>
> Please also create a JIRA for this:
> https://issues.apache.org/jira/browse/MAHOUT
> And if you are interested, this would be a good starting point for
> contributing to Mahout as well.
>
> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>
>> First off, thank you everyone for your help so far. This mailing list
>> has been a great help getting me up and running with Mahout.
>>
>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>> Then I'm using clusterpp to split the documents up into directories
>> containing the vectors belonging to each cluster. After I perform the
>> clustering, clusterdump shows that each cluster has between ~800 and
>> ~200,000 documents. This isn't a great spread, but the point is that
>> none of the clusters are empty.
>>
>> Here are my commands:
>>
>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters \
>>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
>>   -cd 0.05 -k 300 -x 15 -cl -ow
>>
>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>>
>> bin/mahout clusterpp -i pca-clusters -o bottom
>>
>> Since none of my clusters are empty, I would expect clusterpp to
>> create 300 directories in "bottom", one for each cluster. Instead,
>> only 147 directories are created. The other 153 outputs are just empty
>> part-r-* files sitting in the "bottom" directory.
>>
>> I haven't found much information when searching on this issue, but I
>> did come across one mailing list post from a while back:
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>
>> In that discussion someone said, "If that is the only thing that is
>> contained in the part-r-* file [it had no vectors], then the reducer
>> responsible for writing to that part-r-* file did not receive any
>> input records to write. This happens because the program uses the
>> default hash partitioner, which sometimes maps records belonging to
>> different clusters to the same reducer, thus leaving some reducers
>> without any input records."
>>
>> So if that's correct, is that what's happening to me? Half of my
>> clusters are being sent to overlapping reducers? That seems like a
>> big issue, making clusterpp pretty much useless for my purposes. I
>> can't have documents randomly being sent to the wrong cluster's
>> directory, especially not 50+% of them.
>>
>> One final detail: I'm not sure if this matters, but the clusters
>> output by kmeans are not numbered 1 to 300. They have an odd-looking,
>> nonsequential numbering scheme.
>> The first 5 clusters are:
>>
>> VL-3740844
>> VL-3741044
>> VL-3741140
>> VL-3741161
>> VL-3741235
>>
>> I haven't done much with kmeans before, so I wasn't sure whether this
>> was expected behavior or not.
>>
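As a quick sanity check on the hash-partitioner explanation quoted above, the default behavior is easy to simulate outside Hadoop. The cluster ids below are random stand-ins for my nonsequential VL-* ids, not real data:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Simulates the quoted explanation. Hadoop's default HashPartitioner
// computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and
// an int key's hashCode is just the int itself, so nonsequential
// cluster ids can easily collide on the same reducer while other
// reducers get nothing.
public class HashPartitionDemo {

    // Same formula as Hadoop's default HashPartitioner for int keys.
    static int defaultPartition(int clusterId, int numReduceTasks) {
        return (clusterId & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numClusters = 300;
        Random rnd = new Random(42); // fixed seed for repeatability
        Set<Integer> reducersWithInput = new HashSet<>();
        for (int i = 0; i < numClusters; i++) {
            // Random stand-in ids in the same range as VL-3740844 etc.
            int clusterId = 3_700_000 + rnd.nextInt(1_000_000);
            reducersWithInput.add(defaultPartition(clusterId, numClusters));
        }
        // With 300 keys hashed into only 300 buckets, a sizable fraction
        // of the buckets (roughly 1/e for random keys) receives nothing.
        System.out.println("Reducers that received input: "
                + reducersWithInput.size() + " of " + numClusters);
    }
}
```

If the real ids behave like this, it would explain empty part-r-* files even though no cluster is empty: the missing directories' vectors aren't lost, they're sharing part files with other clusters.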
