I've done some more testing and submitted a JIRA: https://issues.apache.org/jira/browse/MAHOUT-1103
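For anyone arriving here from the JIRA: the thread below traces the empty part-r-* files to Hadoop's default hash partitioner. As a minimal sketch of the fix direction suggested further down (a dedicated reducer per non-empty cluster), something like the following could work. It assumes clusterpp keys its reduce input by integer cluster ID as an IntWritable; the class name and the "clusters.ids" property are hypothetical, and this is not the actual MAHOUT-1103 patch.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Partitioner;

    /* Sketch: give every known cluster its own reducer. The driver is
       assumed to set "clusters.ids" to a comma-separated list of the
       non-empty cluster IDs and to request one reduce task per ID. */
    public class ClusterIdPartitioner extends Partitioner<IntWritable, Writable>
        implements Configurable {

      private Configuration conf;
      private final Map<Integer, Integer> idToPartition = new HashMap<>();

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        int index = 0;
        // Build a dense cluster-ID -> partition index once per task.
        for (String id : conf.get("clusters.ids", "").split(",")) {
          if (!id.trim().isEmpty()) {
            idToPartition.put(Integer.parseInt(id.trim()), index++);
          }
        }
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      @Override
      public int getPartition(IntWritable clusterId, Writable value, int numPartitions) {
        // Known cluster IDs get a dedicated partition; unknown IDs fall
        // back to hashing so the job still completes.
        Integer partition = idToPartition.get(clusterId.get());
        return partition != null
            ? partition % numPartitions
            : (clusterId.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The driver would enumerate the non-empty cluster IDs from the final clusters directory, set "clusters.ids" accordingly, and request one reduce task per listed ID.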
On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
> Thanks for the quick response!
>
> I will do some testing tomorrow with various numbers of clusters and
> create a JIRA once I have those results. I might be able to contribute
> a patch for this if I have the time.
>
> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
> <[email protected]> wrote:
>> "So if that's correct, is that what's happening to me? Half of my
>> clusters are being sent to the overlapping reducers? That seems like a
>> big issue, making clusterpp pretty much useless for my purposes. I
>> can't have documents randomly being sent to the wrong cluster's
>> directory, especially not 50+% of them."
>>
>> This might be correct. I think this can occur when the number of
>> clusters is large; the testing was not done with this many clusters.
>> Can you help a bit by testing some scenarios?
>>
>> a) Try reducing the number of clusters to 100 and then 50. The goal is
>> to find the breaking point (the number of clusters) after which
>> clusters start colliding. If we find it, we can be fairly sure the
>> problem lies in the partitioner.
>> b) If you want, try a different partitioner. The idea is to create as
>> many reducer tasks as there are (non-empty) clusters, so that the
>> vectors in each cluster end up in a separate file and can later be
>> moved to their respective directories (named after the cluster ID).
>>
>> Please also create a JIRA for this:
>> https://issues.apache.org/jira/browse/MAHOUT
>> And if you are interested, this would also be a good starting point
>> for contributing to Mahout.
>>
>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>
>>> First off, thank you everyone for your help so far. This mailing list
>>> has been a great help getting me up and running with Mahout.
>>>
>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>> Then I'm using clusterpp to split the documents up into directories
>>> containing the vectors belonging to each cluster. After I perform the
>>> clustering, clusterdump shows that each cluster has between ~800 and
>>> ~200,000 documents. This isn't a great spread, but the point is that
>>> none of the clusters are empty.
>>>
>>> Here are my commands:
>>>
>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>> -k 300 -x 15 -cl -ow
>>>
>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>>>
>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>
>>> Since none of my clusters are empty, I would expect clusterpp to
>>> create 300 directories in "bottom", one for each cluster. Instead,
>>> only 147 directories are created. The other 153 outputs are just
>>> empty part-r-* files sitting in the "bottom" directory.
>>>
>>> I haven't found much information when searching on this issue, but I
>>> did come across one mailing list post from a while back:
>>>
>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>
>>> In that discussion someone said, "If that is the only thing that is
>>> contained in the part-r-* file [it had no vectors], then the reducer
>>> responsible for writing to that part-r-* file did not receive any
>>> input records. This happens because the program uses the default hash
>>> partitioner, which sometimes maps records belonging to different
>>> clusters to the same reducer, thus leaving some reducers without any
>>> input records."
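To make the quoted explanation concrete: Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and IntWritable.hashCode() is simply the wrapped value. A toy check, assuming clusterpp keys records by integer cluster ID, using the five cluster IDs listed at the end of this message:

    import org.apache.hadoop.io.IntWritable;

    /* Shows which reducer the default partitioner would pick for a few
       real cluster IDs from this thread, given 300 reduce tasks. */
    public class PartitionCollisionDemo {
      public static void main(String[] args) {
        int numReduceTasks = 300;
        int[] clusterIds = {3740844, 3741044, 3741140, 3741161, 3741235};
        for (int id : clusterIds) {
          int partition =
              (new IntWritable(id).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
          System.out.printf("cluster VL-%d -> reducer %d%n", id, partition);
        }
      }
    }

These five happen to land on distinct reducers, but with 300 arbitrary IDs spread over only 300 reducers, birthday-style collisions are essentially guaranteed, and every collision leaves some other reducer with no input at all.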
>>> So if that's correct, is that what's happening to me? Are half of my
>>> clusters being sent to overlapping reducers? That seems like a big
>>> issue, making clusterpp pretty much useless for my purposes. I can't
>>> have documents randomly being sent to the wrong cluster's directory,
>>> especially not 50+% of them.
>>>
>>> One final detail: I'm not sure if this matters, but the clusters
>>> output by kmeans are not numbered 1 to 300. They have odd-looking,
>>> nonsequential IDs. The first 5 clusters are:
>>> VL-3740844
>>> VL-3741044
>>> VL-3741140
>>> VL-3741161
>>> VL-3741235
>>>
>>> I haven't done much with kmeans before, so I wasn't sure whether this
>>> was unexpected behavior or not.
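A back-of-the-envelope estimate is consistent with the partitioner theory: if 300 distinct cluster IDs hashed uniformly into 300 partitions, each partition would be empty with probability (1 - 1/300)^300 ≈ e^-1, so roughly 110 of the 300 reducers would receive nothing. The 153 empty part-r-* files observed above are in that ballpark (real IDs need not hash uniformly, so more collisions are plausible). Uniform hashing is the only assumption here:

    /* Expected number of empty reducers when k distinct keys hash
       uniformly into n partitions: n * (1 - 1/n)^k. */
    public class EmptyReducerEstimate {
      public static void main(String[] args) {
        int n = 300;  // reduce tasks
        int k = 300;  // distinct cluster IDs
        double expectedEmpty = n * Math.pow(1.0 - 1.0 / n, k);
        System.out.printf("expected empty reducers: %.1f of %d%n", expectedEmpty, n);
        // Prints ~110.2 for n = k = 300.
      }
    }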
