"So if that's correct, is that what's happening to me? Half of my
clusters are being sent to the overlapping reducers? That seems like a
big issue, making clusterpp pretty much useless for my purposes. I
can't have documents randomly being sent to the wrong cluster's
directory, especially not 50+% of them."

This may well be what is happening. I think it can occur when the number of
clusters is large; clusterpp was not tested with this many clusters.
Could you help by testing a few scenarios?

a) Try reducing the number of clusters to 100 and then to 50. The goal is to
find the breaking point (the number of clusters) beyond which clusters start
colliding on the same reducer. If we find one, we can be sure that the
problem lies in the partitioner.
b) If you want, try a different partitioner. The idea is to create as many
reducer tasks as there are (non-empty) clusters, so that the vectors of each
cluster are written to a separate file and later moved to their respective
directories (named after the cluster id).
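
To illustrate why a hash-based partitioner can leave reducers empty even
when there are as many reducers as clusters, here is a minimal sketch in
plain Java (no Hadoop dependency). The arithmetic in hashPartition mirrors
Hadoop's default HashPartitioner. Everything else is an assumption made for
illustration: the class and method names are hypothetical, the key's hash is
assumed to equal the integer cluster id (as it would for an IntWritable
key), and I have not checked which key type clusterpp actually emits. The
ids are the five quoted in the original message. The denseIndex method
shows the alternative idea from (b): give each distinct cluster id its own
reducer index.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder.
    public static int hashPartition(int keyHash, int numReducers) {
        return (keyHash & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many distinct reducers receive at least one cluster.
    // ASSUMPTION: the key's hash code is the integer cluster id itself.
    public static int occupiedReducers(int[] clusterIds, int numReducers) {
        Set<Integer> used = new HashSet<>();
        for (int id : clusterIds) {
            used.add(hashPartition(id, numReducers));
        }
        return used.size();
    }

    // Hypothetical alternative: map each distinct cluster id to its rank
    // in sorted order, so n clusters occupy exactly reducers 0..n-1.
    public static Map<Integer, Integer> denseIndex(int[] clusterIds) {
        Map<Integer, Integer> index = new TreeMap<>();
        for (int id : clusterIds) {
            index.put(id, 0);
        }
        int next = 0;
        for (Map.Entry<Integer, Integer> entry : index.entrySet()) {
            entry.setValue(next++);
        }
        return index;
    }

    public static void main(String[] args) {
        // The five cluster ids quoted in the original message.
        int[] ids = {3740844, 3741044, 3741140, 3741161, 3741235};
        int n = ids.length;

        System.out.println("hash partitioning:  "
                + occupiedReducers(ids, n) + " of " + n + " reducers get data");
        System.out.println("dense partitioning: "
                + new HashSet<>(denseIndex(ids).values()).size()
                + " of " + n + " reducers get data");
    }
}
```

With these five ids and five reducers, the hash scheme uses only three
reducers (two pairs of ids share a remainder), while the dense mapping uses
all five. A real partitioner along these lines would need the list of
cluster ids to be available to every task, for example through the job
configuration.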

Please also create a JIRA issue for this:
https://issues.apache.org/jira/browse/MAHOUT
If you are interested, this would also be a good starting point for
contributing to Mahout.

On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:

> First off, thank you everyone for your help so far. This mailing list
> has been a great help getting me up and running with Mahout.
>
> Right now, I'm clustering a set of ~3M documents into 300 clusters.
> Then I'm using clusterpp to split the documents up into directories
> containing the vectors belonging to each cluster. After I perform the
> clustering, clusterdump shows that each cluster has between ~800 and
> ~200,000 documents. This isn't a great spread, but the point is that
> none of the clusters are empty.
>
> Here are my commands:
>
> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
> -k 300 -x 15 -cl -ow
>
> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>
> bin/mahout clusterpp -i pca-clusters -o bottom
>
>
> Since none of my clusters are empty, I would expect clusterpp to
> create 300 directories in "bottom", one for each cluster. Instead,
> only 147 directories are created. The other 153 outputs are just empty
> part-r-* files sitting in the "bottom" directory.
>
> I haven't found too much information when searching on this issue but
> I did come across one mailing list post from a while back:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>
> In that discussion someone said, "If that is the only thing that is
> contained in the part-r-* file [it had no vectors], then the reducer
> responsible to write to that part-r-* file did not receive any input
> records to write to it. This happens because the program uses the
> default hash partitioner which sometimes maps records belonging to
> different clusters to a same reducer; thus leaving some reducers
> without any input records."
>
> So if that's correct, is that what's happening to me? Half of my
> clusters are being sent to the overlapping reducers? That seems like a
> big issue, making clusterpp pretty much useless for my purposes. I
> can't have documents randomly being sent to the wrong cluster's
> directory, especially not 50+% of them.
>
> One final detail: I'm not sure if this matters, but the clusters
> output by kmeans are not numbered 1 to 300. They have an odd looking,
> nonsequential numbering sequence. The first 5 clusters are:
> VL-3740844
> VL-3741044
> VL-3741140
> VL-3741161
> VL-3741235
>
> I haven't done much with kmeans before, so I wasn't sure if this was
> an unexpected behavior or not.
>
