I've done some more testing and submitted a JIRA: https://issues.apache.org/jira/browse/MAHOUT-1103
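For anyone arriving here from the JIRA: the thread below traces the empty part-r-* files to Hadoop's default hash partitioner. As a minimal sketch of the fix direction suggested further down (a dedicated reducer per non-empty cluster), something like the following could work. It assumes clusterpp keys its reduce input by integer cluster ID as an IntWritable; the class name and the "clusters.ids" property are hypothetical, and this is not the actual MAHOUT-1103 patch.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Partitioner;

    /* Sketch: give every known cluster its own reducer. The driver is
       assumed to set "clusters.ids" to a comma-separated list of the
       non-empty cluster IDs and to request one reduce task per ID. */
    public class ClusterIdPartitioner extends Partitioner<IntWritable, Writable>
        implements Configurable {

      private Configuration conf;
      private final Map<Integer, Integer> idToPartition = new HashMap<>();

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        int index = 0;
        // Build a dense cluster-ID -> partition index once per task.
        for (String id : conf.get("clusters.ids", "").split(",")) {
          if (!id.trim().isEmpty()) {
            idToPartition.put(Integer.parseInt(id.trim()), index++);
          }
        }
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      @Override
      public int getPartition(IntWritable clusterId, Writable value, int numPartitions) {
        // Known cluster IDs get a dedicated partition; unknown IDs fall
        // back to hashing so the job still completes.
        Integer partition = idToPartition.get(clusterId.get());
        return partition != null
            ? partition % numPartitions
            : (clusterId.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The driver would enumerate the non-empty cluster IDs from the final clusters directory, set "clusters.ids" accordingly, and request one reduce task per listed ID.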
On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
> Thanks for the quick response!
>
> I will do some testing tomorrow with various numbers of clusters and
> create a JIRA once I have those results. I might be able to contribute
> a patch for this if I have the time.
>
> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
> <[email protected]> wrote:
>> "So if that's correct, is that what's happening to me? Half of my
>> clusters are being sent to the overlapping reducers? That seems like a
>> big issue, making clusterpp pretty much useless for my purposes. I
>> can't have documents randomly being sent to the wrong cluster's
>> directory, especially not 50+% of them."
>>
>> This might be correct. I think this can occur when the number of
>> clusters is large; the testing was not done with this many clusters.
>> Can you help a bit by testing some scenarios?
>>
>> a) Try reducing the number of clusters to 100 and then 50. The goal is
>> to find the breaking point (the number of clusters) after which
>> clusters start colliding. If we find it, we can be fairly sure the
>> problem lies in the partitioner.
>> b) If you want, try a different partitioner. The idea is to create as
>> many reducer tasks as there are (non-empty) clusters, so that the
>> vectors in each cluster end up in a separate file and can later be
>> moved to their respective directories (named after the cluster ID).
>>
>> Please also create a JIRA for this:
>> https://issues.apache.org/jira/browse/MAHOUT
>> And if you are interested, this would also be a good starting point
>> for contributing to Mahout.
>>
>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>
>>> First off, thank you everyone for your help so far. This mailing list
>>> has been a great help getting me up and running with Mahout.
>>>
>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>> Then I'm using clusterpp to split the documents up into directories
>>> containing the vectors belonging to each cluster. After I perform the
>>> clustering, clusterdump shows that each cluster has between ~800 and
>>> ~200,000 documents. This isn't a great spread, but the point is that
>>> none of the clusters are empty.
>>>
>>> Here are my commands:
>>>
>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>> -k 300 -x 15 -cl -ow
>>>
>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>>>
>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>
>>> Since none of my clusters are empty, I would expect clusterpp to
>>> create 300 directories in "bottom", one for each cluster. Instead,
>>> only 147 directories are created. The other 153 outputs are just
>>> empty part-r-* files sitting in the "bottom" directory.
>>>
>>> I haven't found much information when searching on this issue, but I
>>> did come across one mailing list post from a while back:
>>>
>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>
>>> In that discussion someone said, "If that is the only thing that is
>>> contained in the part-r-* file [it had no vectors], then the reducer
>>> responsible for writing to that part-r-* file did not receive any
>>> input records. This happens because the program uses the default hash
>>> partitioner, which sometimes maps records belonging to different
>>> clusters to the same reducer, thus leaving some reducers without any
>>> input records."
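To make the quoted explanation concrete: Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and IntWritable.hashCode() is simply the wrapped value. A toy check, assuming clusterpp keys records by integer cluster ID, using the five cluster IDs listed at the end of this message:

    import org.apache.hadoop.io.IntWritable;

    /* Shows which reducer the default partitioner would pick for a few
       real cluster IDs from this thread, given 300 reduce tasks. */
    public class PartitionCollisionDemo {
      public static void main(String[] args) {
        int numReduceTasks = 300;
        int[] clusterIds = {3740844, 3741044, 3741140, 3741161, 3741235};
        for (int id : clusterIds) {
          int partition =
              (new IntWritable(id).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
          System.out.printf("cluster VL-%d -> reducer %d%n", id, partition);
        }
      }
    }

These five happen to land on distinct reducers, but with 300 arbitrary IDs spread over only 300 reducers, birthday-style collisions are essentially guaranteed, and every collision leaves some other reducer with no input at all.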
>>> So if that's correct, is that what's happening to me? Are half of my
>>> clusters being sent to overlapping reducers? That seems like a big
>>> issue, making clusterpp pretty much useless for my purposes. I can't
>>> have documents randomly being sent to the wrong cluster's directory,
>>> especially not 50+% of them.
>>>
>>> One final detail: I'm not sure if this matters, but the clusters
>>> output by kmeans are not numbered 1 to 300. They have odd-looking,
>>> nonsequential IDs. The first 5 clusters are:
>>> VL-3740844
>>> VL-3741044
>>> VL-3741140
>>> VL-3741161
>>> VL-3741235
>>>
>>> I haven't done much with kmeans before, so I wasn't sure whether this
>>> was unexpected behavior or not.
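A back-of-the-envelope estimate is consistent with the partitioner theory: if 300 distinct cluster IDs hashed uniformly into 300 partitions, each partition would be empty with probability (1 - 1/300)^300 ≈ e^-1, so roughly 110 of the 300 reducers would receive nothing. The 153 empty part-r-* files observed above are in that ballpark (real IDs need not hash uniformly, so more collisions are plausible). Uniform hashing is the only assumption here:

    /* Expected number of empty reducers when k distinct keys hash
       uniformly into n partitions: n * (1 - 1/n)^k. */
    public class EmptyReducerEstimate {
      public static void main(String[] args) {
        int n = 300;  // reduce tasks
        int k = 300;  // distinct cluster IDs
        double expectedEmpty = n * Math.pow(1.0 - 1.0 / n, k);
        System.out.printf("expected empty reducers: %.1f of %d%n", expectedEmpty, n);
        // Prints ~110.2 for n = k = 300.
      }
    }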
