i meant, "soft clustering"
On Mon, Oct 22, 2012 at 11:06 AM, Dmitriy Lyubimov <[email protected]> wrote:
> from Jira:
>
>> Hi Dmitriy, sorry for going a little off topic here, but could you
>> elaborate on this? I've been experimenting with using either cosine or
>> tanimoto distance on the USigma output of ssvd with -pca true. Are those
>> not appropriate distance measures for the -pca output?
>
> Let somebody correct me if i am talking nonsense here...
>
> Strictly speaking, you can find clusters using L2 distance (i.e.
> euclidean distance). In that case, PCA helps you by reducing
> dimensionality, and the USigma output will preserve the original
> distances (or at least the proportions of those). K-means with L2 will
> then work a little faster.
>
> But... with cosine and Tanimoto, PCA does not preserve those, due to the
> recentering of the original data, so dimensionality reduction doesn't
> help as much for these types of things. Here you basically have just two
> recourses: 1) do LSA (in terms of SSVD, that means --pca false, taking
> the U output as the document-topic space), or 2) perhaps do sphere
> projection first and then do dimensionality reduction with --pca true.
> The latter will at least preserve cosine distances as far as i can tell.
> But the standard way to address topical "sort clustering" with text is
> still LSA. (If that's your goal, within the Mahout realm i should
> probably also draw your attention to the LDA-cvb method in Mahout;
> various studies say LDA actually does a better job of finding topic
> mixtures.)
>
> On Mon, Oct 22, 2012 at 7:29 AM, Matt Molek <[email protected]> wrote:
>> I've done some more testing and submitted a JIRA:
>> https://issues.apache.org/jira/browse/MAHOUT-1103
>>
>> On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
>>> Thanks for the quick response!
>>>
>>> I will do some testing tomorrow with various numbers of clusters and
>>> create a JIRA once I have those results.
>>> I might be able to contribute
>>> a patch for this if I have the time.
>>>
>>> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
>>> <[email protected]> wrote:
>>>> "So if that's correct, is that what's happening to me? Half of my
>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>> can't have documents randomly being sent to the wrong cluster's
>>>> directory, especially not 50+% of them."
>>>>
>>>> This might be correct. I think this can occur if the number of
>>>> clusters is large, and the testing was not done with so many clusters.
>>>> Can you help a bit in testing some scenarios?
>>>>
>>>> a) Try reducing the number of clusters to 100 and then 50. The aim is
>>>> to find the breaking point (the number of clusters) after which the
>>>> clusters start converging. If this is found, then we would be sure
>>>> that the problem lies in the partitioner.
>>>> b) If you want, try a different partitioner. The idea is to create as
>>>> many reducer tasks as the number of (non-empty) clusters found, so
>>>> that the vectors present in each cluster end up in a separate file and
>>>> are later moved to their respective directories (named on cluster id).
>>>>
>>>> Please also create a JIRA for this:
>>>> https://issues.apache.org/jira/browse/MAHOUT.
>>>> And if you are interested, this would be a good starting point for
>>>> contributing to Mahout as well.
>>>>
>>>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>>>
>>>>> First off, thank you everyone for your help so far. This mailing
>>>>> list has been a great help in getting me up and running with Mahout.
>>>>>
>>>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>>>> Then I'm using clusterpp to split the documents up into directories
>>>>> containing the vectors belonging to each cluster.
>>>>> After I perform the
>>>>> clustering, clusterdump shows that each cluster has between ~800 and
>>>>> ~200,000 documents. This isn't a great spread, but the point is that
>>>>> none of the clusters are empty.
>>>>>
>>>>> Here are my commands:
>>>>>
>>>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>>>> -k 300 -x 15 -cl -ow
>>>>>
>>>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o
>>>>> clusterdump.txt
>>>>>
>>>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>>>
>>>>> Since none of my clusters are empty, I would expect clusterpp to
>>>>> create 300 directories in "bottom", one for each cluster. Instead,
>>>>> only 147 directories are created. The other 153 outputs are just
>>>>> empty part-r-* files sitting in the "bottom" directory.
>>>>>
>>>>> I haven't found too much information when searching on this issue,
>>>>> but I did come across one mailing list post from a while back:
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>>>
>>>>> In that discussion someone said, "If that is the only thing that is
>>>>> contained in the part-r-* file [it had no vectors], then the reducer
>>>>> responsible to write to that part-r-* file did not receive any input
>>>>> records to write to it. This happens because the program uses the
>>>>> default hash partitioner which sometimes maps records belonging to
>>>>> different clusters to a same reducer; thus leaving some reducers
>>>>> without any input records."
>>>>>
>>>>> So if that's correct, is that what's happening to me? Half of my
>>>>> clusters are being sent to the overlapping reducers? That seems like
>>>>> a big issue, making clusterpp pretty much useless for my purposes. I
>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>> directory, especially not 50+% of them.
>>>>>
>>>>> One final detail: I'm not sure if this matters, but the clusters
>>>>> output by kmeans are not numbered 1 to 300. They have an odd-looking,
>>>>> nonsequential numbering sequence. The first 5 clusters are:
>>>>>
>>>>> VL-3740844
>>>>> VL-3741044
>>>>> VL-3741140
>>>>> VL-3741161
>>>>> VL-3741235
>>>>>
>>>>> I haven't done much with kmeans before, so I wasn't sure if this was
>>>>> an unexpected behavior or not.
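Dmitriy's point above — that the mean-centering done by --pca true distorts cosine, while unit-normalizing ("sphering") the rows first keeps cosine recoverable from Euclidean distance — can be sketched with toy vectors. This is not Mahout code, and the vectors are made up purely for illustration:

```python
# Toy sketch (not Mahout code): why recentering breaks cosine, and why
# sphering first keeps cosine structure through a PCA-style reduction.
from math import sqrt

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return sqrt(dot(u, u))
def cosine(u, v): return dot(u, v) / (norm(u) * norm(v))
def euclid(u, v): return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = [[3.0, 1.0, 0.0], [6.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # made-up vectors

# Mean-centering changes cosine: docs[0] and docs[1] point the same way
# (cosine ~ 1.0), but not after subtracting the column means.
mean = [sum(col) / len(docs) for col in zip(*docs)]
centered = [[x - m for x, m in zip(d, mean)] for d in docs]
print(cosine(docs[0], docs[1]))          # ~1.0
print(cosine(centered[0], centered[1]))  # much less than 1.0

# Sphere projection first: for unit vectors, ||u - v||^2 == 2*(1 - cos),
# and centering/rotation preserve Euclidean distances (truncation only
# approximately), so cosine structure survives the reduction.
unit = [[x / norm(d) for x in d] for d in docs]
lhs = euclid(unit[0], unit[2]) ** 2
rhs = 2 * (1 - cosine(docs[0], docs[2]))
print(abs(lhs - rhs) < 1e-9)             # True
```

The identity in the last step is why clustering sphered data with L2 is, up to a monotone transform, clustering the original data by cosine.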
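The default-hash-partitioner behavior quoted in the thread can be simulated outside Hadoop. The sketch below reimplements Java's String.hashCode and the formula Hadoop's HashPartitioner uses ((hashCode & Integer.MAX_VALUE) % numReduceTasks) in Python; the VL-… keys are fabricated stand-ins for real cluster names, and clusterpp's actual key type may differ:

```python
# Sketch of Hadoop's default partitioning, not actual clusterpp code.
# The VL-<id> keys are modeled on the cluster names in this thread, but
# the ids and their spacing are made up.

def java_string_hash(s):
    """Java's String.hashCode(): h = 31*h + char, with 32-bit wraparound."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h  # reinterpret as signed

def hash_partition(key, num_reducers):
    # HashPartitioner.getPartition(): (hashCode & Integer.MAX_VALUE) % n
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

num_clusters = 300
keys = ["VL-%d" % (3740844 + 97 * i) for i in range(num_clusters)]

# Reducers that actually receive at least one cluster's records:
used = {hash_partition(k, num_clusters) for k in keys}
print(len(used))  # noticeably fewer than 300 -> the rest emit empty part files
```

With n distinct keys hashed into n buckets, only about n·(1 − 1/e) ≈ 63% of buckets are expected to be hit, which is the same ballpark as the 147 non-empty directories out of 300 reported here — consistent with the collision explanation rather than a data problem.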
