i meant, "soft clustering"
On Mon, Oct 22, 2012 at 11:06 AM, Dmitriy Lyubimov <[email protected]> wrote:
> from Jira:
>
>> Hi Dmitriy, sorry for going a little off topic here, but could you
>> elaborate on this? I've been experimenting with using either cosine or
>> tanimoto distance on the USigma output of ssvd with -pca true. Are those
>> not appropriate distance measures for the -pca output?
>
> Let somebody correct me if i am talking nonsense here...
>
> Strictly speaking, you can find clusters using L2 distance (i.e.
> euclidean distance). In that case, PCA helps you by reducing
> dimensionality, and the USigma output will preserve the original
> distances (or at least the proportions of those). K-means with L2 will
> then work a little faster.
>
> But... with cosine and Tanimoto, PCA does not preserve those, due to the
> recentering of the original data, so dimensionality reduction doesn't
> help as much for these types of things. Here you basically have just two
> recourses: 1) do LSA (in terms of SSVD, that means --pca false, taking
> the U output as the document-topic space), or 2) perhaps do sphere
> projection first and then do dimensionality reduction with --pca true.
> The latter will at least preserve cosine distances as far as i can tell.
> But the standard way to address topical "sort clustering" with text is
> still LSA. (If that's your goal, within the Mahout realm i should
> probably also draw your attention to the LDA-cvb method in Mahout;
> various studies say LDA actually does a better job of finding topic
> mixtures.)
>
> On Mon, Oct 22, 2012 at 7:29 AM, Matt Molek <[email protected]> wrote:
>> I've done some more testing and submitted a JIRA:
>> https://issues.apache.org/jira/browse/MAHOUT-1103
>>
>> On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
>>> Thanks for the quick response!
>>>
>>> I will do some testing tomorrow with various numbers of clusters and
>>> create a JIRA once I have those results.
>>> I might be able to contribute
>>> a patch for this if I have the time.
>>>
>>> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
>>> <[email protected]> wrote:
>>>> "So if that's correct, is that what's happening to me? Half of my
>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>> can't have documents randomly being sent to the wrong cluster's
>>>> directory, especially not 50+% of them."
>>>>
>>>> This might be correct. I think this can occur if the number of
>>>> clusters is large, and the testing was not done with so many clusters.
>>>> Can you help a bit in testing some scenarios?
>>>>
>>>> a) Try reducing the number of clusters to 100 and then 50. The aim is
>>>> to find the breaking point (the number of clusters) after which the
>>>> clusters start converging. If this is found, then we would be sure
>>>> that the problem lies in the partitioner.
>>>> b) If you want, try a different partitioner. The idea is to create as
>>>> many reducer tasks as the number of (non-empty) clusters found, so
>>>> that the vectors present in each cluster end up in a separate file and
>>>> are later moved to their respective directories (named on cluster id).
>>>>
>>>> Please also create a JIRA for this:
>>>> https://issues.apache.org/jira/browse/MAHOUT.
>>>> And if you are interested, this would be a good starting point for
>>>> contributing to Mahout as well.
>>>>
>>>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>>>
>>>>> First off, thank you everyone for your help so far. This mailing
>>>>> list has been a great help in getting me up and running with Mahout.
>>>>>
>>>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>>>> Then I'm using clusterpp to split the documents up into directories
>>>>> containing the vectors belonging to each cluster.
>>>>> After I perform the
>>>>> clustering, clusterdump shows that each cluster has between ~800 and
>>>>> ~200,000 documents. This isn't a great spread, but the point is that
>>>>> none of the clusters are empty.
>>>>>
>>>>> Here are my commands:
>>>>>
>>>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>>>> -k 300 -x 15 -cl -ow
>>>>>
>>>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o
>>>>> clusterdump.txt
>>>>>
>>>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>>>
>>>>> Since none of my clusters are empty, I would expect clusterpp to
>>>>> create 300 directories in "bottom", one for each cluster. Instead,
>>>>> only 147 directories are created. The other 153 outputs are just
>>>>> empty part-r-* files sitting in the "bottom" directory.
>>>>>
>>>>> I haven't found too much information when searching on this issue,
>>>>> but I did come across one mailing list post from a while back:
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>>>
>>>>> In that discussion someone said, "If that is the only thing that is
>>>>> contained in the part-r-* file [it had no vectors], then the reducer
>>>>> responsible to write to that part-r-* file did not receive any input
>>>>> records to write to it. This happens because the program uses the
>>>>> default hash partitioner which sometimes maps records belonging to
>>>>> different clusters to a same reducer; thus leaving some reducers
>>>>> without any input records."
>>>>>
>>>>> So if that's correct, is that what's happening to me? Half of my
>>>>> clusters are being sent to the overlapping reducers? That seems like
>>>>> a big issue, making clusterpp pretty much useless for my purposes. I
>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>> directory, especially not 50+% of them.
>>>>>
>>>>> One final detail: I'm not sure if this matters, but the clusters
>>>>> output by kmeans are not numbered 1 to 300. They have an odd-looking,
>>>>> nonsequential numbering sequence. The first 5 clusters are:
>>>>>
>>>>> VL-3740844
>>>>> VL-3741044
>>>>> VL-3741140
>>>>> VL-3741161
>>>>> VL-3741235
>>>>>
>>>>> I haven't done much with kmeans before, so I wasn't sure if this was
>>>>> an unexpected behavior or not.
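Dmitriy's point above — that the mean-centering done by --pca true distorts cosine, while unit-normalizing ("sphering") the rows first keeps cosine recoverable from Euclidean distance — can be sketched with toy vectors. This is not Mahout code, and the vectors are made up purely for illustration:

```python
# Toy sketch (not Mahout code): why recentering breaks cosine, and why
# sphering first keeps cosine structure through a PCA-style reduction.
from math import sqrt

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return sqrt(dot(u, u))
def cosine(u, v): return dot(u, v) / (norm(u) * norm(v))
def euclid(u, v): return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = [[3.0, 1.0, 0.0], [6.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # made-up vectors

# Mean-centering changes cosine: docs[0] and docs[1] point the same way
# (cosine ~ 1.0), but not after subtracting the column means.
mean = [sum(col) / len(docs) for col in zip(*docs)]
centered = [[x - m for x, m in zip(d, mean)] for d in docs]
print(cosine(docs[0], docs[1]))          # ~1.0
print(cosine(centered[0], centered[1]))  # much less than 1.0

# Sphere projection first: for unit vectors, ||u - v||^2 == 2*(1 - cos),
# and centering/rotation preserve Euclidean distances (truncation only
# approximately), so cosine structure survives the reduction.
unit = [[x / norm(d) for x in d] for d in docs]
lhs = euclid(unit[0], unit[2]) ** 2
rhs = 2 * (1 - cosine(docs[0], docs[2]))
print(abs(lhs - rhs) < 1e-9)             # True
```

The identity in the last step is why clustering sphered data with L2 is, up to a monotone transform, clustering the original data by cosine.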
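The default-hash-partitioner behavior quoted in the thread can be simulated outside Hadoop. The sketch below reimplements Java's String.hashCode and the formula Hadoop's HashPartitioner uses ((hashCode & Integer.MAX_VALUE) % numReduceTasks) in Python; the VL-… keys are fabricated stand-ins for real cluster names, and clusterpp's actual key type may differ:

```python
# Sketch of Hadoop's default partitioning, not actual clusterpp code.
# The VL-<id> keys are modeled on the cluster names in this thread, but
# the ids and their spacing are made up.

def java_string_hash(s):
    """Java's String.hashCode(): h = 31*h + char, with 32-bit wraparound."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h  # reinterpret as signed

def hash_partition(key, num_reducers):
    # HashPartitioner.getPartition(): (hashCode & Integer.MAX_VALUE) % n
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

num_clusters = 300
keys = ["VL-%d" % (3740844 + 97 * i) for i in range(num_clusters)]

# Reducers that actually receive at least one cluster's records:
used = {hash_partition(k, num_clusters) for k in keys}
print(len(used))  # noticeably fewer than 300 -> the rest emit empty part files
```

With n distinct keys hashed into n buckets, only about n·(1 − 1/e) ≈ 63% of buckets are expected to be hit, which is the same ballpark as the 147 non-empty directories out of 300 reported here — consistent with the collision explanation rather than a data problem.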
