Regardless of what you are trying to do, the best practice is to actually
prototype the process in R or Matlab first to make sure you are getting
results that make sense to you. Then, once you have figured out what seems
to be working, you can turn to large scale. SSVD is just svd in R, and I
haven't used k-means or any other clustering there, but I am sure it is
available there too.
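For what it's worth, the prototyping advice above can be sketched in Python/NumPy rather than R (the toy matrix, the choice of k, and the naive Lloyd's-iteration k-means below are illustrative assumptions, not anything from this thread):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "document-term" matrix: 100 docs, 50 terms, built from 3 latent topics
A = rng.random((100, 3)) @ rng.random((3, 50))

# truncated SVD: keep k components (roughly what SSVD computes at scale)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
USigma = U[:, :k] * s[:k]  # reduced-dimension representation of the docs

# naive k-means (Lloyd's algorithm) on the reduced space
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center by squared L2 distance
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # recompute centers; keep the old center if a cluster went empty
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

labels = kmeans(USigma, 3)
print(USigma.shape, sorted(set(labels)))
```

Once results at this scale look sane, the same pipeline maps onto Mahout's ssvd and kmeans jobs.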
Same goes for the sphere projections and PCA.

On Mon, Oct 22, 2012 at 11:13 AM, Dmitriy Lyubimov <[email protected]> wrote:
> I meant, "soft clustering"
>
> On Mon, Oct 22, 2012 at 11:06 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> From JIRA:
>>
>>> Hi Dmitriy, sorry for going a little off topic here, but could you
>>> elaborate on this? I've been experimenting with using either cosine or
>>> Tanimoto distance on the USigma output of ssvd with -pca true. Are those
>>> not appropriate distance measures for the -pca output?
>>
>> Let somebody correct me if I am talking nonsense here...
>>
>> Strictly speaking, you can find clusters using L2 distance (i.e.
>> Euclidean distance). In that case, PCA helps you by reducing
>> dimensionality, and the USigma output will preserve the original
>> distances (or at least the proportions of those). K-means with L2 will
>> then work a little faster.
>>
>> But... PCA does not preserve cosine or Tanimoto distances, due to the
>> recentering of the original data, so dimensionality reduction doesn't
>> help as much for those. Here you basically have just two recourses:
>> 1) do LSA (in terms of SSVD, that means --pca false, taking the U
>> output as the document-topic space), or 2) perhaps do a sphere
>> projection first and then do dimensionality reduction with --pca true.
>> The latter will at least preserve cosine distances, as far as I can
>> tell. But the standard way to address topical "soft clustering" of
>> text is still LSA. (If that's your goal, within the Mahout realm I
>> should probably also draw your attention to Mahout's LDA-cvb method;
>> various researchers say LDA actually does a better job of finding
>> topic mixtures.)
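The sphere-projection point above is easy to verify at small scale: after scaling each row to unit L2 norm, squared Euclidean distance and cosine distance carry the same information (||a - b||^2 = 2(1 - cos(a, b))), while mean-centering (effectively what --pca true does) changes the cosines. A NumPy sketch, with toy data assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((20, 10)) + 1.0  # toy all-positive "document" vectors

# sphere projection: scale each row to unit L2 norm
Xs = X / np.linalg.norm(X, axis=1, keepdims=True)

# on the unit sphere, squared Euclidean distance encodes cosine distance:
# ||a - b||^2 = |a|^2 + |b|^2 - 2 a.b = 2 * (1 - cos(a, b))
a, b = Xs[0], Xs[1]
assert np.isclose(((a - b) ** 2).sum(), 2 * (1 - a @ b))

# mean-centering the data, as PCA does, changes the cosines
Xc = X - X.mean(axis=0)
cos_before = (X[0] @ X[1]) / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
cos_after = (Xc[0] @ Xc[1]) / (np.linalg.norm(Xc[0]) * np.linalg.norm(Xc[1]))
print(cos_before, cos_after)
```

So k-means with L2 on the USigma output of sphere-projected data should still respect the original cosine geometry, which is the rationale for option 2 above.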
>>
>> On Mon, Oct 22, 2012 at 7:29 AM, Matt Molek <[email protected]> wrote:
>>> I've done some more testing and submitted a JIRA:
>>> https://issues.apache.org/jira/browse/MAHOUT-1103
>>>
>>> On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
>>>> Thanks for the quick response!
>>>>
>>>> I will do some testing tomorrow with various numbers of clusters and
>>>> create a JIRA once I have those results. I might be able to contribute
>>>> a patch for this if I have the time.
>>>>
>>>> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
>>>> <[email protected]> wrote:
>>>>> "So if that's correct, is that what's happening to me? Half of my
>>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>> directory, especially not 50+% of them."
>>>>>
>>>>> This might be correct. I think this can occur when the number of
>>>>> clusters is large; the testing was not done with so many clusters.
>>>>> Can you help a bit in testing some scenarios?
>>>>>
>>>>> a) Try reducing the number of clusters to 100 and then 50. The goal is
>>>>> to find the breaking point (the number of clusters) after which the
>>>>> clusters start converging. If this is found, then we would be sure
>>>>> that the problem lies in the partitioner.
>>>>> b) If you want, try a different partitioner. The idea is to create
>>>>> as many reducer tasks as the number of (non-empty) clusters found, so
>>>>> that the vectors of each cluster end up in a separate file and can
>>>>> later be moved to their respective directories (named by cluster id).
>>>>>
>>>>> Please also create a JIRA for this:
>>>>> https://issues.apache.org/jira/browse/MAHOUT
>>>>> And if you are interested, this would be a good starting point to
>>>>> contribute to Mahout as well.
>>>>>
>>>>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>>>>
>>>>>> First off, thank you everyone for your help so far. This mailing list
>>>>>> has been a great help in getting me up and running with Mahout.
>>>>>>
>>>>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>>>>> Then I'm using clusterpp to split the documents up into directories
>>>>>> containing the vectors belonging to each cluster. After I perform the
>>>>>> clustering, clusterdump shows that each cluster has between ~800 and
>>>>>> ~200,000 documents. This isn't a great spread, but the point is that
>>>>>> none of the clusters are empty.
>>>>>>
>>>>>> Here are my commands:
>>>>>>
>>>>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>>>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>>>>> -k 300 -x 15 -cl -ow
>>>>>>
>>>>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o
>>>>>> clusterdump.txt
>>>>>>
>>>>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>>>>
>>>>>> Since none of my clusters are empty, I would expect clusterpp to
>>>>>> create 300 directories in "bottom", one for each cluster. Instead,
>>>>>> only 147 directories are created. The other 153 outputs are just empty
>>>>>> part-r-* files sitting in the "bottom" directory.
>>>>>>
>>>>>> I haven't found much information when searching on this issue, but
>>>>>> I did come across one mailing list post from a while back:
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>>>>
>>>>>> In that discussion someone said, "If that is the only thing that is
>>>>>> contained in the part-r-* file [it had no vectors], then the reducer
>>>>>> responsible to write to that part-r-* file did not receive any input
>>>>>> records to write to it.
>>>>>> This happens because the program uses the
>>>>>> default hash partitioner, which sometimes maps records belonging to
>>>>>> different clusters to the same reducer, thus leaving some reducers
>>>>>> without any input records."
>>>>>>
>>>>>> So if that's correct, is that what's happening to me? Half of my
>>>>>> clusters are being sent to overlapping reducers? That seems like a
>>>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>>> directory, especially not 50+% of them.
>>>>>>
>>>>>> One final detail: I'm not sure if this matters, but the clusters
>>>>>> output by kmeans are not numbered 1 to 300. They have an odd-looking,
>>>>>> nonsequential numbering. The first 5 clusters are:
>>>>>> VL-3740844
>>>>>> VL-3741044
>>>>>> VL-3741140
>>>>>> VL-3741161
>>>>>> VL-3741235
>>>>>>
>>>>>> I haven't done much with kmeans before, so I wasn't sure if this was
>>>>>> unexpected behavior or not.
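The empty part-r-* files described in that quote are consistent with plain hash partitioning: when N distinct cluster ids are hashed into N reducers, collisions leave a sizable fraction of reducers (on the order of 1/e, about 37%, for uniform hashing) with no input at all. A small simulation sketch in Python; the CRC32 hash and the synthetic VL-* ids are stand-ins for Hadoop's HashPartitioner and the real cluster ids, not taken from this thread:

```python
import zlib

num_clusters = 300
# hypothetical nonsequential cluster ids, in the spirit of VL-3740844 etc.
cluster_ids = [f"VL-{3740000 + 7 * i}" for i in range(num_clusters)]

# mimic hash(key) % numReduceTasks with a deterministic hash function
num_reducers = num_clusters
used = {zlib.crc32(cid.encode()) % num_reducers for cid in cluster_ids}
empty = num_reducers - len(used)
print(f"{empty} of {num_reducers} reducers receive no cluster at all")
```

The empty reducers explain the empty part files; the flip side of each collision is that two clusters' vectors land in the same reducer output, which is why a cluster-id-aware partitioner (one reducer per non-empty cluster, as suggested earlier in the thread) is the natural fix.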
