That's all very helpful. Thanks for your input!
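As a quick sanity check on the recentering point discussed in the thread below, here is a small numpy sketch (my own illustration, not Mahout or the R prototype code) showing that the centering step applied by PCA preserves pairwise Euclidean distances but not cosine similarities, and that projecting rows onto the unit sphere first ties cosine similarity directly to Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))          # 6 "documents", 4 features (toy data)

def pdists(M):
    """Pairwise Euclidean distances between rows of M."""
    return np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)

def cosines(M):
    """Pairwise cosine similarities between rows of M."""
    U = M / np.linalg.norm(M, axis=1, keepdims=True)
    return U @ U.T

# The recentering that -pca true performs: subtract the column mean.
Xc = X - X.mean(axis=0)

# Translation preserves Euclidean distances exactly...
assert np.allclose(pdists(X), pdists(Xc))
# ...but not cosine similarities:
assert not np.allclose(cosines(X), cosines(Xc))

# "Sphere projection" first: normalize each row to unit length.
S = X / np.linalg.norm(X, axis=1, keepdims=True)
# On the unit sphere, ||u - v||^2 = 2 - 2*cos(u, v), so Euclidean
# distance determines cosine similarity; and since centering keeps
# Euclidean distances intact, the cosine structure survives PCA.
assert np.allclose(pdists(S) ** 2, 2 - 2 * cosines(S))
```

Which, if I've understood the thread correctly, is why cosine/Tanimoto on the USigma output of `-pca true` isn't guaranteed to be meaningful, while L2 distance (or cosine after a sphere projection) is.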
On Mon, Oct 22, 2012 at 2:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> PPS finally, if you decide to prototype stuff in R with exact SSVD and
> PCA analogues of Mahout's SSVD: we prototyped them in R first too,
> before moving to the MR implementation, so you can use that in your
> prototype if you want to make sure you have very similar
> stochasticity effects. See the "R simulation" paragraph here
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> to download the R prototype code of the single-threaded SSVD/PCA
> versions of Mahout.
>
> Hope that helps.
>
> On Mon, Oct 22, 2012 at 11:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> Regardless of what you are trying to do, the best practice is to
>> prototype the process in R or Matlab first, to make sure you are
>> getting results that make sense to you. Then, once you have figured out
>> what seems to be working, you can turn to large scale. SSVD is just
>> svd in R, and I haven't used k-means or any other clustering there, but
>> I am sure it is available there too.
>>
>> Same goes for the sphere projections and PCA.
>>
>> On Mon, Oct 22, 2012 at 11:13 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> I meant, "soft clustering"
>>>
>>> On Mon, Oct 22, 2012 at 11:06 AM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>> From Jira:
>>>>
>>>>> Hi Dmitriy, sorry for going a little off topic here, but could you
>>>>> elaborate on this? I've been experimenting with using either cosine or
>>>>> Tanimoto distance on the USigma output of ssvd with -pca true. Are those
>>>>> not appropriate distance measures for the -pca output?
>>>>
>>>> Let somebody correct me if I am talking nonsense here...
>>>>
>>>> Strictly speaking, you can find clusters using L2 distance (i.e.
>>>> Euclidean distance). In that case, PCA helps you by reducing
>>>> dimensionality, and the USigma output will preserve the original
>>>> distances (or at least the proportions of those).
>>>> K-means with L2 will then work a little faster.
>>>>
>>>> But... with cosine and Tanimoto, PCA does not preserve those distances,
>>>> due to the recentering of the original data; therefore, dimensionality
>>>> reduction doesn't work as well for these types of things. Here you
>>>> basically have just two recourses: 1) do LSA (in terms of SSVD, that
>>>> means --pca false, and take the U output for the document topic space),
>>>> or 2) perhaps do the sphere projection first and then do dimensionality
>>>> reduction with --pca true. The latter will at least preserve cosine
>>>> distances, as far as I can tell. But the standard way to address topical
>>>> "soft clustering" with text is still LSA. (If that's your goal, within
>>>> the Mahout realm I probably also need to draw your attention to the
>>>> LDA-CVB method in Mahout; various studies say LDA actually does a
>>>> better job of finding topic mixtures.)
>>>>
>>>> On Mon, Oct 22, 2012 at 7:29 AM, Matt Molek <[email protected]> wrote:
>>>>> I've done some more testing and submitted a JIRA:
>>>>> https://issues.apache.org/jira/browse/MAHOUT-1103
>>>>>
>>>>> On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
>>>>>> Thanks for the quick response!
>>>>>>
>>>>>> I will do some testing tomorrow with various numbers of clusters and
>>>>>> create a JIRA once I have those results. I might be able to contribute
>>>>>> a patch for this if I have the time.
>>>>>>
>>>>>> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
>>>>>> <[email protected]> wrote:
>>>>>>> "So if that's correct, is that what's happening to me? Half of my
>>>>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>>>> directory, especially not 50+% of them."
>>>>>>>
>>>>>>> This might be correct.
>>>>>>> I think this can occur if the number of clusters is
>>>>>>> large, and the testing was not done with so many clusters.
>>>>>>> Can you help a bit in testing some scenarios?
>>>>>>>
>>>>>>> a) Try reducing the number of clusters to 100 and then 50. The goal is
>>>>>>> to find the breaking point (number of clusters) after which the
>>>>>>> clusters start converging. If this is found, then we would be sure
>>>>>>> that the problem lies in the partitioner.
>>>>>>> b) If you want, try using a different partitioner. The idea is to
>>>>>>> create as many reducer tasks as the number of (non-empty) clusters
>>>>>>> found, so that the vectors present in each cluster are in a separate
>>>>>>> file, and later they are moved to their respective directories (named
>>>>>>> on cluster id).
>>>>>>>
>>>>>>> Please also create a JIRA for this:
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT.
>>>>>>> And if you are interested, this would be a good starting point to
>>>>>>> contribute to Mahout as well.
>>>>>>>
>>>>>>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>>>>>>
>>>>>>>> First off, thank you everyone for your help so far. This mailing list
>>>>>>>> has been a great help getting me up and running with Mahout.
>>>>>>>>
>>>>>>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>>>>>>> Then I'm using clusterpp to split the documents up into directories
>>>>>>>> containing the vectors belonging to each cluster. After I perform the
>>>>>>>> clustering, clusterdump shows that each cluster has between ~800 and
>>>>>>>> ~200,000 documents. This isn't a great spread, but the point is that
>>>>>>>> none of the clusters are empty.
>>>>>>>> Here are my commands:
>>>>>>>>
>>>>>>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>>>>>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>>>>>>> -k 300 -x 15 -cl -ow
>>>>>>>>
>>>>>>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o
>>>>>>>> clusterdump.txt
>>>>>>>>
>>>>>>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>>>>>>
>>>>>>>> Since none of my clusters are empty, I would expect clusterpp to
>>>>>>>> create 300 directories in "bottom", one for each cluster. Instead,
>>>>>>>> only 147 directories are created. The other 153 outputs are just empty
>>>>>>>> part-r-* files sitting in the "bottom" directory.
>>>>>>>>
>>>>>>>> I haven't found much information when searching on this issue, but
>>>>>>>> I did come across one mailing list post from a while back:
>>>>>>>>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>>>>>>
>>>>>>>> In that discussion someone said, "If that is the only thing that is
>>>>>>>> contained in the part-r-* file [it had no vectors], then the reducer
>>>>>>>> responsible for writing to that part-r-* file did not receive any
>>>>>>>> input records to write to it. This happens because the program uses
>>>>>>>> the default hash partitioner, which sometimes maps records belonging
>>>>>>>> to different clusters to the same reducer, thus leaving some reducers
>>>>>>>> without any input records."
>>>>>>>>
>>>>>>>> So if that's correct, is that what's happening to me? Half of my
>>>>>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>>>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>>>>> directory, especially not 50+% of them.
>>>>>>>>
>>>>>>>> One final detail: I'm not sure if this matters, but the clusters
>>>>>>>> output by kmeans are not numbered 1 to 300.
>>>>>>>> They have an odd-looking,
>>>>>>>> nonsequential numbering. The first 5 clusters are:
>>>>>>>> VL-3740844
>>>>>>>> VL-3741044
>>>>>>>> VL-3741140
>>>>>>>> VL-3741161
>>>>>>>> VL-3741235
>>>>>>>>
>>>>>>>> I haven't done much with kmeans before, so I wasn't sure whether
>>>>>>>> this was unexpected behavior or not.
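The 147-of-300 result is consistent with the default-hash-partitioner explanation quoted above. Here is a rough Python simulation (not Mahout code; the VL-... cluster IDs below are made up, and I'm assuming the cluster ID string acts as the map output key) of Hadoop's default partitioning, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`:

```python
def java_string_hash(s):
    """Java's String.hashCode(), with 32-bit signed overflow semantics."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

num_clusters = 300
num_reducers = 300

# Hypothetical nonsequential IDs in the style of the VL-... names above.
keys = ["VL-%d" % (3740000 + 7 * i) for i in range(num_clusters)]

# Hadoop's default HashPartitioner: (hash & Integer.MAX_VALUE) % reducers.
partitions = {(java_string_hash(k) & 0x7FFFFFFF) % num_reducers
              for k in keys}

print("non-empty reducers:", len(partitions), "of", num_reducers)
# If the hashes were uniformly distributed, 300 keys thrown into 300
# buckets would be expected to occupy only about
# 300 * (1 - (1 - 1/300)**300) ~ 190 distinct buckets. Collisions leave
# the remaining part-r-* files empty, and the colliding clusters end up
# sharing a reducer -- the behavior described in the quoted post.
```

The simulated occupancy is the same order as the 147 directories observed, which suggests the empty part files are an expected artifact of hash collisions rather than lost data, though only the partitioner fix discussed above actually solves the one-cluster-per-file problem.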
