That's all very helpful. Thanks for your input!

On Mon, Oct 22, 2012 at 2:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> PPS: Finally, if you decide to prototype stuff in R with an exact SSVD and
> PCA analogue of Mahout's SSVD, we prototyped those in R first
> as well before moving to the MR implementation, so you can use that in your
> prototype too if you want to make sure you have very similar
> stochasticity effects. See the "R simulation" paragraph here:
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> to download the R prototype code of the single-threaded SSVD/PCA versions
> of Mahout.
>
> hope that helps.
>
> On Mon, Oct 22, 2012 at 11:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> Regardless of what you are trying to do, the best practice is actually
>> to prototype the process in R or Matlab first to make sure you are
>> getting results that make sense to you. Then, once you have figured out
>> what seems to be working, you can turn to large scale. SSVD is just
>> svd in R, and i haven't used k-means or any other clustering there but
>> i am sure it is available there too.
>>
>> Same goes for the sphere projections and pca.
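For readers without R handy, the prototyping advice above can be sketched in numpy as well. This is only a rough, illustrative analogue of what `ssvd -pca true` computes (the matrix, `k`, and shapes are made up; Mahout's actual SSVD is randomized and distributed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))  # toy document-feature matrix
k = 5  # number of singular values/vectors to keep

# PCA variant: subtract the column means first (conceptually what
# ssvd -pca true does before the decomposition)
mean = A.mean(axis=0)
U, s, Vt = np.linalg.svd(A - mean, full_matrices=False)

# "USigma" output: documents projected into the k-dimensional latent space
USigma = U[:, :k] * s[:k]
print(USigma.shape)  # (100, 5)
```

Once this small-scale version produces clusters that make sense, the same pipeline can be moved to the large-scale Mahout jobs.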
>>
>>
>>
>> On Mon, Oct 22, 2012 at 11:13 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> i meant, "soft clustering"
>>>
>>> On Mon, Oct 22, 2012 at 11:06 AM, Dmitriy Lyubimov <[email protected]> 
>>> wrote:
>>>> from Jira:
>>>>
>>>>> Hi Dmitriy, sorry for going a little off topic here, but could you 
>>>>> elaborate on this? I've been experimenting with using either cosine or 
>>>>> tanimoto distance on the USigma output of ssvd with -pca true. Are those 
>>>>> not appropriate distance measures for the -pca output?
>>>>
>>>> Let somebody correct me if i am talking nonsense here...
>>>>
>>>> Strictly speaking, you can find clusters using L2 distance (i.e.
>>>> euclidean distance). In that case, PCA helps you by reducing
>>>> dimensionality, and the USigma output will preserve the original distances
>>>> (or at least their proportions). K-means with L2 will then work a
>>>> little faster.
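The distance-preservation claim above is easy to check numerically. In the full-rank case, USigma is just a rotation of the recentred data, so pairwise L2 distances are exactly preserved (with a truncated `k` they are only approximated). A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
Ac = A - A.mean(axis=0)  # recentred data, as with -pca true

U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
USigma = U * s  # full rank: USigma = Ac @ Vt.T, a pure rotation

# L2 distance between two rows is identical before and after projection
d_orig = np.linalg.norm(Ac[0] - Ac[1])
d_proj = np.linalg.norm(USigma[0] - USigma[1])
print(abs(d_orig - d_proj) < 1e-9)  # True
```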
>>>>
>>>> But... with cosine and Tanimoto, PCA does not preserve those due to
>>>> recentering of the original data; therefore, dimensionality reduction
>>>> doesn't work as well for these types of things. Here you basically
>>>> have just two recourses: 1) do LSA (in terms of SSVD, that means --pca
>>>> false, taking the U output for the document-topic space), or 2) perhaps do
>>>> sphere projection first and then do dimensionality reduction with
>>>> --pca true. The latter will at least preserve cosine distances as far
>>>> as i can tell. But the standard way to address topical soft clustering
>>>> with text is still LSA. (If that's your goal, within the Mahout realm I
>>>> probably also need to draw your attention to the LDA-CVB method in Mahout;
>>>> various researchers say LDA actually does a better job of finding topic
>>>> mixtures.)
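Both halves of that argument can be illustrated with a few lines of numpy (data here is synthetic): recentring generally changes cosine similarity between rows, but after sphere projection (scaling each row to unit length), cosine distance becomes a fixed function of Euclidean distance, which PCA (a translation plus rotation) does preserve:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((30, 8)) + 0.5  # positive data, like term frequencies

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Recentring changes cosine similarity between rows
Ac = A - A.mean(axis=0)
print(np.isclose(cosine(A[0], A[1]), cosine(Ac[0], Ac[1])))  # typically False

# Sphere projection: scale each row to unit length first
S = A / np.linalg.norm(A, axis=1, keepdims=True)
# On the unit sphere, ||x - y||^2 = 2 - 2*cos(x, y), so anything that
# preserves Euclidean distances (like PCA's recentring + rotation)
# also preserves cosine distances between the projected rows.
lhs = np.linalg.norm(S[0] - S[1]) ** 2
rhs = 2 - 2 * cosine(S[0], S[1])
print(np.isclose(lhs, rhs))  # True
```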
>>>>
>>>> On Mon, Oct 22, 2012 at 7:29 AM, Matt Molek <[email protected]> wrote:
>>>>> I've done some more testing and submitted a JIRA:
>>>>> https://issues.apache.org/jira/browse/MAHOUT-1103
>>>>>
>>>>> On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek <[email protected]> wrote:
>>>>>> Thanks for the quick response!
>>>>>>
>>>>>> I will do some testing tomorrow with various numbers of clusters and
>>>>>> create a JIRA once I have those results. I might be able to contribute
>>>>>> a patch for this if I have the time.
>>>>>>
>>>>>> On Sat, Oct 20, 2012 at 4:24 PM, paritosh ranjan
>>>>>> <[email protected]> wrote:
>>>>>>> "So if that's correct, is that what's happening to me? Half of my
>>>>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>>>> directory, especially not 50+% of them."
>>>>>>>
>>>>>>> This might be correct. I think this can occur if the number of clusters is
>>>>>>> large, and the testing was not done with so many clusters.
>>>>>>> Can you help a bit in testing some scenarios?
>>>>>>>
>>>>>>> a) Try reducing the number of clusters to 100 and then 50. The goal is to
>>>>>>> find the breaking point (number of clusters) after which the clusters start
>>>>>>> converging. If this is found, then we would be sure that the problem lies
>>>>>>> in the partitioner.
>>>>>>> b) If you want, try using a different partitioner. The idea is to create
>>>>>>> as many reducer tasks as the number of (non-empty) clusters found, so
>>>>>>> that the vectors present in each cluster are in a separate file and later
>>>>>>> are moved to their respective directories (named on cluster id).
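The partitioner behavior being discussed can be sketched in a few lines. Hadoop's default HashPartitioner assigns a record to reducer `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, so with as many reducers as clusters, several cluster IDs will collide on one reducer and leave others with no input at all. The VL-* IDs below are made up for illustration; only the hashing scheme mirrors Hadoop's:

```python
def java_string_hash(s):
    """Replicate Java's String.hashCode() (signed 32-bit arithmetic)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def hash_partition(key, num_reducers):
    """Hadoop HashPartitioner: (hashCode & Integer.MAX_VALUE) % reducers."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

num_reducers = 300
cluster_ids = ["VL-%d" % (3740844 + 97 * i) for i in range(300)]
used = {hash_partition(c, num_reducers) for c in cluster_ids}

# Collisions mean far fewer than 300 reducers receive records;
# the rest write the empty part-r-* files seen in "bottom".
print(len(used))
```

This matches the symptom in the thread: roughly half the reducers receiving nothing is entirely plausible with 300 keys hashed into 300 buckets.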
>>>>>>>
>>>>>>> Please also create a JIRA for this.
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT.
>>>>>>> And if you are interested, this would be a good starting point to
>>>>>>> contribute to Mahout also.
>>>>>>>
>>>>>>> On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek <[email protected]> wrote:
>>>>>>>
>>>>>>>> First off, thank you everyone for your help so far. This mailing list
>>>>>>>> has been a great help getting me up and running with Mahout.
>>>>>>>>
>>>>>>>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>>>>>>>> Then I'm using clusterpp to split the documents up into directories
>>>>>>>> containing the vectors belonging to each cluster. After I perform the
>>>>>>>> clustering, clusterdump shows that each cluster has between ~800 and
>>>>>>>> ~200,000 documents. This isn't a great spread, but the point is that
>>>>>>>> none of the clusters are empty.
>>>>>>>>
>>>>>>>> Here are my commands:
>>>>>>>>
>>>>>>>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>>>>>>>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>>>>>>>> -k 300 -x 15 -cl -ow
>>>>>>>>
>>>>>>>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>>>>>>>>
>>>>>>>> bin/mahout clusterpp -i pca-clusters -o bottom
>>>>>>>>
>>>>>>>>
>>>>>>>> Since none of my clusters are empty, I would expect clusterpp to
>>>>>>>> create 300 directories in "bottom", one for each cluster. Instead,
>>>>>>>> only 147 directories are created. The other 153 outputs are just empty
>>>>>>>> part-r-* files sitting in the "bottom" directory.
>>>>>>>>
>>>>>>>> I haven't found too much information when searching on this issue but
>>>>>>>> I did come across one mailing list post from a while back:
>>>>>>>>
>>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>>>>>>>
>>>>>>>> In that discussion someone said, "If that is the only thing that is
>>>>>>>> contained in the part-r-* file [it had no vectors], then the reducer
>>>>>>>> responsible to write to that part-r-* file did not receive any input
>>>>>>>> records to write to it. This happens because the program uses the
>>>>>>>> default hash partitioner which sometimes maps records belonging to
>>>>>>>> different clusters to a same reducer; thus leaving some reducers
>>>>>>>> without any input records."
>>>>>>>>
>>>>>>>> So if that's correct, is that what's happening to me? Half of my
>>>>>>>> clusters are being sent to the overlapping reducers? That seems like a
>>>>>>>> big issue, making clusterpp pretty much useless for my purposes. I
>>>>>>>> can't have documents randomly being sent to the wrong cluster's
>>>>>>>> directory, especially not 50+% of them.
>>>>>>>>
>>>>>>>> One final detail: I'm not sure if this matters, but the clusters
>>>>>>>> output by kmeans are not numbered 1 to 300. They have an odd looking,
>>>>>>>> nonsequential numbering sequence. The first 5 clusters are:
>>>>>>>> VL-3740844
>>>>>>>> VL-3741044
>>>>>>>> VL-3741140
>>>>>>>> VL-3741161
>>>>>>>> VL-3741235
>>>>>>>>
>>>>>>>> I haven't done much with kmeans before, so I wasn't sure if this was
>>>>>>>> an unexpected behavior or not.
>>>>>>>>
