RE: How to SSVD output to generate Clusters

Stuti Awasthi Thu, 01 Aug 2013 05:51:21 -0700

Thanks Ted, Dmitriy

I'll check the Spectral Clustering as well as the PCA option, but first I want 
to run the normal approach once.


Here is what I am doing with Mahout 0.7:
1. seqdirectory:
~/mahout-distribution-0.7/bin/mahout seqdirectory \
  -i /stuti/SSVD/ClusteringInput \
  -o /stuti/SSVD/data-seq

2. seq2sparse:
~/mahout-distribution-0.7/bin/mahout seq2sparse \
  -i /stuti/SSVD/data-seq \
  -o /stuti/SSVD/data-vectors \
  -s 5 -ml 50 -nv -ng 3 -n 2 -x 70

3. ssvd:
~/mahout-distribution-0.7/bin/mahout ssvd \
  -i /stuti/SSVD/data-vectors/tf-vectors \
  -o /stuti/SSVD/Output \
  -k 10 -U true -V true --reduceTasks 1

4. kmeans, with U as input:
~/mahout-distribution-0.7/bin/mahout kmeans \
  -i /stuti/SSVD/Output/U \
  -c /stuti/intial-centroids \
  -o /stuti/SSVD/Cluster/kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 -x 20 -cl -k 10

5. clusterdump:
~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile \
  -i /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final \
  -d /stuti/SSVD/data-vectors/dictionary.file-* \
  -o ~/ClusterOutput/SSVD/KMeans_10 \
  -p /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints \
  -n 10 -b 200 -of CSV

Output:
Normally, when I use clusterdump with the CSV option, I receive the cluster ID 
and the associated document names, but this time I am getting output like:

120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_,
 .......
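
For reading lines like the one above, here is a minimal Python sketch. It 
assumes, based only on the visible sample, that the first comma-separated 
field is the cluster ID and that each following field is one point written as 
_index_value_index_value_... The function name parse_csv_line is hypothetical 
and is not part of Mahout.

```python
def parse_csv_line(line):
    """Split one CSV line into (cluster_id, list of {index: value} dicts).

    Assumed layout (inferred from the sample only):
    <clusterId>,_<i>_<v>_<i>_<v>_...,<next point>,...
    """
    # Drop the empty field produced by the trailing comma.
    fields = [f for f in line.strip().split(",") if f]
    cluster_id = fields[0]
    points = []
    for field in fields[1:]:
        # Tokens alternate: index, value, index, value, ...
        tokens = field.strip("_").split("_")
        point = {
            int(tokens[i]): float(tokens[i + 1])
            for i in range(0, len(tokens) - 1, 2)
        }
        points.append(point)
    return cluster_id, points


# The sample line from this message.
sample = (
    "120,_0_-0.06453357851086772_1_-0.11705342976172932"
    "_2_0.04432960668756471_3_0.10046604725589514"
    "_4_-0.06602768838676538_5_-0.16253383395031692"
    "_6_-0.0042184763959784155_7_0.03321981657725734"
    "_8_-0.04904708660966478_9_0.015635264416337353_,"
)
cluster_id, points = parse_csv_line(sample)
print(cluster_id, len(points), len(points[0]))
```

With -k 10 in the ssvd step, each point carries 10 components (indices 0 to 
9), which matches the sample: the point is identified by its vector contents 
rather than by a document name.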

I think the problem is caused by NamedVector, because after some searching I 
found this Jira: https://issues.apache.org/jira/browse/MAHOUT-1067 

My queries:
1. Is the process I am following correct? Should U be fed directly as input 
to the clustering algorithm?

2. Is the output issue caused by NamedVector? If yes, will the issue be 
resolved if I use Mahout 0.8?

3. I am confused between the "-k" parameter in SSVD and the "-k" parameter in 
clustering (KMeans). How are they different? In clustering, -k means the 
number of clusters to create. What is the purpose of -k (rank) in SSVD?
(My apologies, but I am having some trouble grasping the SSVD algorithm. The 
concept of rank is not clear to me.)

4. If I use -k = 100 in SSVD, will I still be able to create, say, 10 
clusters from this data with the clustering step?
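
To make the distinction in queries 3 and 4 concrete: the SSVD rank sets how 
many components each document row of U has, while the k-means -k sets how 
many groups those rows are split into, so the two values are independent. 
The sketch below is only an illustration under stated assumptions: the rows 
are random stand-ins for U (no SSVD is computed), num_docs is a made-up 
value, and the kmeans function is a plain Lloyd's iteration with squared 
Euclidean distance, not Mahout's implementation and not the 
CosineDistanceMeasure used in the commands above.

```python
import random


def kmeans(points, num_clusters, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    # Start from randomly chosen input points as centroids.
    centroids = [list(p) for p in rng.sample(points, num_clusters)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(num_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


rank = 100         # plays the role of ssvd -k: components per document row
num_clusters = 10  # plays the role of kmeans -k: number of groups
num_docs = 200     # hypothetical corpus size

rng = random.Random(42)
# Random stand-in for the rows of U (one row per document).
rows = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(num_docs)]
labels = kmeans(rows, num_clusters)
print(len(rows[0]), "components per row;", len(labels), "labels assigned")
```

Each document keeps its 100-component row and receives one label in the 
range 0 to 9, so a rank of 100 does not prevent asking for 10 clusters.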

Thanks
Stuti Awasthi

-----Original Message-----
From: Dmitriy Lyubimov [mailto:[email protected]] 
Sent: Wednesday, July 31, 2013 11:15 PM
To: [email protected]
Subject: Re: How to SSVD output to generate Clusters

Many people also use the PCA workflow with SSVD and then try to cluster the 
output U*Sigma, which is a dimensionally reduced representation of the 
original row-wise dataset. To enable PCA and U*Sigma output, use

ssvd -pca true -us true -u false -v false -k=... -q=1 ...

-q=1 is recommended for accuracy.
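
As a rough illustration of what U*Sigma means, the sketch below multiplies 
component j of every row of U by the j-th singular value. The numbers are 
hypothetical toy values chosen for the example, and scale_by_sigma is a 
made-up helper name; nothing here is Mahout code or output.

```python
import math


def scale_by_sigma(u_rows, sigma):
    """Return U * diag(sigma): component j of each row is scaled by sigma[j]."""
    return [[u_ij * s_j for u_ij, s_j in zip(row, sigma)] for row in u_rows]


# Hypothetical toy values: two document rows and two singular values.
u_rows = [[0.5, -0.5], [0.5, 0.5]]
sigma = [10.0, 0.1]

us_rows = scale_by_sigma(u_rows, sigma)

# The two rows differ only in the second component, which belongs to the
# small singular value, so they end up much closer together after scaling.
dist_u = math.dist(u_rows[0], u_rows[1])
dist_us = math.dist(us_rows[0], us_rows[1])
print(dist_u, dist_us)
```

The scaling matters for distance-based clustering: in U alone every 
direction counts equally, while in U*Sigma a direction with a small singular 
value contributes little to the distance between two documents.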



On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi <[email protected]> wrote:

> Hi All,
>
> I want to group together documents that share the same context, where 
> all documents belong to one single domain. I have tried the KMeans and 
> LDA implementations provided in Mahout to perform the clustering, but 
> the groups they generate are not very good. Hence I thought of using 
> LSA to identify the context related to each word and then performing 
> the clustering.
>
> I am able to run Mahout's SSVD, which generated 3 files as output: 
> Sigma, U, and V.
> I am not sure how to feed the output of SSVD to the clustering 
> algorithm so that we can generate clusters of the documents that 
> might be talking about the same context.
>
> Any pointers on how I can achieve this?
>
> Regards
> Stuti Awasthi
