Re: How to SSVD output to generate Clusters

Ted Dunning Thu, 01 Aug 2013 08:10:24 -0700

Spectral clustering was originally motivated by graphs.

But the idea of clustering the reduced-dimension form of a matrix simply
depends on the fact[1] that the metric is approximately preserved by the
reduced form, so it is applicable to any matrix.
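
As a rough illustration of that distance-preservation claim, here is a small
sketch in plain Python. It is a stand-in under stated assumptions: it uses a
plain Gaussian random projection (the textbook Johnson-Lindenstrauss setting),
not Mahout's SSVD, and the data, sizes, seeds, and function names are all made
up for the example. It compares pairwise Euclidean distances before and after
the reduction.

```python
import math
import random


def random_projection(rows, k, seed=0):
    """Project each row (length d) down to k dimensions.

    Assumption: a dense Gaussian matrix scaled by 1/sqrt(k), the usual
    Johnson-Lindenstrauss construction. This is NOT what SSVD outputs.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)]
            for _ in range(d)]
    return [[sum(x * proj[i][j] for i, x in enumerate(row)) for j in range(k)]
            for row in rows]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_ratios(rows, reduced):
    """Ratio (reduced distance / original distance) for every pair of rows."""
    ratios = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            ratios.append(dist(reduced[i], reduced[j]) / dist(rows[i], rows[j]))
    return ratios


# Hypothetical data: 20 random rows in 500 dimensions, reduced to 100.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(500)] for _ in range(20)]
reduced = random_projection(rows, k=100)
ratios = distance_ratios(rows, reduced)
print(f"min ratio {min(ratios):.3f}, max ratio {max(ratios):.3f}")
```

The ratios should stay reasonably close to 1, which is why a distance-based
clusterer such as k-means can be run on the reduced rows instead of the
original ones.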



[1] Johnson-Lindenstrauss yet again.
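
On the question quoted below about getting from a list of feature vectors to
an affinity matrix: one common construction is a Gaussian kernel on pairwise
distances. This is an assumption for illustration, not something the Mahout
implementation is confirmed to do. A minimal sketch in plain Python, with a
hypothetical function name and a bandwidth sigma that would need tuning on
real data:

```python
import math


def gaussian_affinity(vectors, sigma=1.0):
    """Build a symmetric affinity matrix from feature vectors.

    A[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for i != j.
    The diagonal is left at 0, a convention some spectral clustering
    formulations use (an assumption here, not a requirement).
    """
    n = len(vectors)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
            affinity[i][j] = weight
            affinity[j][i] = weight
    return affinity


# Hypothetical data: two nearby points and one far away.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
A = gaussian_affinity(points, sigma=1.0)
for row in A:
    print(["%.4f" % value for value in row])
```

Nearby points get an affinity close to 1 and distant points close to 0. The
full matrix has one entry per pair of rows, so for large datasets a sparse
variant (for example, keeping only each row's nearest neighbours) is usually
needed.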


On Thu, Aug 1, 2013 at 6:22 AM, Chirag Lakhani <[email protected]> wrote:

> Maybe someone can clarify this issue but the spectral clustering
> implementation assumes an affinity graph, am I correct?  Are there direct
> ways of going from a list of feature vectors to an affinity matrix in order
> to then implement spectral clustering?
>
>
> On Thu, Aug 1, 2013 at 8:49 AM, Stuti Awasthi <[email protected]>
> wrote:
>
> > Thanks Ted, Dmitriy
> >
> > I'll check Spectral Clustering as well as the PCA option, but first I
> > want to run the normal approach once.
> >
> > Here is what I am doing with Mahout 0.7:
> > 1. seqdirectory :
> >  ~/mahout-distribution-0.7/bin/mahout seqdirectory -i
> > /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq
> >
> > 2. seq2sparse
> > ~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq
> > -o /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70
> >
> > 3. ssvd
> > ~/mahout-distribution-0.7/bin/mahout ssvd -i
> > /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true
> > -V true --reduceTasks 1
> >
> > 4. kmeans: with U as input
> > ~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c
> > /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm
> > org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl
> > -k 10
> >
> > 5. Clusterdump
> > ~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i
> > /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d
> > /stuti/SSVD/data-vectors/dictionary.file-* -o
> > ~/ClusterOutput/SSVD/KMeans_10 -p
> > /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV
> >
> > Output :
> > Normally, if I use Clusterdump with the CSV option, I receive the ClusterId
> > and the associated document names, but this time I'm getting output like:
> >
> > 120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_,
> > .......
> >
> > I think the problem is caused by NamedVector; after some searching I
> > found this Jira: https://issues.apache.org/jira/browse/MAHOUT-1067
> >
> > My queries:
> > 1. Is the process I am following correct? Should U be fed directly as
> > input to the clustering algorithm?
> >
> > 2. Is the output issue caused by NamedVector? If so, will it be resolved
> > if I use Mahout 0.8?
> >
> > 3. I'm confused about the "-k" parameter in SSVD versus "-k" in
> > clustering (KMeans). How do they differ? In clustering, -k is the
> > number of clusters to create. What is the purpose of -k (rank) in SSVD?
> > (My apologies, but I am having trouble grasping the SSVD algorithm. The
> > concept of rank is not clear to me.)
> >
> > 4. If I set -k = 100 in SSVD, will I still be able to create, say, 10
> > clusters from this data?
> >
> > Thanks
> > Stuti Awasthi
> >
> > -----Original Message-----
> > From: Dmitriy Lyubimov [mailto:[email protected]]
> > Sent: Wednesday, July 31, 2013 11:15 PM
> > To: [email protected]
> > Subject: Re: How to SSVD output to generate Clusters
> >
> > Many people also use the PCA workflow with SSVD and then try to cluster
> > the output U*Sigma, which is a dimensionally reduced representation of
> > the original row-wise dataset. To enable PCA and U*Sigma output, use
> >
> > ssvd -pca true -us true -u false -v false -k=... -q=1 ...
> >
> > -q=1 is recommended for accuracy.
> >
> >
> >
> > On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi <[email protected]>
> > wrote:
> >
> > > Hi All,
> > >
> > > I want to group together documents that share the same context but
> > > belong to a single domain. I have tried KMeans and LDA as provided in
> > > Mahout to perform the clustering, but the groups that are generated
> > > are not very good. Hence I thought of using LSA to identify the context
> > > related to each word and then performing the clustering.
> > >
> > > I am able to run Mahout's SSVD, which generated 3 files as output:
> > > Sigma, U, and V.
> > > I am not sure how to feed the output of SSVD to the clustering
> > > algorithm so that we can generate clusters of documents that might be
> > > talking about the same context.
> > >
> > > Any pointers on how I can achieve this?
> > >
> > > Regards
> > > Stuti Awasthi
> > >
>
>
>
> --
>
> *Chirag Lakhani*
>
> Data Scientist
>
> Zaloni, Inc. | www.zaloni.com
>
> 633 Davis Dr., Suite 200
>
> Durham, NC 27713
> e: [email protected]
> p: 919.602.4965 x7020
>
