Re: Fwd: Re: convert input for SVD

Rajesh Nikam Fri, 24 May 2013 09:39:16 -0700

Thanks Dmitriy & Suneel for the comments. As you suggested, I need to use U *
Sigma.

That means I need to get the product of these matrices.

Which Mahout props should I use for this?

My other question was how to get the features that are selected in U.
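
For reference, a minimal sketch of the multiplication itself, assuming U has
already been read into memory as a list of rows and Sigma as a list of
singular values. The function name u_times_sigma and the sample numbers are
hypothetical; they are not Mahout APIs or Mahout output. Because Sigma is
diagonal, U * Sigma only scales column j of U by the j-th singular value.

```python
# Sketch only: scale each column of U by the matching singular value.
# The values below are made up for illustration, not taken from ssvd output.

def u_times_sigma(u_rows, sigma):
    """Return U * diag(sigma): column j of U multiplied by sigma[j]."""
    for row in u_rows:
        if len(row) != len(sigma):
            raise ValueError("each row of U needs one entry per singular value")
    return [[value * s for value, s in zip(row, sigma)] for row in u_rows]


if __name__ == "__main__":
    u = [[0.6, 0.8], [0.8, -0.6]]   # hypothetical 2 x 2 orthonormal U
    sigma = [10.0, 2.0]             # hypothetical singular values
    for row in u_times_sigma(u, sigma):
        print(row)
```

Each resulting row is the reduced representation of one input row. Its
columns are combinations of the original features, not a subset of them.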

On May 24, 2013 8:45 PM, "Suneel Marthi" <[email protected]> wrote:

> Rajesh,
>
> I am working off of trunk and this works fine.
>
> As Dmitriy says u do need USigma.
>
> It would help to paste the entire stacktrace you are seeing with
> MatrixColumnMeansJob.
>
> If you are still seeing an issue, I would suggest that you work off of
> trunk.
>
>
>
>
> ________________________________
>  From: Dmitriy Lyubimov <[email protected]>
> To: [email protected]
> Sent: Friday, May 24, 2013 9:52 AM
> Subject: Re: Fwd: Re: convert input for SVD
>
>
> I think last time i verified this flow was as of
> https://issues.apache.org/jira/browse/MAHOUT-1097. It was working then. Did
> not look at it since.
> On May 24, 2013 6:42 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
>
> > Rajesh, you will get more help if you stay on the list.
> >
> > you do need u *sigma output. there is no substitute.
> >
> > If this option is indeed no longer there, i have no knowledge of it. Maybe
> > there was some work committed that screwed that but at the moment i have
> > no time to look at it. Obviously it was there at the time documentation was
> > written. I guess you may obtain an earlier snapshot as interim solution if
> > it is indeed the case.
> >
> > ---------- Forwarded message ----------
> > From: "Rajesh Nikam" <[email protected]>
> > Date: May 24, 2013 3:20 AM
> > Subject: Re: convert input for SVD
> > To: <[email protected]>
> > Cc:
> >
> > > Hello Dmitriy,
> > >
> > > Thanks for reply.
> > >
> > > I see a similar discussion at the following link, where I see your reply.
> > >
> > > http://www.searchworkings.org/forum/-/message_boards/view_message/517870#_19_message_519704
> > >
> > > I also have the same problem: I need to apply dimensionality reduction and
> > > use a clustering algo on the reduced features.
> > >
> > > It seems the parameters for ssvd have changed from those mentioned in
> > > SSVD-CLI.pdf. It no longer shows *-us* as a parameter.
> > >
> > > I am using mahout-examples-0.7-job.jar
> > >
> > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -pca true -U true -V
> > > false *-us true* -ow -q 1
> > >
> > > giving option as "*-pca true*" gives error as
> > >
> > > at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
> > >         at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
> > >
> > > So I removed it.
> > >
> > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -U true -V false *-us
> > > true* -ow -q 1
> > >
> > > With the above command I get: *Unexpected -us* while processing Job-Specific
> > > Options.
> > >
> > > I tried with "-U false -V false -uhs true"; it generated just the sigma file,
> > > as expected, but no "Usigma".
> > >
> > > hadoop fs -lsr /user/hadoop/t/PE_EXE/input-set-svd/
> > >
> > > -rw-r--r--   2 hadoop supergroup       1712 2013-05-24 15:34
> > > /user/hadoop/t/PE_EXE/input-set-svd/sigma
> > >
> > > Then with *"-U true -V false -uhs true"* output dir U is created.
> > >
> > > drwxr-xr-x   - hadoop supergroup          0 2013-05-24 15:39
> > > /user/hadoop/t/PE_EXE/input-set-svd/U
> > > -rw-r--r--   2 hadoop supergroup       1712 2013-05-24 15:39
> > > /user/hadoop/t/PE_EXE/input-set-svd/sigma
> > >
> > > My problem is how to use these U/V/sigma files as input to canopy/kmeans?
> > >
> > > How do I identify which important features from U/Sigma are retained in
> > > dimensionality reduction?
> > >
> > > Thanks in Advance !
> > > Rajesh
> > >
> > >
> > > On Fri, May 24, 2013 at 7:01 AM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> > >
> > > >
> > > > https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000
> > > > :
> > > >
> > > > "In most cases where you might be looking to reduce
> > > > dimensionality while retaining variance, you probably need combination of
> > > > options -pca true -U false -V
> > > > false -us true.
> > > >
> > > > See §3 for details"
> > > >
> > > >
> > > > On Thu, May 23, 2013 at 6:24 PM, Dmitriy Lyubimov <[email protected]
> >
> > > > wrote:
> > > >
> > > > > > Also, for the dimensionality reduction it is important among other things
> > > > > > to re-center your input first, which is why you also want "-pca true".
> > > > >
> > > > >
> > > > > On Thu, May 23, 2013 at 6:23 PM, Dmitriy Lyubimov <
> [email protected]
> > > > >wrote:
> > > > >
> > > > > >> did you specify -us option? SSVD by default produces only U, V and Sigma.
> > > > > >> but it can produce more, e.g. U*Sigma, U*sqrt(Sigma) etc. if you ask for
> > > > > >> it. And, alternatively, you can suppress any of U, V (you can't suppress
> > > > > >> sigma but that doesn't cost anything in space anyway).
> > > > >>
> > > > >>
> > > > >> On Thu, May 23, 2013 at 6:20 PM, Rajesh Nikam <
> > [email protected]
> > > > >wrote:
> > > > >>
> > > > > >>> I got all three U, V & sigma from ssvd, however which to use as input to
> > > > > >>> canopy?
> > > > >>> On May 24, 2013 6:47 AM, "Dmitriy Lyubimov" <[email protected]>
> > wrote:
> > > > >>>
> > > > >>> > I think you want U*Sigma
> > > > >>> >
> > > > > >>> > What you want is ssvd ... -pca true ... -us true ... see the manual
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> > On Thu, May 23, 2013 at 6:07 PM, Rajesh Nikam <
> > [email protected]
> > > > >
> > > > >>> > wrote:
> > > > >>> >
> > > > > >>> > > Sorry for the confusion. Here the number of clusters is decided by
> > > > > >>> > > canopy. The data as it is has 60 to 70 clusters.
> > > > > >>> > >
> > > > > >>> > > My question is which part of the ssvd output (U, V, Sigma) should be
> > > > > >>> > > used as input to canopy?
> > > > >>> > >  On May 24, 2013 3:56 AM, "Ted Dunning" <
> [email protected]
> > >
> > > > >>> wrote:
> > > > >>> > >
> > > > >>> > > > Rajesh,
> > > > >>> > > >
> > > > >>> > > > This is very confusing.
> > > > >>> > > >
> > > > > >>> > > > You have 1500 things that you are clustering into more than 1400
> > > > > >>> > > > clusters.
> > > > > >>> > > >
> > > > > >>> > > > There is no way for most of these clusters to have >1 member, simply
> > > > > >>> > > > because there aren't enough items compared to the clusters.
> > > > >>> > > >
> > > > >>> > > > Is there a typo here?
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > On Thu, May 23, 2013 at 5:34 AM, Rajesh Nikam <
> > > > >>> [email protected]>
> > > > >>> > > > wrote:
> > > > >>> > > >
> > > > >>> > > > > Hi,
> > > > >>> > > > >
> > > > > >>> > > > > I have an input test set of 1500 instances with 1000+ features. I
> > > > > >>> > > > > want to use SVD to reduce the features. I followed the steps below,
> > > > > >>> > > > > which generate 1400+ clusters; 99% of the clusters contain 1
> > > > > >>> > > > > instance :(
> > > > >>> > > > >
> > > > >>> > > > > Please let me know what is wrong in below steps -
> > > > >>> > > > >
> > > > >>> > > > >
> > > > > >>> > > > > mahout arff.vector --input /mnt/cluster/t/input-set.arff --output
> > > > > >>> > > > > /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict
> > > > > >>> > > > >
> > > > > >>> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> > > > > >>> > > > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow
> > > > > >>> > > > >
> > > > > >>> > > > > mahout canopy -i */user/hadoop/t/input-set-svd/U* -o
> > > > > >>> > > > > /user/hadoop/t/input-set-canopy-centroids -dm
> > > > > >>> > > > > org.apache.mahout.common.distance.TanimotoDistanceMeasure *-t1 0.001 -t2
> > > > > >>> > > > > 0.002*
> > > > > >>> > > > >
> > > > > >>> > > > > mahout kmeans -i */user/hadoop/t/input-set-svd/U* -c
> > > > > >>> > > > > /user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o
> > > > > >>> > > > > /user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm
> > > > > >>> > > > > org.apache.mahout.common.distance.TanimotoDistanceMeasure
> > > > > >>> > > > >
> > > > > >>> > > > > mahout clusterdump -dt sequencefile -i
> > > > > >>> > > > > /user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/ -n 20 -b 100 -o
> > > > > >>> > > > > /mnt/cluster/t/cdump-input-set.txt -p
> > > > > >>> > > > > /user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/ --evaluate
> > > > >>> > > > >
> > > > >>> > > > > Thanks in advance !
> > > > >>> > > > >
> > > > >>> > > > > Rajesh
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <
> > > > >>> [email protected]
> > > > >>> > >
> > > > >>> > > > > wrote:
> > > > >>> > > > >
> > > > > >>> > > > > > PPS As far as the tool for arff, i am frankly not sure. but it sounds
> > > > > >>> > > > > > like you've already solved this.
> > > > >>> > > > > >
> > > > >>> > > > > >
> > > > >>> > > > > > On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <
> > > > >>> > [email protected]
> > > > >>> > > >
> > > > >>> > > > > > wrote:
> > > > >>> > > > > >
> > > > > >>> > > > > > > ps as far as U, V data "close to zero", yes that's what you'd expect.
> > > > > >>> > > > > > >
> > > > > >>> > > > > > > Here, by "close to zero" it still means much bigger than a rounding
> > > > > >>> > > > > > > error of course. e.g. 1E-12 is indeed a small number, and 1E-16 to
> > > > > >>> > > > > > > 1E-18 would be indeed "close to zero" for the purposes of
> > > > > >>> > > > > > > singularity. 1E-2..1E-5 are actually quite "sizeable" numbers by the
> > > > > >>> > > > > > > scale of IEEE 754 arithmetic.
> > > > > >>> > > > > > >
> > > > > >>> > > > > > > U and V are orthonormal (which means their column vectors have
> > > > > >>> > > > > > > Euclidean norm of 1). Note that for large m and n (large inputs)
> > > > > >>> > > > > > > they are also extremely skinny. The larger the input is, the smaller
> > > > > >>> > > > > > > the elements of U or/and V are going to be.
> > > > >>> > > > > > >
> > > > >>> > > > > > >
> > > > >>> > > > > > >
> > > > >>> > > > > > > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <
> > > > >>> > > [email protected]
> > > > >>> > > > > > >wrote:
> > > > >>> > > > > > >
> > > > > >>> > > > > > >> Sounds like dimensionality reduction to me. You may want to use
> > > > > >>> > > > > > >> ssvd -pca
> > > > >>> > > > > > >>
> > > > >>> > > > > > >> Apologies for brevity. Sent from my Android phone.
> > > > >>> > > > > > >> -Dmitriy
> > > > >>> > > > > > >> On May 21, 2013 6:27 AM, "Rajesh Nikam" <
> > > > >>> [email protected]>
> > > > >>> > > > > wrote:
> > > > >>> > > > > > >>
> > > > >>> > > > > > >>> Hello Ted,
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>> Thanks for reply.
> > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> I have started exploring SVD based on its mention that it could help
> > > > > >>> > > > > > >>> to drop features which are not relevant for clustering.
> > > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> My objective is to reduce the number of features before passing them
> > > > > >>> > > > > > >>> to clustering and just keep the important features.
> > > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> arff/csv ==> ssvd (for dimensionality reduction) ==> clustering
> > > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> Could you please illustrate the mahout props to join the above
> > > > > >>> > > > > > >>> pipeline.
> > > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> I think Lanczos SVD needs to be used for an mxm matrix.
> > > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> I have tried ssvd: I used arff.vector to convert arff/csv to a
> > > > > >>> > > > > > >>> vector file, which is then given as input to ssvd, and then dumped
> > > > > >>> > > > > > >>> U, V and sigma using vectordump.
> > > > > >>> > > > > > >>>
> > > > > >>> > > > > > >>> I see most of the values dumped are near to 0. I don't understand
> > > > > >>> > > > > > >>> whether this is correct or not.
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>>
> > > > >>> > > > > >
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> >
> {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>>
> > > > >>> > > > > >
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> >
> {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>> Thanks,
> > > > >>> > > > > > >>> Rajesh
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>> On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <
> > > > >>> > > > [email protected]
> > > > >>> > > > > >
> > > > >>> > > > > > >>> wrote:
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>> > Are you using Lanczos instead of SSVD for a
> reason?
> > > > >>> > > > > > >>> >
> > > > >>> > > > > > >>> >
> > > > >>> > > > > > >>> >
> > > > >>> > > > > > >>> >
> > > > >>> > > > > > >>> > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <
> > > > >>> > > > > [email protected]
> > > > >>> > > > > > >
> > > > >>> > > > > > >>> > wrote:
> > > > >>> > > > > > >>> >
> > > > >>> > > > > > >>> > > Hello,
> > > > >>> > > > > > >>> > >
> > > > > >>> > > > > > >>> > > I have an arff / csv file containing input data that I want to
> > > > > >>> > > > > > >>> > > pass to svd: Lanczos Singular Value Decomposition.
> > > > >>> > > > > > >>> > >
> > > > >>> > > > > > >>> > > Which tool to use to convert it to required
> > format ?
> > > > >>> > > > > > >>> > >
> > > > >>> > > > > > >>> > > Thanks in Advance !
> > > > >>> > > > > > >>> > >
> > > > >>> > > > > > >>> > > Thanks,
> > > > >>> > > > > > >>> > > Rajesh
> > > > >>> > > > > > >>> > >
> > > > >>> > > > > > >>> >
> > > > >>> > > > > > >>>
> > > > >>> > > > > > >>
> > > > >>> > > > > > >
> > > > >>> > > > > >
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> >
