Hi Dmitriy / Suneel,
You are pointing me to the correct solution. However, I see different
options in the source code downloaded from trunk (mahout-trunk.zip) and in
mahout-examples-0.7-job.jar.
Could you please verify the same at your end?
==>> from mahout-trunk.zip <<==
addOption("uHalfSigma", "uhs", "Compute U * Sigma^0.5",
String.valueOf(false));
*addOption("uSigma", "us", "Compute U * Sigma", String.valueOf(false));*
addOption("computeV", "V", "compute V (true/false)",
String.valueOf(true));
==>> mahout-examples-0.7-job.jar <<==
addOption("uHalfSigma", "uhs", "Compute U as UHat=U x pow(Sigma,0.5)",
String.valueOf(false));
addOption("computeV", "V", "compute V (true/false)",
String.valueOf(true));
addOption("vHalfSigma", "vhs", "compute V as VHat= V x pow(Sigma,0.5)",
String.valueOf(false));
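
For reference, here is a minimal sketch of what I understand the two option
descriptions above to mean: "U * Sigma" scales column j of U by the j-th
singular value, and "U * Sigma^0.5" scales it by the square root of that
value. The class and method names are made up for illustration, the numbers
are invented, and it uses plain arrays rather than Mahout classes, so it is
an assumption about the arithmetic only, not the SSVD implementation.

```java
import java.util.Arrays;

/** Illustration only: plain arrays, not Mahout's matrix classes. */
final class USigmaSketch {

    /**
     * Returns U * Sigma^power, where sigma holds the diagonal of Sigma.
     * power = 1.0 gives U * Sigma; power = 0.5 gives U * Sigma^0.5.
     */
    static double[][] scaleColumns(double[][] u, double[] sigma, double power) {
        double[][] out = new double[u.length][];
        for (int i = 0; i < u.length; i++) {
            if (u[i].length != sigma.length) {
                throw new IllegalArgumentException(
                    "row " + i + " has " + u[i].length
                        + " columns, expected " + sigma.length);
            }
            out[i] = new double[sigma.length];
            for (int j = 0; j < sigma.length; j++) {
                // Multiplying by a diagonal matrix on the right scales column j.
                out[i][j] = u[i][j] * Math.pow(sigma[j], power);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] u = {{0.6, 0.8}, {0.8, -0.6}}; // invented orthonormal columns
        double[] sigma = {4.0, 1.0};              // invented singular values
        System.out.println(Arrays.deepToString(scaleColumns(u, sigma, 1.0)));
        System.out.println(Arrays.deepToString(scaleColumns(u, sigma, 0.5)));
    }
}
```

Each row of the result is one input instance expressed in the reduced
k-dimensional space, which is why the U * Sigma rows (rather than the rows of
U alone) are what would be handed to clustering.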
Thanks,
Rajesh
On Fri, May 24, 2013 at 10:48 PM, Dmitriy Lyubimov <[email protected]> wrote:
> "ssvd -us true ..." should do this. Suneel says it still works on trunk.
>
>
> On Fri, May 24, 2013 at 9:38 AM, Rajesh Nikam <[email protected]>
> wrote:
>
> > Thanks, Dmitriy & Suneel, for the comments. As you suggested, I need to
> > use U * Sigma.
> >
> > That means I need the product of these two matrices.
> >
> > Which Mahout options should I use for this?
> >
> > My other question was how to find out which features are selected in U.
> > On May 24, 2013 8:45 PM, "Suneel Marthi" <[email protected]> wrote:
> >
> > > Rajesh,
> > >
> > > I am working off of trunk and this works fine.
> > >
> > > As Dmitriy says, you do need USigma.
> > >
> > > It would help to paste the entire stacktrace you are seeing with
> > > MatrixColumnMeansJob.
> > >
> > > If you are still seeing an issue, I would suggest that you work off of
> > > trunk.
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Dmitriy Lyubimov <[email protected]>
> > > To: [email protected]
> > > Sent: Friday, May 24, 2013 9:52 AM
> > > Subject: Re: Fwd: Re: convert input for SVD
> > >
> > >
> > > I think the last time I verified this flow was as of
> > > https://issues.apache.org/jira/browse/MAHOUT-1097. It was working then.
> > > I have not looked at it since.
> > > On May 24, 2013 6:42 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
> > >
> > > > Rajesh, you will get more help if you stay on the list.
> > > >
> > > > You do need the U * Sigma output. There is no substitute.
> > > >
> > > > If this option is indeed no longer there, I have no knowledge of it.
> > > > Maybe some committed work broke it, but at the moment I have no time
> > > > to look at it. Obviously it was there at the time the documentation
> > > > was written. I guess you may obtain an earlier snapshot as an interim
> > > > solution if that is indeed the case.
> > > >
> > > > ---------- Forwarded message ----------
> > > > From: "Rajesh Nikam" <[email protected]>
> > > > Date: May 24, 2013 3:20 AM
> > > > Subject: Re: convert input for SVD
> > > > To: <[email protected]>
> > > > Cc:
> > > >
> > > > > Hello Dmitriy,
> > > > >
> > > > > Thanks for reply.
> > > > >
> > > > > I see a similar discussion, with your reply, at the following link:
> > > > >
> > > > > http://www.searchworkings.org/forum/-/message_boards/view_message/517870#_19_message_519704
> > > > >
> > > > > I have the same problem: I need to apply dimensionality reduction
> > > > > and then run a clustering algorithm on the reduced features.
> > > > >
> > > > > It seems the parameters for ssvd have changed from those described
> > > > > in SSVD-CLI.pdf. It no longer shows *-us* as a parameter.
> > > > >
> > > > > I am using mahout-examples-0.7-job.jar
> > > > >
> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> > > > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -pca true
> > > > > -U true -V false *-us true* -ow -q 1
> > > > >
> > > > > Giving the option "*-pca true*" produces this error:
> > > > >
> > > > > at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
> > > > > at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
> > > > >
> > > > > So I removed it.
> > > > >
> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> > > > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -U true -V false
> > > > > *-us true* -ow -q 1
> > > > >
> > > > > With the above command I get: *Unexpected -us* while processing
> > > > > Job-Specific Options.
> > > > >
> > > > > I tried "-U false -V false -uhs true"; it generated only the sigma
> > > > > file, as expected, but no "Usigma".
> > > > >
> > > > > hadoop fs -lsr /user/hadoop/t/PE_EXE/input-set-svd/
> > > > >
> > > > > -rw-r--r-- 2 hadoop supergroup 1712 2013-05-24 15:34
> > > > > /user/hadoop/t/PE_EXE/input-set-svd/sigma
> > > > >
> > > > > Then with *"-U true -V false -uhs true"* the output dir U is created.
> > > > >
> > > > > drwxr-xr-x - hadoop supergroup 0 2013-05-24 15:39
> > > > > /user/hadoop/t/PE_EXE/input-set-svd/U
> > > > > -rw-r--r-- 2 hadoop supergroup 1712 2013-05-24 15:39
> > > > > /user/hadoop/t/PE_EXE/input-set-svd/sigma
> > > > >
> > > > >
> > > > > My problem is how to use these U/V/sigma files as input to
> > > > > canopy/kmeans.
> > > > >
> > > > > How do I identify which important features from U/Sigma are retained
> > > > > in the dimensionality reduction?
> > > > >
> > > > > Thanks in Advance !
> > > > > Rajesh
> > > > >
> > > > >
> > > > > On Fri, May 24, 2013 at 7:01 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000
> > > > > > :
> > > > > >
> > > > > > "In most cases where you might be looking to reduce
> > > > > > dimensionality while retaining variance, you probably need
> > > > > > combination of options -pca true -U false -V false -us true.
> > > > > >
> > > > > > See §3 for details"
> > > > > >
> > > > > >
> > > > > > On Thu, May 23, 2013 at 6:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > >
> > > > > > > Also, for dimensionality reduction it is important, among other
> > > > > > > things, to re-center your input first, which is why you also want
> > > > > > > "-pca true".
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 23, 2013 at 6:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > >
> > > > > > >> Did you specify the -us option? SSVD by default produces only U, V
> > > > > > >> and Sigma, but it can produce more, e.g. U*Sigma, U*sqrt(Sigma)
> > > > > > >> etc., if you ask for it. Alternatively, you can suppress either U
> > > > > > >> or V (you can't suppress sigma, but that doesn't cost anything in
> > > > > > >> space anyway).
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 23, 2013 at 6:20 PM, Rajesh Nikam <[email protected]> wrote:
> > > > > > >>
> > > > > > >>> I got all three (U, V and sigma) from ssvd; however, which should
> > > > > > >>> I use as input to canopy?
> > > > > > >>> On May 24, 2013 6:47 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
> > > > > > >>>
> > > > > > >>> > I think you want U*Sigma
> > > > > > >>> >
> > > > > > >>> > What you want is ssvd ... -pca true ... -us true ... see the
> > > > > > >>> > manual
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > On Thu, May 23, 2013 at 6:07 PM, Rajesh Nikam <[email protected]> wrote:
> > > > > > >>> >
> > > > > > >>> > > Sorry for the confusion. Here the number of clusters is decided
> > > > > > >>> > > by canopy. The data as it is has 60 to 70 clusters.
> > > > > > >>> > >
> > > > > > >>> > > My question is: which part of the ssvd output (U, V, Sigma)
> > > > > > >>> > > should be used as input to canopy?
> > > > > > >>> > > On May 24, 2013 3:56 AM, "Ted Dunning" <[email protected]> wrote:
> > > > > > >>> > >
> > > > > > >>> > > > Rajesh,
> > > > > > >>> > > >
> > > > > > >>> > > > This is very confusing.
> > > > > > >>> > > >
> > > > > > >>> > > > You have 1500 things that you are clustering into more than
> > > > > > >>> > > > 1400 clusters.
> > > > > > >>> > > >
> > > > > > >>> > > > There is no way for most of these clusters to have >1 member,
> > > > > > >>> > > > simply because there aren't enough items compared to the
> > > > > > >>> > > > clusters.
> > > > > > >>> > > >
> > > > > > >>> > > > Is there a typo here?
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > > > On Thu, May 23, 2013 at 5:34 AM, Rajesh Nikam <[email protected]> wrote:
> > > > > > >>> > > >
> > > > > > >>> > > > > Hi,
> > > > > > >>> > > > >
> > > > > > >>> > > > > I have an input test set of 1500 instances with 1000+
> > > > > > >>> > > > > features. I want to use SVD to reduce the features. I
> > > > > > >>> > > > > followed the steps below, which generate 1400+ clusters;
> > > > > > >>> > > > > 99% of the clusters contain 1 instance :(
> > > > > > >>> > > > >
> > > > > > >>> > > > > Please let me know what is wrong in the steps below -
> > > > > > >>> > > > >
> > > > > > >>> > > > >
> > > > > > >>> > > > > mahout arff.vector --input /mnt/cluster/t/input-set.arff --output
> > > > > > >>> > > > > /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict
> > > > > > >>> > > > >
> > > > > > >>> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> > > > > > >>> > > > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow
> > > > > > >>> > > > >
> > > > > > >>> > > > > mahout canopy -i */user/hadoop/t/input-set-svd/U* -o
> > > > > > >>> > > > > /user/hadoop/t/input-set-canopy-centroids -dm
> > > > > > >>> > > > > org.apache.mahout.common.distance.TanimotoDistanceMeasure
> > > > > > >>> > > > > *-t1 0.001 -t2 0.002*
> > > > > > >>> > > > >
> > > > > > >>> > > > > mahout kmeans -i */user/hadoop/t/input-set-svd/U* -c
> > > > > > >>> > > > > /user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o
> > > > > > >>> > > > > /user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm
> > > > > > >>> > > > > org.apache.mahout.common.distance.TanimotoDistanceMeasure
> > > > > > >>> > > > >
> > > > > > >>> > > > > mahout clusterdump -dt sequencefile -i
> > > > > > >>> > > > > /user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/
> > > > > > >>> > > > > -n 20 -b 100 -o /mnt/cluster/t/cdump-input-set.txt -p
> > > > > > >>> > > > > /user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/
> > > > > > >>> > > > > --evaluate
> > > > > > >>> > > > >
> > > > > > >>> > > > > Thanks in advance !
> > > > > > >>> > > > >
> > > > > > >>> > > > > Rajesh
> > > > > > >>> > > > >
> > > > > > >>> > > > >
> > > > > > >>> > > > >
> > > > > > >>> > > > >
> > > > > > >>> > > > > On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > >>> > > > >
> > > > > > >>> > > > > > PPS: As for the tool for arff, I am frankly not sure, but
> > > > > > >>> > > > > > it sounds like you've already solved this.
> > > > > > >>> > > > > >
> > > > > > >>> > > > > >
> > > > > > >>> > > > > > On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > >>> > > > > >
> > > > > > >>> > > > > > > PS: as far as the U, V data being "close to zero", yes,
> > > > > > >>> > > > > > > that's what you'd expect.
> > > > > > >>> > > > > > >
> > > > > > >>> > > > > > > Here "close to zero" still means much bigger than a
> > > > > > >>> > > > > > > rounding error, of course. E.g. 1E-12 is indeed a small
> > > > > > >>> > > > > > > number, and 1E-16 to 1E-18 would indeed be "close to
> > > > > > >>> > > > > > > zero" for the purposes of singularity. 1E-2..1E-5 are
> > > > > > >>> > > > > > > actually quite "sizeable" numbers on the scale of IEEE
> > > > > > >>> > > > > > > 754 arithmetic.
> > > > > > >>> > > > > > >
> > > > > > >>> > > > > > > U and V are orthonormal (which means their column
> > > > > > >>> > > > > > > vectors have a Euclidean norm of 1). Note that for large
> > > > > > >>> > > > > > > m and n (large inputs) they are also extremely skinny.
> > > > > > >>> > > > > > > The larger the input is, the smaller the elements of U
> > > > > > >>> > > > > > > and/or V are going to be.
> > > > > > >>> > > > > > >
> > > > > > >>> > > > > > >
> > > > > > >>> > > > > > >
> > > > > > >>> > > > > > > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > >>> > > > > > >
> > > > > > >>> > > > > > >> Sounds like dimensionality reduction to me. You may
> > > > > > >>> > > > > > >> want to use ssvd -pca
> > > > > > >>> > > > > > >>
> > > > > > >>> > > > > > >> Apologies for brevity. Sent from my Android
> phone.
> > > > > > >>> > > > > > >> -Dmitriy
> > > > > > >>> > > > > > >> On May 21, 2013 6:27 AM, "Rajesh Nikam" <[email protected]> wrote:
> > > > > > >>> > > > > > >>
> > > > > > >>> > > > > > >>> Hello Ted,
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> Thanks for reply.
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> I started exploring SVD because it was mentioned that
> > > > > > >>> > > > > > >>> it could help to drop features which are not relevant
> > > > > > >>> > > > > > >>> for clustering.
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> My objective is to reduce the number of features
> > > > > > >>> > > > > > >>> before passing them to clustering, and keep just the
> > > > > > >>> > > > > > >>> important features.
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> arff/csv ==> ssvd (for dimensionality reduction) ==> clustering
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> Could you please illustrate which mahout props to use
> > > > > > >>> > > > > > >>> to join the above pipeline?
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> I think Lanczos SVD needs to be used for an m x m
> > > > > > >>> > > > > > >>> matrix.
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> I have tried ssvd: I used arff.vector to convert
> > > > > > >>> > > > > > >>> arff/csv to a vector file, which is then given as
> > > > > > >>> > > > > > >>> input to ssvd, and then dumped U, V and sigma using
> > > > > > >>> > > > > > >>> vectordump.
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> I see that most of the dumped values are near 0. I
> > > > > > >>> > > > > > >>> don't understand whether this is correct or not.
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> Thanks,
> > > > > > >>> > > > > > >>> Rajesh
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <[email protected]> wrote:
> > > > > > >>> > > > > > >>>
> > > > > > >>> > > > > > >>> > Are you using Lanczos instead of SSVD for a reason?
> > > > > > >>> > > > > > >>> >
> > > > > > >>> > > > > > >>> >
> > > > > > >>> > > > > > >>> >
> > > > > > >>> > > > > > >>> >
> > > > > > >>> > > > > > >>> > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <[email protected]> wrote:
> > > > > > >>> > > > > > >>> >
> > > > > > >>> > > > > > >>> > > Hello,
> > > > > > >>> > > > > > >>> > >
> > > > > > >>> > > > > > >>> > > I have an arff / csv file containing input data
> > > > > > >>> > > > > > >>> > > that I want to pass to svd: Lanczos Singular Value
> > > > > > >>> > > > > > >>> > > Decomposition.
> > > > > > >>> > > > > > >>> > >
> > > > > > >>> > > > > > >>> > > Which tool should I use to convert it to the
> > > > > > >>> > > > > > >>> > > required format?
> > > > > > >>> > > > > > >>> > >
> > > > > > >>> > > > > > >>> > > Thanks in Advance !
> > > > > > >>> > > > > > >>> > >
> > > > > > >>> > > > > > >>> > > Thanks,
> > > > > > >>> > > > > > >>> > > Rajesh
> > > > > > >>> > > > > > >>> > >