Re: convert input for SVD

Rajesh Nikam Fri, 24 May 2013 03:20:13 -0700

Hello Dmitriy,

Thanks for the reply.


I found a similar discussion at the following link, which includes your reply.

http://www.searchworkings.org/forum/-/message_boards/view_message/517870#_19_message_519704

I have the same problem: I need to apply dimensionality reduction and then
run a clustering algorithm on the reduced features.

It seems the parameters for ssvd have changed from those described in
SSVD-CLI.pdf. It no longer shows -us as a parameter.

I am using mahout-examples-0.7-job.jar

mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
/user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -pca true -U true -V
false *-us true* -ow -q 1

Passing the option "-pca true" gives this error:

at
org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
        at
org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)

So I removed it.

mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
/user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -U true -V false *-us
true* -ow -q 1

With the above command I get: "Unexpected -us while processing Job-Specific
Options."

I then tried "-U false -V false -uhs true". It generated only the sigma file,
as expected, but no "Usigma" output.

hadoop fs -lsr /user/hadoop/t/PE_EXE/input-set-svd/

-rw-r--r--   2 hadoop supergroup       1712 2013-05-24 15:34
/user/hadoop/t/PE_EXE/input-set-svd/sigma

Then with "-U true -V false -uhs true" the output dir U is created.

drwxr-xr-x   - hadoop supergroup          0 2013-05-24 15:39
/user/hadoop/t/PE_EXE/input-set-svd/U
-rw-r--r--   2 hadoop supergroup       1712 2013-05-24 15:39
/user/hadoop/t/PE_EXE/input-set-svd/sigma

My problem is: how do I use these U/V/sigma files as input to canopy/kmeans?

Also, how can I identify which important features are retained in U/Sigma
after dimensionality reduction?
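
To make the question concrete, here is my understanding of what "-pca true"
together with U*Sigma should compute, written as a small NumPy sketch rather
than Mahout code. The function name pca_reduce and the toy data are made up
for illustration, and I am assuming that the re-centering means subtracting
the column means. Please correct me if this is not what ssvd does.

```python
import numpy as np


def pca_reduce(a, k):
    """Return (reduced, sigma, v) for the top-k components of matrix a.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    a = np.asarray(a, dtype=float)
    # Re-center each column (assumed meaning of the PCA option).
    centered = a - a.mean(axis=0)
    # Thin SVD: centered = U * diag(sigma) * V^T
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    u, s, vt = u[:, :k], s[:k], vt[:k, :]
    # U*Sigma: one row per input row, k columns (the reduced features).
    reduced = u * s
    return reduced, s, vt.T


# Toy stand-in for the real input: 15 instances, 6 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(15, 6))

reduced, sigma, v = pca_reduce(data, k=2)
print(reduced.shape)
```

If this is right, each row of reduced corresponds to one input instance and
would be what goes to canopy/kmeans. The columns of v hold the weights of the
original features in each reduced dimension, so the reduced dimensions would
be mixtures of the original features rather than a subset of them.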

Thanks in Advance !
Rajesh


On Fri, May 24, 2013 at 7:01 AM, Dmitriy Lyubimov <[email protected]> wrote:

>
> https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000
> :
>
> "In most cases where you might be looking to reduce
> dimensionality while retaining variance, you probably need combination of
> options -pca true -U false -V
> false -us true.
>
> See §3 for details"
>
>
> On Thu, May 23, 2013 at 6:24 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Also, for the dimensionality reduction it is important among other things
> > to re-center your input first, which is why you also want "-pca true".
> >
> >
> > On Thu, May 23, 2013 at 6:23 PM, Dmitriy Lyubimov <[email protected]
> >wrote:
> >
> >> did you specify -us option? SSVD by default produces only U, V and Sigma.
> >> but it can produce more, e.g. U*Sigma, U*sqrt(Sigma) etc. if you ask for
> >> it. And, alternatively, you can suppress any of U, V (you can't suppress
> >> sigma but that doesn't cost anything in space anyway).
> >>
> >>
> >> On Thu, May 23, 2013 at 6:20 PM, Rajesh Nikam <[email protected]
> >wrote:
> >>
> >>> I got all three U, V & sigma from ssvd, however which to use as input to
> >>> canopy?
> >>> On May 24, 2013 6:47 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
> >>>
> >>> > I think you want U*Sigma
> >>> >
> >>> > What you want is ssvd ... -pca true ... -us true ... see the manual
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Thu, May 23, 2013 at 6:07 PM, Rajesh Nikam <[email protected]
> >
> >>> > wrote:
> >>> >
> >>> > > Sorry for the confusion. Here the number of clusters is decided by
> >>> > > canopy. The data as it stands has 60 to 70 clusters.
> >>> > >
> >>> > > My question is which part of the ssvd output (U, V, Sigma) should be
> >>> > > used as input to canopy?
> >>> > >  On May 24, 2013 3:56 AM, "Ted Dunning" <[email protected]>
> >>> wrote:
> >>> > >
> >>> > > > Rajesh,
> >>> > > >
> >>> > > > This is very confusing.
> >>> > > >
> >>> > > > You have 1500 things that you are clustering into more than 1400
> >>> > > > clusters.
> >>> > > >
> >>> > > > There is no way for most of these clusters to have >1 member just
> >>> > > > because there aren't enough items compared to the clusters.
> >>> > > >
> >>> > > > Is there a typo here?
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > On Thu, May 23, 2013 at 5:34 AM, Rajesh Nikam <
> >>> [email protected]>
> >>> > > > wrote:
> >>> > > >
> >>> > > > > Hi,
> >>> > > > >
> >>> > > > > I have an input test set of 1500 instances with 1000+ features. I
> >>> > > > > want to use SVD to reduce the features. I followed the steps below,
> >>> > > > > which generate 1400+ clusters; 99% of the clusters contain 1
> >>> > > > > instance :(
> >>> > > > >
> >>> > > > > Please let me know what is wrong in below steps -
> >>> > > > >
> >>> > > > >
> >>> > > > > mahout arff.vector --input /mnt/cluster/t/input-set.arff --output
> >>> > > > > /user/hadoop/t/input-set-vector/ --dictOut
> >>> > > > > /mnt/cluster/t/input-set-dict
> >>> > > > >
> >>> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
> >>> > > > > /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow
> >>> > > > >
> >>> > > > > mahout canopy -i */user/hadoop/t/input-set-svd/U* -o
> >>> > > > > /user/hadoop/t/input-set-canopy-centroids -dm
> >>> > > > > org.apache.mahout.common.distance.TanimotoDistanceMeasure
> >>> > > > > *-t1 0.001 -t2 0.002*
> >>> > > > >
> >>> > > > > mahout kmeans -i */user/hadoop/t/input-set-svd/U* -c
> >>> > > > > /user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o
> >>> > > > > /user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm
> >>> > > > > org.apache.mahout.common.distance.TanimotoDistanceMeasure
> >>> > > > >
> >>> > > > > mahout clusterdump -dt sequencefile -i
> >>> > > > > /user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/ -n 20
> >>> > > > > -b 100 -o /mnt/cluster/t/cdump-input-set.txt -p
> >>> > > > > /user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/
> >>> > > > > --evaluate
> >>> > > > >
> >>> > > > > Thanks in advance !
> >>> > > > >
> >>> > > > > Rajesh
> >>> > > > >
> >>> > > > >
> >>> > > > >
> >>> > > > >
> >>> > > > > On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <
> >>> [email protected]
> >>> > >
> >>> > > > > wrote:
> >>> > > > >
> >>> > > > > > PPS As far as the tool for arff, i am frankly not sure. but
> it
> >>> > sounds
> >>> > > > > like
> >>> > > > > > you've already solved this.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <
> >>> > [email protected]
> >>> > > >
> >>> > > > > > wrote:
> >>> > > > > >
> >>> > > > > > > ps as far as U, V data "close to zero", yes that's what
> >>> > > > > > > you'd expect.
> >>> > > > > > >
> >>> > > > > > > Here, by "close to zero" it still means much bigger than a
> >>> > > > > > > rounding error of course. e.g. 1E-12 is indeed a small
> >>> > > > > > > number, and 1E-16 to 1E-18 would be indeed "close to zero"
> >>> > > > > > > for the purposes of singularity. 1E-2..1E-5 are actually
> >>> > > > > > > quite "sizeable" numbers by the scale of IEEE 754 arithmetic.
> >>> > > > > > >
> >>> > > > > > > U and V are orthonormal (which means their column vectors
> >>> > > > > > > have Euclidean norm of 1). Note that for large m and n
> >>> > > > > > > (large inputs) they are also extremely skinny. The larger
> >>> > > > > > > the input is, the smaller the elements of U or/and V are
> >>> > > > > > > gonna be.
> >>> > > > > > >
> >>> > > > > > >
> >>> > > > > > >
> >>> > > > > > > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <
> >>> > > [email protected]
> >>> > > > > > >wrote:
> >>> > > > > > >
> >>> > > > > > >> Sounds like dimensionality reduction to me. You may want
> to
> >>> use
> >>> > > ssvd
> >>> > > > > > -pca
> >>> > > > > > >>
> >>> > > > > > >> Apologies for brevity. Sent from my Android phone.
> >>> > > > > > >> -Dmitriy
> >>> > > > > > >> On May 21, 2013 6:27 AM, "Rajesh Nikam" <
> >>> [email protected]>
> >>> > > > > wrote:
> >>> > > > > > >>
> >>> > > > > > >>> Hello Ted,
> >>> > > > > > >>>
> >>> > > > > > >>> Thanks for reply.
> >>> > > > > > >>>
> >>> > > > > > >>> I have started exploring SVD because it was mentioned
> >>> > > > > > >>> that it could help to drop features which are not
> >>> > > > > > >>> relevant for clustering.
> >>> > > > > > >>>
> >>> > > > > > >>> My objective is to reduce the number of features before
> >>> > > > > > >>> passing them to clustering, and just keep the important
> >>> > > > > > >>> features.
> >>> > > > > > >>>
> >>> > > > > > >>> arff/csv ==> ssvd (for dimensionality reduction) ==>
> >>> > > > > > >>> clustering
> >>> > > > > > >>>
> >>> > > > > > >>> Could you please illustrate the mahout props needed to
> >>> > > > > > >>> join the above pipeline.
> >>> > > > > > >>>
> >>> > > > > > >>> I think Lanczos SVD needs to be used for an mxm matrix.
> >>> > > > > > >>>
> >>> > > > > > >>> I have tried ssvd. I used arff.vector to convert
> >>> > > > > > >>> arff/csv to a vector file, which is then given as input
> >>> > > > > > >>> to ssvd, and then dumped U, V and sigma using vectordump.
> >>> > > > > > >>>
> >>> > > > > > >>> I see most of the values dumped are near 0. I don't
> >>> > > > > > >>> understand whether this is correct or not.
> >>> > > > > > >>>
> >>> > > > > > >>>
> >>> > > > > > >>>
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
> >>> > > > > > >>>
> >>> > > > > > >>>
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
> >>> > > > > > >>>
> >>> > > > > > >>> Thanks,
> >>> > > > > > >>> Rajesh
> >>> > > > > > >>>
> >>> > > > > > >>>
> >>> > > > > > >>>
> >>> > > > > > >>> On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <
> >>> > > > [email protected]
> >>> > > > > >
> >>> > > > > > >>> wrote:
> >>> > > > > > >>>
> >>> > > > > > >>> > Are you using Lanczos instead of SSVD for a reason?
> >>> > > > > > >>> >
> >>> > > > > > >>> >
> >>> > > > > > >>> >
> >>> > > > > > >>> >
> >>> > > > > > >>> > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <
> >>> > > > > [email protected]
> >>> > > > > > >
> >>> > > > > > >>> > wrote:
> >>> > > > > > >>> >
> >>> > > > > > >>> > > Hello,
> >>> > > > > > >>> > >
> >>> > > > > > >>> > > I have arff / csv file containing input data that I
> >>> want to
> >>> > > > pass
> >>> > > > > to
> >>> > > > > > >>> svd :
> >>> > > > > > >>> > > Lanczos Singular Value Decomposition.
> >>> > > > > > >>> > >
> >>> > > > > > >>> > > Which tool to use to convert it to required format ?
> >>> > > > > > >>> > >
> >>> > > > > > >>> > > Thanks in Advance !
> >>> > > > > > >>> > >
> >>> > > > > > >>> > > Thanks,
> >>> > > > > > >>> > > Rajesh
> >>> > > > > > >>> > >
> >>> > > > > > >>> >
> >>> > > > > > >>>
> >>> > > > > > >>
> >>> > > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>
