https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000 :
"In most cases where you might be looking to reduce dimensionality while retaining variance, you probably need combination of options -pca true -U false -V false -us true. See §3 for details"

On Thu, May 23, 2013 at 6:24 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Also, for the dimensionality reduction it is important among other things
> to re-center your input first, which is why you also want "-pca true".
>
> On Thu, May 23, 2013 at 6:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Did you specify the -us option? SSVD by default produces only U, V and Sigma,
>> but it can produce more, e.g. U*Sigma, U*sqrt(Sigma) etc., if you ask for
>> it. Alternatively, you can suppress any of U, V (you can't suppress
>> sigma, but that doesn't cost anything in space anyway).
>>
>> On Thu, May 23, 2013 at 6:20 PM, Rajesh Nikam <[email protected]> wrote:
>>
>>> I got all three (U, V and sigma) from ssvd; however, which one should I use
>>> as input to canopy?
>>> On May 24, 2013 6:47 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
>>>
>>> > I think you want U*Sigma.
>>> >
>>> > What you want is ssvd ... -pca true ... -us true ... see the manual.
>>> >
>>> > On Thu, May 23, 2013 at 6:07 PM, Rajesh Nikam <[email protected]> wrote:
>>> >
>>> > > Sorry for the confusion. Here the number of clusters is decided by canopy.
>>> > > With the data as it is, there are 60 to 70 clusters.
>>> > >
>>> > > My question is: which part of the ssvd output (U, V, Sigma) should be
>>> > > used as input to canopy?
>>> > > On May 24, 2013 3:56 AM, "Ted Dunning" <[email protected]> wrote:
>>> > >
>>> > > > Rajesh,
>>> > > >
>>> > > > This is very confusing.
>>> > > >
>>> > > > You have 1500 things that you are clustering into more than 1400 clusters.
>>> > > >
>>> > > > There is no way for most of these clusters to have >1 member, simply
>>> > > > because there aren't enough items compared to the clusters.
>>> > > >
>>> > > > Is there a typo here?
>>> > > >
>>> > > > On Thu, May 23, 2013 at 5:34 AM, Rajesh Nikam <[email protected]> wrote:
>>> > > >
>>> > > > > Hi,
>>> > > > >
>>> > > > > I have an input test set of 1500 instances with 1000+ features. I want
>>> > > > > to do SVD to reduce the features. I followed the steps below, which
>>> > > > > generate 1400+ clusters; 99% of the clusters contain 1 instance :(
>>> > > > >
>>> > > > > Please let me know what is wrong in the steps below -
>>> > > > >
>>> > > > > mahout arff.vector --input /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict
>>> > > > >
>>> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow
>>> > > > >
>>> > > > > mahout canopy -i /user/hadoop/t/input-set-svd/U -o /user/hadoop/t/input-set-canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.001 -t2 0.002
>>> > > > >
>>> > > > > mahout kmeans -i /user/hadoop/t/input-set-svd/U -c /user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o /user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>> > > > >
>>> > > > > mahout clusterdump -dt sequencefile -i /user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/ -n 20 -b 100 -o /mnt/cluster/t/cdump-input-set.txt -p /user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/ --evaluate
>>> > > > >
>>> > > > > Thanks in advance!
>>> > > > >
>>> > > > > Rajesh
>>> > > > >
>>> > > > > On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > >
>>> > > > > > PPS As far as the tool for arff, I am frankly not sure, but it sounds
>>> > > > > > like you've already solved this.
>>> > > > > >
>>> > > > > > On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >
>>> > > > > > > PS As far as U, V data "close to zero", yes, that's what you'd expect.
>>> > > > > > >
>>> > > > > > > Here, "close to zero" still means much bigger than a rounding error,
>>> > > > > > > of course. E.g. 1E-12 is indeed a small number, and 1E-16 to 1E-18
>>> > > > > > > would indeed be "close to zero" for the purposes of singularity.
>>> > > > > > > 1E-2..1E-5 are actually quite "sizeable" numbers by the scale of
>>> > > > > > > IEEE 754 arithmetic.
>>> > > > > > >
>>> > > > > > > U and V are orthonormal (which means their column vectors have
>>> > > > > > > Euclidean norm of 1). Note that for large m and n (large inputs) they
>>> > > > > > > are also extremely skinny. The larger the input is, the smaller the
>>> > > > > > > elements of U and/or V are going to be.
>>> > > > > > >
>>> > > > > > > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > > >
>>> > > > > > >> Sounds like dimensionality reduction to me. You may want to use ssvd -pca
>>> > > > > > >>
>>> > > > > > >> Apologies for brevity. Sent from my Android phone.
>>> > > > > > >> -Dmitriy
>>> > > > > > >> On May 21, 2013 6:27 AM, "Rajesh Nikam" <[email protected]> wrote:
>>> > > > > > >>
>>> > > > > > >>> Hello Ted,
>>> > > > > > >>>
>>> > > > > > >>> Thanks for the reply.
>>> > > > > > >>>
>>> > > > > > >>> I started exploring SVD because it was mentioned that it could help
>>> > > > > > >>> to drop features which are not relevant for clustering.
>>> > > > > > >>>
>>> > > > > > >>> My objective is to reduce the number of features before passing them
>>> > > > > > >>> to clustering, and to keep just the important features.
>>> > > > > > >>>
>>> > > > > > >>> arff/csv ==> ssvd (for dimensionality reduction) ==> clustering
>>> > > > > > >>>
>>> > > > > > >>> Could you please illustrate the mahout props needed to join the
>>> > > > > > >>> above pipeline?
>>> > > > > > >>>
>>> > > > > > >>> I think Lanczos SVD needs to be used for an mxm matrix.
>>> > > > > > >>>
>>> > > > > > >>> I have tried ssvd: I used arff.vector to convert arff/csv to a
>>> > > > > > >>> vector file, which is then given as input to ssvd, and then dumped
>>> > > > > > >>> U, V and sigma using vectordump.
>>> > > > > > >>>
>>> > > > > > >>> I see most of the values dumped are near 0. I don't understand
>>> > > > > > >>> whether this is correct or not.
>>> > > > > > >>>
>>> > > > > > >>> {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
>>> > > > > > >>>
>>> > > > > > >>> {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
>>> > > > > > >>>
>>> > > > > > >>> Thanks,
>>> > > > > > >>> Rajesh
>>> > > > > > >>>
>>> > > > > > >>> On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <[email protected]> wrote:
>>> > > > > > >>>
>>> > > > > > >>> > Are you using Lanczos instead of SSVD for a reason?
>>> > > > > > >>> >
>>> > > > > > >>> > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <[email protected]> wrote:
>>> > > > > > >>> >
>>> > > > > > >>> > > Hello,
>>> > > > > > >>> > >
>>> > > > > > >>> > > I have arff / csv file containing input data that I want to pass
>>> > > > > > >>> > > to svd: Lanczos Singular Value Decomposition.
>>> > > > > > >>> > >
>>> > > > > > >>> > > Which tool to use to convert it to required format?
>>> > > > > > >>> > >
>>> > > > > > >>> > > Thanks in Advance!
>>> > > > > > >>> > >
>>> > > > > > >>> > > Thanks,
>>> > > > > > >>> > > Rajesh
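
Note for readers of this thread: the advice given here (re-center the input, then cluster on U*Sigma rather than on U) can be illustrated outside Mahout. What follows is a minimal NumPy sketch, not Mahout code and not how SSVD is implemented internally; the function name pca_reduce, the toy data, and the choice k = 5 are assumptions made purely for illustration. It also shows why entries of U look "near 0": each column of U has unit Euclidean norm, so with m rows the root-mean-square entry of a column is exactly 1/sqrt(m).

```python
import numpy as np


def pca_reduce(a, k):
    """Illustrative helper (hypothetical name): PCA of the rows of `a` via SVD.

    Returns the k-dimensional representation U_k * Sigma_k, together with
    U_k, all singular values, and the top-k right singular vectors.
    """
    # Re-center: subtract each column's mean (the effect "-pca true" is
    # described as having in the thread).
    centered = a - a.mean(axis=0)
    # Thin SVD: centered = U @ diag(s) @ Vt
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Scale each of the first k columns of U by its singular value.
    reduced = u[:, :k] * s[:k]
    return reduced, u[:, :k], s, vt[:k]


# Toy data (an assumption): 1500 rows, 50 features, 5 underlying factors,
# a constant offset of 10, and a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1500, 5))
mixing = rng.normal(size=(5, 50))
a = latent @ mixing + 10.0 + 0.01 * rng.normal(size=(1500, 50))

k = 5
reduced, u_k, s, vt_k = pca_reduce(a, k)

# Fraction of total variance kept by the first k components.
retained = (s[:k] ** 2).sum() / (s ** 2).sum()

# Root-mean-square entry of each column of U_k; equals 1/sqrt(number of rows).
rms = np.sqrt((u_k ** 2).mean(axis=0))

print("reduced shape:", reduced.shape)
print("variance retained:", retained)
print("RMS entry of U columns:", rms)
```

With m = 1500 rows, as in the original question, 1/sqrt(m) is about 0.026, so small entries in U are expected rather than a sign of a problem. U*Sigma rescales each column by its singular value, so distances between rows reflect how much variance each component carries, whereas in U alone every component has the same unit norm.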
