Sorry for the confusion. The number of clusters here is decided by canopy; with this data it comes to 60 to 70 clusters.

My question is: which part of the ssvd output (U, V, Sigma) should be used as the input to canopy?
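For what it's worth, the rows of U line up one-to-one with the input
instances, so U (usually scaled by the singular values) is the natural
input to canopy/kmeans; Sigma by itself is just the k singular values,
and V describes the features rather than the instances. Below is a
minimal sketch under those assumptions. The --uHalfSigma flag, the
Euclidean distance measure, and the t1/t2 values are illustrative
choices rather than anything prescribed in this thread, and flag
spellings can vary across Mahout versions:

# Fold Sigma^0.5 into U so distances in the reduced space reflect the
# singular-value scaling (assumes your build supports --uHalfSigma).
mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --uHalfSigma true --reduceTasks 2 -ow

# Cluster the rows of U. The SVD output is dense with signed values, so
# a Euclidean (or cosine) measure is a safer default than Tanimoto, and
# t1/t2 must be re-tuned for this space, with t1 > t2 (placeholders here).
mahout canopy -i /user/hadoop/t/input-set-svd/U -o /user/hadoop/t/input-set-canopy-centroids -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 0.8 -t2 0.4

Note also that the run quoted below used Tanimoto with t1 = 0.001 and
t2 = 0.002: with a removal radius that tiny, almost no points get taken
off the candidate list, so nearly every point seeds its own canopy,
which by itself could account for the 1400+ single-member clusters.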
On May 24, 2013 3:56 AM, "Ted Dunning" <[email protected]> wrote:

> Rajesh,
>
> This is very confusing.
>
> You have 1500 things that you are clustering into more than 1400
> clusters. There is no way for most of these clusters to have >1 member,
> simply because there aren't enough items compared to the number of
> clusters.
>
> Is there a typo here?
>
> On Thu, May 23, 2013 at 5:34 AM, Rajesh Nikam <[email protected]> wrote:
>
> > Hi,
> >
> > I have an input test set of 1500 instances with 1000+ features. I want
> > to do SVD to reduce features. I have followed the steps below, which
> > generate 1400+ clusters, and 99% of the clusters contain 1 instance :(
> >
> > Please let me know what is wrong in the steps below:
> >
> > mahout arff.vector --input /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict
> >
> > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow
> >
> > mahout canopy -i */user/hadoop/t/input-set-svd/U* -o /user/hadoop/t/input-set-canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure *-t1 0.001 -t2 0.002*
> >
> > mahout kmeans -i */user/hadoop/t/input-set-svd/U* -c /user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o /user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
> >
> > mahout clusterdump -dt sequencefile -i /user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/ -n 20 -b 100 -o /mnt/cluster/t/cdump-input-set.txt -p /user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/ --evaluate
> >
> > Thanks in advance!
> >
> > Rajesh
> >
> > On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > PPS As far as the tool for arff, I am frankly not sure, but it
> > > sounds like you've already solved this.
> > >
> > > On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > > ps As far as U, V data "close to zero": yes, that's what you'd
> > > > expect.
> > > >
> > > > Here, "close to zero" still means much bigger than a rounding
> > > > error, of course. E.g. 1E-12 is indeed a small number, and 1E-16
> > > > to 1E-18 would indeed be "close to zero" for the purposes of
> > > > singularity. 1E-2..1E-5 are actually quite "sizeable" numbers by
> > > > the scale of IEEE 754 arithmetic.
> > > >
> > > > U and V are orthonormal (which means their column vectors have
> > > > euclidean norm of 1). Note that for large m and n (large inputs)
> > > > they are also extremely skinny. The larger the input is, the
> > > > smaller the elements of U and/or V are going to be.
> > > >
> > > > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > > Sounds like dimensionality reduction to me. You may want to use
> > > > > ssvd -pca
> > > > >
> > > > > Apologies for brevity. Sent from my Android phone.
> > > > > -Dmitriy
> > > > >
> > > > > On May 21, 2013 6:27 AM, "Rajesh Nikam" <[email protected]> wrote:
> > > > >
> > > > > > Hello Ted,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > I started exploring SVD based on the mention that it could
> > > > > > help to drop features which are not relevant for clustering.
> > > > > >
> > > > > > My objective is to reduce the number of features before
> > > > > > passing them to clustering, and to keep just the important
> > > > > > features:
> > > > > >
> > > > > > arff/csv ==> ssvd (for dimensionality reduction) ==> clustering
> > > > > >
> > > > > > Could you please illustrate the mahout steps to build the
> > > > > > above pipeline?
> > > > > >
> > > > > > I think Lanczos SVD needs to be used for an m x m matrix.
> > > > > >
> > > > > > I have tried to check ssvd: I used arff.vector to convert the
> > > > > > arff/csv to a vector file, which is then given as input to
> > > > > > ssvd, and then dumped U, V and sigma using vectordump.
> > > > > >
> > > > > > I see that most of the values dumped are near 0. I don't
> > > > > > understand whether this is correct or not.
> > > > > >
> > > > > > {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
> > > > > >
> > > > > > {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
> > > > > >
> > > > > > Thanks,
> > > > > > Rajesh
> > > > > >
> > > > > > On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <[email protected]> wrote:
> > > > > >
> > > > > > > Are you using Lanczos instead of SSVD for a reason?
> > > > > > >
> > > > > > > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I have an arff/csv file containing input data that I want
> > > > > > > > to pass to svd: Lanczos Singular Value Decomposition.
> > > > > > > >
> > > > > > > > Which tool should I use to convert it to the required
> > > > > > > > format?
> > > > > > > >
> > > > > > > > Thanks in advance!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Rajesh
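Two footnotes on the answers quoted above. First, on the "near zero"
values: an orthonormal column of length m has squared entries summing
to 1, so a typical entry is on the order of 1/sqrt(m); for m = 1500
that is about 2.6E-2, which matches the 1E-2..1E-4 range in the dumps.
Second, a hedged sketch of the ssvd -pca pipeline Dmitriy suggests,
with paths and k carried over from the thread (the --pca and vectordump
flag spellings may vary across Mahout versions):

# Vectorize the ARFF input, then run SSVD in PCA mode so that column
# means are subtracted before the decomposition, which is usually what
# you want when the goal is dimensionality reduction rather than raw SVD.
mahout arff.vector --input /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict
mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --pca true --reduceTasks 2 -ow

# Spot-check magnitudes in U: entries around 1/sqrt(1500) ~ 2.6E-2 and
# smaller are normal for orthonormal columns, not a sign of a broken run.
mahout vectordump -i /user/hadoop/t/input-set-svd/U -o /mnt/cluster/t/U-dump.txt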
