Hi,

I have an input test set of 1500 instances with 1000+ features. I want to use
SVD to reduce the number of features. I followed the steps below, which
generated 1400+ clusters; 99% of the clusters contain 1 instance :(

Please let me know what is wrong with the steps below:


mahout arff.vector --input /mnt/cluster/t/input-set.arff --output
/user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict

mahout ssvd --input /user/hadoop/t/input-set-vector/ --output
/user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow

mahout canopy -i /user/hadoop/t/input-set-svd/U -o
/user/hadoop/t/input-set-canopy-centroids -dm
org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.001 -t2
0.002

mahout kmeans -i /user/hadoop/t/input-set-svd/U -c
/user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o
/user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm
org.apache.mahout.common.distance.TanimotoDistanceMeasure

mahout clusterdump -dt sequencefile -i
/user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/ -n 20 -b 100 -o
/mnt/cluster/t/cdump-input-set.txt -p
/user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/ --evaluate

Thanks in advance !

Rajesh
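
One possible factor behind the 1400+ clusters, offered as an assumption rather than a confirmed diagnosis: the canopy step above passes -t1 0.001 -t2 0.002 with TanimotoDistanceMeasure. Canopy clustering is normally described with T1 > T2, and a point opens a new canopy whenever it is not within T2 of an existing center, so a very small T2 tends to produce close to one canopy per point, which k-means then receives as its initial centroids. The Python sketch below is a simplified stand-in, not Mahout's implementation. The data, the count_canopies helper, and the looser threshold 0.5 are hypothetical, and the distance formula is the usual real-valued Tanimoto form, assumed here to match the Mahout measure.

```python
import numpy as np

def tanimoto_distance(a, b):
    """Real-valued Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b)) - dot
    return 0.0 if denom == 0.0 else 1.0 - dot / denom

def count_canopies(points, t2):
    """Count canopy centers: a point opens a new canopy unless it lies
    within t2 of a center that already exists. T1 is left out because,
    in this sketch, it would only affect membership, not the count."""
    centers = []
    for p in points:
        if not any(tanimoto_distance(p, c) < t2 for c in centers):
            centers.append(p)
    return len(centers)

# Hypothetical data: 5 well-separated groups of 60 points in 20 dimensions.
rng = np.random.default_rng(0)
group_centers = 10.0 * np.eye(20)[:5]
points = np.vstack(
    [c + 0.5 * rng.standard_normal((60, 20)) for c in group_centers]
)

tiny = count_canopies(points, t2=0.002)  # threshold taken from the command above
loose = count_canopies(points, t2=0.5)   # hypothetical, much looser threshold
print("canopies with t2=0.002:", tiny)
print("canopies with t2=0.5:  ", loose)
```

On real data the thresholds would have to come from the observed pairwise distances between rows of U, not from this toy example.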




On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <[email protected]> wrote:

> PPS As for the tool for arff, I am frankly not sure, but it sounds like
> you've already solved this.
>
>
> On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > ps as far as U, V data "close to zero", yes that's what you'd expect.
> >
> > Here, "close to zero" still means much bigger than a rounding error, of
> > course. E.g. 1E-12 is indeed a small number, and 1E-16 to 1E-18 would
> > indeed be "close to zero" for the purposes of singularity. 1E-2..1E-5 are
> > actually quite "sizeable" numbers by the scale of IEEE 754 arithmetic.
> >
> > U and V are orthonormal (which means their column vectors have Euclidean
> > norm of 1). Note that for large m and n (large inputs) they are also
> > extremely skinny. The larger the input is, the smaller the elements of U
> > and/or V are going to be.
> >
> >
> >
> > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> Sounds like dimensionality reduction to me. You may want to use ssvd -pca
> >>
> >> Apologies for brevity. Sent from my Android phone.
> >> -Dmitriy
> >> On May 21, 2013 6:27 AM, "Rajesh Nikam" <[email protected]> wrote:
> >>
> >>> Hello Ted,
> >>>
> >>> Thanks for reply.
> >>>
> >>> I started exploring SVD because it was mentioned that it could help to
> >>> drop features which are not relevant for clustering.
> >>>
> >>> My objective is to reduce the number of features before passing them to
> >>> clustering and keep just the important features.
> >>>
> >>> arff/csv ==> ssvd (for dimensionality reduction) ==> clustering
> >>>
> >>> Could you please illustrate the mahout props needed to join the above
> >>> pipeline?
> >>>
> >>> I think Lanczos SVD needs to be used for an m x m matrix.
> >>>
> >>> I have tried ssvd: I used arff.vector to convert arff/csv to a vector
> >>> file, which is then given as input to ssvd, and then dumped U, V and
> >>> sigma using vectordump.
> >>>
> >>> I see most of the dumped values are near 0. I don't understand whether
> >>> this is correct or not.
> >>>
> >>>
> >>>
> {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
> >>>
> >>>
> {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
> >>>
> >>> Thanks,
> >>> Rajesh
> >>>
> >>>
> >>>
> >>> On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <[email protected]>
> >>> wrote:
> >>>
> >>> > Are you using Lanczos instead of SSVD for a reason?
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <[email protected]>
> >>> > wrote:
> >>> >
> >>> > > Hello,
> >>> > >
> >>> > > I have an arff / csv file containing input data that I want to pass
> >>> > > to svd: Lanczos Singular Value Decomposition.
> >>> > >
> >>> > > Which tool should I use to convert it to the required format?
> >>> > >
> >>> > > Thanks in Advance !
> >>> > >
> >>> > > Thanks,
> >>> > > Rajesh
> >>> > >
> >>> >
> >>>
> >>
> >
>
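
Dmitriy's remark in the quoted thread, that the columns of U have unit norm and so its entries shrink as the input grows, can be checked numerically. The sketch below uses NumPy's dense SVD on a random matrix as a stand-in for Mahout's ssvd. The u_entry_scale helper and the matrix sizes are hypothetical. Because each of the k columns has unit norm, the root-mean-square entry of an m-row U is 1/sqrt(m), which for m = 1500 is about 0.026, the same order of magnitude as the larger values in the dump above.

```python
import numpy as np

def u_entry_scale(m, n=50, k=20, seed=0):
    """Return (column norms, RMS entry) of the leading k left singular
    vectors of a random m x n matrix (hypothetical stand-in data)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, n))
    # Thin SVD; keep only the leading k left singular vectors.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    u_k = u[:, :k]
    col_norms = np.linalg.norm(u_k, axis=0)
    rms_entry = float(np.sqrt(np.mean(u_k ** 2)))
    return col_norms, rms_entry

for m in (100, 1500, 10000):
    norms, rms = u_entry_scale(m)
    # Columns stay at norm 1 while the typical entry tracks 1/sqrt(m).
    print(m, float(norms.min()), float(norms.max()), rms, 1.0 / np.sqrt(m))
```

Small entries in U are therefore expected for a 1500-row input and do not by themselves indicate a problem with the decomposition.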
