Vijay,

SSVD is not really appropriate with 12 columns.  You aren't going to see
any savings at all.

It would be much better suited to extracting the 7 most interesting
dimensions out of 1000 columns.

The problem is not that SSVD will fail, but rather that you will have to
include all the columns in the computation, so the whole random-projection
step is simply wasted effort.

If you want to compute the SVD of a tall skinny matrix you can instead do
this:

     X = A' A
     R' R = X
     U_x D V' = R

     U = A V D^{-1}

The first step is a simple map-reduce.  The second and third steps are
in-memory operations on a tiny n x n matrix.  The fourth step is a map-only
parallel computation (and is optional in many cases, since D and V from the
third step are already the singular values and right singular vectors of A).
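In NumPy notation the four steps can be sketched like this (a sketch only,
not Mahout code; the 1000 x 12 shape is just an illustration of "tall and
skinny"):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 12))   # tall, skinny A

# Step 1 (a simple map-reduce over the rows of A in practice): X = A' A
X = A.T @ A

# Step 2 (in memory): Cholesky factorization X = R' R
R = np.linalg.cholesky(X).T           # NumPy returns lower-triangular L; R = L'

# Step 3 (in memory): SVD of the small n x n factor, U_x D V' = R
Ux, d, Vt = np.linalg.svd(R)

# Step 4 (map-only, optional): U = A V D^{-1}
U = A @ Vt.T / d

# d and Vt are the singular values and right singular vectors of A itself,
# so U D V' reconstructs A:
assert np.allclose(A, (U * d) @ Vt)
```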




On Fri, Mar 21, 2014 at 11:12 AM, Vijay B <[email protected]> wrote:

> Thanks a lot for the reply.
>
> To gain an understanding of how SSVD works, I have taken a sample CSV file
> with 12 columns, and I want to perform dimensionality reduction on it by
> asking SSVD to give me the 7 most significant columns.
>
> Snippet of my input csv
>
> 22,2,44,36,5,9,2824,2,4,733,285,169
> 25,1,150,175,3,9,4037,2,18,1822,254,171
>
> Here's what I have done.
> Step 1: Converted the csv to a sequence file, below is a snippet of the
> output
> Key: 1: Value:
>
> 1:{0:22.0,1:2.0,2:44.0,3:36.0,4:5.0,5:9.0,6:2824.0,7:2.0,8:4.0,9:733.0,10:285.0,11:169.0}
> Key: 2: Value:
>
> 2:{0:25.0,1:1.0,2:150.0,3:175.0,4:3.0,5:9.0,6:4037.0,7:2.0,8:18.0,9:1822.0,10:254.0,11:171.0}
>
> Step 2: Passed this sequence file as input to the SSVD command; below is
> the command I used:
>
> bin/mahout ssvd -i /user/cloudera/seq-data.seq -o
> /user/cloudera/reduced_dimensions1 --rank 7 -us true -V false -U false -pca
> true -ow -t 1
>
> I then executed vectordump on the contents of the USigma folder; below is
> a snippet of the output:
>
>
> {0:191.5917217160858,1:-349.96930149831184,2:-78.21082086351002,3:98.73075808083476,4:-122.89919847376068,5:4.160343860343885,6:1.4336136023933244}
>
> {0:1293.9486625354516,1:697.7408635015182,2:24.0653800270275,3:60.79480738654566,4:11.733624175113523,5:6.479815864873287,6:-0.9269136621845396}
>
> Please help me interpret the above results in the USigma folder.
>
> Thanks,
> Vijay.
>
>
>
>
>
>
>
>
>
> On Fri, Mar 21, 2014 at 9:52 PM, Pat Ferrel <[email protected]> wrote:
>
> > Vijay, how many columns do you have in the CSV? That is the number you
> > will be reducing.
> >
> > csv:
> > 1,22,33,44,55
> > 13,23,34,45,56
> >
> > would be dense vectors:
> > Key:1: Value:{1:1,2:22,3:33,4:44,5:55}
> > Key: 2: Value:{1:13,2:23,3:34,4:45,5:56}
> >
> > Unless you have some reason to assign different dimension indexes, the
> > row and column numbers from your csv should be used in Mahout. Internal
> > to Mahout the dimensions are assumed to be ordinal. If you do have
> > reasons to say column 1 corresponds to something with an id of 12 (your
> > example below), then you handle that in the output phase of your problem.
> > In other words, if you get an answer corresponding to the Mahout column
> > index of 1, you look up its association to 12 in some dictionary you keep
> > outside of Mahout; same with the row keys. Don't put external IDs in the
> > matrix unless they really are ordinal dimensions.
> >
> > As Dmitriy said, this sounds like a dense-matrix problem. Usually when
> > I've used SSVD it was on a very sparse matrix with 80,000-500,000
> > columns, so reduction yields big benefits. Also remember that the output
> > is always a dense matrix, so ops performed on it tend to be more
> > heavyweight.
> >
> >
> > On Mar 19, 2014, at 11:16 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
> >
> > On Wed, Mar 19, 2014 at 11:00 AM, Vijay B <[email protected]> wrote:
> >
> > > Thanks a lot for the detailed explanation, it was very helpful.
> > > I will write a CSV to sequence converter, just needed some clarity on
> the
> > > key/value pairs in the sequence file.
> > >
> > > Suppose my csv file contains the below values
> > > 11,22,33,44,55
> > > 13,23,34,45,56
> > >
> > > I assume that the sequence file would look like this, where 12, 1, 14,
> 8,
> > > 15 are indices which hold the values
> > > Key:1: Value:{12:11,1:22,14:33,8:44,15:55}
> > > Key: 2: Value:{12:13,1:23,14:34,8:45,15:56}
> > >
> >
> > I am not sure -- why are you remapping ordinal position into an index
> > position? Obviously, DRM supports sparse computations (i.e. you can use
> > either SequentialAccessSparseVector or RandomAccessSparseVector as vector
> > values, as long as they have the same cardinality). However, if you imply
> > that all data point ordinal positions map into the same sparse vector
> > index, then there's no true sparsity here and you could just form dense
> > vectors in ordinal order of your data, it seems.
> >
> > Other than that, I don't see any issues with your assumptions.
> >
> >
> > > Please confirm if my understanding is correct.
> > >
> > > Thanks,
> > > Vijay
> > >
> > >
> > > On Wed, Mar 19, 2014 at 11:02 PM, Dmitriy Lyubimov <[email protected]
> > >> wrote:
> > >
> > >> I am not sure if we have direct CSV converters to do that; CSV is not
> > >> that expressive anyway. But it is not difficult to write up such a
> > >> converter on your own, I suppose.
> > >>
> > >> The steps you need to do are these:
> > >>
> > >> (1) prepare a set of data points in the form of (unique vector key,
> > >> n-vector) tuples. The vector key can be anything that can be adapted
> > >> into a WritableComparable -- notably, Long or String. The vector key
> > >> also has to be unique to make sense for you.
> > >> (2) save the above tuples into a set of sequence files so that the
> > >> sequence file key is the unique vector key, and the sequence file
> > >> value is o.a.m.math.VectorWritable.
> > >> (3) decide how many dimensions there will be in the reduced space. The
> > >> key point is that it is reduced, i.e. you don't need too many. Say,
> > >> 50.
> > >> (4) run mahout ssvd --pca true --us true --v false -k <k> .... The
> > >> reduced-dimensionality output will be in the folder USigma. The output
> > >> will have the same keys bound to vectors in the reduced space of k
> > >> dimensions.
> > >>
> > >>
> > >> On Wed, Mar 19, 2014 at 9:45 AM, Vijay B <[email protected]>
> wrote:
> > >>
> > >>> Hi All,
> > >>> I have a CSV file on which I have to perform dimensionality
> > >>> reduction. I'm new to Mahout; on doing some searching I understood
> > >>> that SSVD can be used for performing dimensionality reduction. I'm
> > >>> not sure of the steps that have to be executed before SSVD; please
> > >>> help me.
> > >>>
> > >>> Thanks,
> > >>> Vijay
> > >>>
> > >>
> > >
> >
> >
>
