Thanks a lot for the reply.
To gain an understanding of how SSVD works, I have taken a sample CSV file
with 12 columns, and I want to perform dimensionality reduction on it by
asking SSVD to reduce it to the 7 most significant dimensions.
Snippet of my input CSV:
22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171
Here's what I have done.
Step 1: Converted the csv to a sequence file, below is a snippet of the
output
Key: 1: Value:
1:{0:22.0,1:2.0,2:44.0,3:36.0,4:5.0,5:9.0,6:2824.0,7:2.0,8:4.0,9:733.0,10:285.0,11:169.0}
Key: 2: Value:
2:{0:25.0,1:1.0,2:150.0,3:175.0,4:3.0,5:9.0,6:4037.0,7:2.0,8:18.0,9:1822.0,10:254.0,11:171.0}
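For my own sanity, the row-to-vector mapping I implemented boils down to the
pure-Python sketch below -- this only illustrates the structure of the dump
above (1-based row keys, 0-based column indices as floats), it is not actual
Mahout code:

```python
# Illustrative only: mimic the structure of the sequence-file dump above.
# Each CSV row becomes (row key, {0-based column index: float value}).
def csv_rows_to_vectors(lines):
    vectors = []
    for key, line in enumerate(lines, start=1):
        values = {i: float(v) for i, v in enumerate(line.split(","))}
        vectors.append((key, values))
    return vectors

rows = ["22,2,44,36,5,9,2824,2,4,733,285,169",
        "25,1,150,175,3,9,4037,2,18,1822,254,171"]
for key, vec in csv_rows_to_vectors(rows):
    print(key, vec)
```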
Step 2: Passed this sequence file as input to the SSVD command; below is
the command I used:
bin/mahout ssvd -i /user/cloudera/seq-data.seq -o /user/cloudera/reduced_dimensions1 --rank 7 -us true -V false -U false -pca true -ow -t 1
I then executed vectordump on the contents of the USigma folder; below is a
snippet of the output:
{0:191.5917217160858,1:-349.96930149831184,2:-78.21082086351002,3:98.73075808083476,4:-122.89919847376068,5:4.160343860343885,6:1.4336136023933244}
{0:1293.9486625354516,1:697.7408635015182,2:24.0653800270275,3:60.79480738654566,4:11.733624175113523,5:6.479815864873287,6:-0.9269136621845396}
Please help me interpret the above results in the USigma folder.
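As I understand it, each row of USigma should be the corresponding input row
expressed in the reduced 7-dimensional space: with --pca true, the columns
are mean-centered first, and USigma = U_k * Sigma_k = (A - mean) * V_k.
Here is a minimal numpy sketch of what I expect is happening (plain SVD, not
Mahout's stochastic variant; the toy 4x3 matrix and k=2 are made up):

```python
import numpy as np

# Toy data: 4 rows, 3 columns; reduce to k = 2 dimensions.
A = np.array([[22., 2., 44.],
              [25., 1., 150.],
              [20., 3., 60.],
              [24., 2., 130.]])
k = 2

# --pca true: subtract the column means before factoring.
A_centered = A - A.mean(axis=0)

# Plain (non-stochastic) SVD for illustration.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

# USigma holds each row in the reduced k-dimensional space.
USigma = U[:, :k] * s[:k]

# The same coordinates come from projecting the centered rows onto
# the top-k right singular vectors: (A - mean) @ V_k.
projected = A_centered @ Vt[:k].T
print(np.allclose(USigma, projected))  # True
```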
Thanks,
Vijay.
On Fri, Mar 21, 2014 at 9:52 PM, Pat Ferrel <[email protected]> wrote:
> Vijay, how many columns do you have in the CSV? That is the number you
> will be reducing.
>
> csv:
> 1,22,33,44,55
> 13,23,34,45,56
>
> would be dense vectors:
> Key:1: Value:{1:1,2:22,3:33,4:44,5:55}
> Key: 2: Value:{1:13,2:23,3:34,4:45,5:56}
>
> Unless you have some reason to assign different dimension indexes, the row
> and column numbers from your csv should be used in Mahout. Internal to
> Mahout, the dimensions are assumed to be ordinal. If you do have reason to
> say column 1 corresponds to something with an id of 12 (your example
> below), then you handle that in the output phase of your problem. In other
> words, if you get an answer corresponding to the Mahout column index of 1,
> you look up its association to 12 in some dictionary you keep outside of
> Mahout; the same goes for the row keys. Don't put external ids in the
> matrix unless they really are ordinal dimensions.
>
> As Dmitriy said, this sounds like a dense-matrix problem. Usually when I've
> used SSVD it was on a very sparse matrix with 80,000-500,000 columns, so
> reduction yields big benefits. Also remember that the output is always a
> dense matrix, so ops performed on it tend to be more heavyweight.
>
>
> On Mar 19, 2014, at 11:16 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> On Wed, Mar 19, 2014 at 11:00 AM, Vijay B <[email protected]> wrote:
>
> > Thanks a lot for the detailed explanation, it was very helpful.
> > I will write a CSV to sequence converter, just needed some clarity on the
> > key/value pairs in the sequence file.
> >
> > Suppose my csv file contains the below values
> > 11,22,33,44,55
> > 13,23,34,45,56
> >
> > I assume that the sequence file would look like this, where 12, 1, 14, 8,
> > 15 are indices which hold the values
> > Key:1: Value:{12:11,1:22,14:33,8:44,15:55}
> > Key: 2: Value:{12:13,1:23,14:34,8:45,15:56}
> >
>
> I am not sure -- why are you remapping ordinal positions into index
> positions? Obviously, DRM supports sparse computations (i.e. you can use
> either SequentialAccessSparseVector or RandomAccessSparseVector as vector
> values, as long as they have the same cardinality). However, if you imply
> that all data point ordinal positions map into the same sparse vector
> index, then there's no true sparsity here and you could just form dense
> vectors in the ordinal order of your data, it seems.
>
> Other than that, I don't see any issues with your assumptions.
>
>
> > Please confirm if my understanding is correct.
> >
> > Thanks,
> > Vijay
> >
> >
> > On Wed, Mar 19, 2014 at 11:02 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> I am not sure if we have direct CSV converters to do that; CSV is not
> >> that expressive anyway. But it is not difficult to write such a
> >> converter on your own, I suppose.
> >>
> >> The steps you need to follow are these:
> >>
> >> (1) prepare a set of data points in the form of (unique vector key,
> >> n-vector) tuples. The vector key can be anything that can be adapted
> >> into a WritableComparable, notably Long or String. The vector key also
> >> has to be unique to make sense for you.
> >> (2) save the above tuples into a set of sequence files so that the
> >> sequence file key is the unique vector key, and the sequence file value
> >> is o.a.m.math.VectorWritable.
> >> (3) decide how many dimensions there will be in the reduced space. The
> >> point is that it is reduced, i.e. you don't need too many. Say 50.
> >> (4) run mahout ssvd --pca true --us true --v false -k <k> .... The
> >> reduced-dimensionality output will be in the USigma folder. The output
> >> will have the same keys bound to vectors in the reduced space of k
> >> dimensions.
> >>
> >>
> >> On Wed, Mar 19, 2014 at 9:45 AM, Vijay B <[email protected]> wrote:
> >>
> >>> Hi All,
> >>> I have a CSV file on which I have to perform dimensionality reduction.
> >>> I'm new to Mahout; after doing some searching I understood that SSVD
> >>> can be used for dimensionality reduction. I'm not sure of the steps
> >>> that have to be executed before SSVD; please help me.
> >>>
> >>> Thanks,
> >>> Vijay
> >>>
> >>
> >
>
>