PS. The dspca method, which is an almost exact replica of SSVD --pca true, is also available on Spark and runs on exactly the same sequence-file DRM (there's no CLI though; it needs to be wrapped in Scala code) [1]. It may perform somewhat better than the MR version, although it is new. If you are in the Scala world and looking for an embedded API, this may be the better option for you to try. It is new code and we haven't collected data on its application yet, so it would be awesome if you could try it.
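In case it helps, here is a rough Scala sketch of what the embedded call looks like with the Spark bindings. The imports, the dspca signature (k/p/q parameters) and the U %*% diag(s) step are written from memory, so please double-check them against the current scaladocs; the paths, master URL and app name are just placeholders.

  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._
  import org.apache.mahout.math.decompositions._
  import org.apache.mahout.sparkbindings._

  // Spark-backed Mahout distributed context (or just use the Mahout spark-shell,
  // which provides one for you).
  implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "dspca-example")

  // Read the same (key, VectorWritable) sequence-file DRM that mahout ssvd consumes.
  val drmA = drmDfsRead("/path/to/input/drm")

  // Distributed stochastic PCA: k reduced dimensions, q power iterations.
  val (drmU, drmV, s) = dspca(drmA, k = 50, q = 1)

  // U * Sigma plays the same role as the USigma folder produced by the MR job:
  // the rows are the original keys mapped into the reduced k-dimensional space.
  val drmUSigma = drmU %*% diagv(s)
  drmUSigma.dfsWrite("/path/to/output/USigma")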
[1] http://mahout.apache.org/users/sparkbindings/home.html

On Wed, Mar 19, 2014 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I am not sure if we have direct CSV converters to do that; CSV is not that
> expressive anyway. But it is not difficult to write up such a converter on
> your own, I suppose.
>
> The steps you need to do are these:
>
> (1) Prepare the set of data points as (unique vector key, n-vector)
> tuples. The vector key can be anything that can be adapted into a
> WritableComparable, notably Long or String. The vector key also has to be
> unique to make sense for you.
> (2) Save the above tuples into a set of sequence files so that the sequence
> file key is the unique vector key and the sequence file value is
> o.a.m.math.VectorWritable.
> (3) Decide how many dimensions there will be in the reduced space. The point
> is that the space is reduced, i.e. you don't need too many. Say 50.
> (4) Run mahout ssvd --pca true --us true --v false -k <k> .... The
> reduced-dimensionality output will be in the folder USigma. The output will
> have the same keys bound to vectors in the reduced space of k dimensions.
>
>
> On Wed, Mar 19, 2014 at 9:45 AM, Vijay B <[email protected]> wrote:
>
>> Hi All,
>> I have a CSV file on which I have to perform dimensionality reduction. I'm
>> new to Mahout; on doing some search I understood that SSVD can be used for
>> performing dimensionality reduction. I'm not sure of the steps that have to
>> be executed before SSVD, please help me.
>>
>> Thanks,
>> Vijay
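Re steps (1)-(2) in the quoted message: a minimal Scala sketch of such a CSV converter, assuming one numeric row per CSV line with no header, using the row index as the unique key; the file paths are placeholders and you'd want to adapt this for HDFS and large inputs.

  import scala.io.Source
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.io.{LongWritable, SequenceFile}
  import org.apache.mahout.math.{DenseVector, VectorWritable}

  val csvPath = "/path/to/input.csv"                       // assumed local CSV, no header
  val seqPath = new Path("/path/to/input/drm/part-00000")  // DRM directory read by ssvd

  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val writer = SequenceFile.createWriter(fs, conf, seqPath,
    classOf[LongWritable], classOf[VectorWritable])

  try {
    for ((line, rowIdx) <- Source.fromFile(csvPath).getLines().zipWithIndex) {
      // One CSV row -> one dense n-vector; the row index is the unique vector key.
      val values = line.split(",").map(_.trim.toDouble)
      writer.append(new LongWritable(rowIdx), new VectorWritable(new DenseVector(values)))
    }
  } finally {
    writer.close()
  }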
