Vijay, how many columns do you have in the CSV? That is the number you will be
reducing.
csv:
11,22,33,44,55
13,23,34,45,56
would be dense vectors:
Key: 1: Value: {1:11,2:22,3:33,4:44,5:55}
Key: 2: Value: {1:13,2:23,3:34,4:45,5:56}
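That mapping can be written down literally; a minimal sketch in plain Python (dicts standing in for Mahout's vector types, 1-based ordinal indexes as above):

```python
csv_rows = [
    [11, 22, 33, 44, 55],
    [13, 23, 34, 45, 56],
]

# Row key -> {dimension index: value}, both keys and indexes ordinal, 1-based.
dense = {key: {j: v for j, v in enumerate(row, start=1)}
         for key, row in enumerate(csv_rows, start=1)}

assert dense[1] == {1: 11, 2: 22, 3: 33, 4: 44, 5: 55}
assert dense[2] == {1: 13, 2: 23, 3: 34, 4: 45, 5: 56}
```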
Unless you have some reason to assign different dimension indexes, the row and
column numbers from your CSV should be used in Mahout. Internal to Mahout, the
dimensions are assumed to be ordinal. If you do have reasons to say column 1
corresponds to something with an ID of 12 (your example below), then you handle
that in the output phase of your problem. In other words, if you get an answer
corresponding to Mahout column index 1, you look up its association to 12
in some dictionary you keep outside of Mahout; same with the row keys. Don't
put external IDs in the matrix unless they really are ordinal dimensions.
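To make the dictionary idea concrete, a hypothetical sketch (the mapping values are taken from the example below; none of this is Mahout API, just the bookkeeping you would keep alongside it):

```python
# External IDs live outside the matrix; Mahout only ever sees ordinal indexes.
# ordinal column index -> external ID (illustrative values from the thread)
col_id_of = {1: 12, 2: 1, 3: 14, 4: 8, 5: 15}

# Suppose a result refers to Mahout column index 1; translate at output time:
external_id = col_id_of[1]  # -> 12, looked up outside Mahout

# The matrix itself stays ordinal; only the final report uses external IDs.
assert external_id == 12
```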
As Dmitriy said, this sounds like a dense-matrix problem. Usually when I've used
SSVD it was on a very sparse matrix with 80,000-500,000 columns, so the
reduction yields big benefits. Also remember that the output is always a dense
matrix, so ops performed on it tend to be more heavyweight.
On Mar 19, 2014, at 11:16 AM, Dmitriy Lyubimov <[email protected]> wrote:
On Wed, Mar 19, 2014 at 11:00 AM, Vijay B <[email protected]> wrote:
> Thanks a lot for the detailed explanation, it was very helpful.
> I will write a CSV to sequence converter, just needed some clarity on the
> key/value pairs in the sequence file.
>
> Suppose my csv file contains the below values
> 11,22,33,44,55
> 13,23,34,45,56
>
> I assume that the sequence file would look like this, where 12, 1, 14, 8,
> 15 are indices which hold the values
> Key:1: Value:{12:11,1:22,14:33,8:44,15:55}
> Key: 2: Value:{12:13,1:23,14:34,8:45,15:56}
>
I am not sure -- why are you remapping ordinal position into an index
position? Obviously, DRM supports sparse computations (i.e. you can use
either SequentialAccessSparseVector or RandomAccessSparseVector as vector
values, as long as they have the same cardinality). However, if you imply
that all data point ordinal positions map into the same sparse vector
index, then there's no true sparsity here and you could just form dense
vectors in ordinal order of your data, it seems.
Other than that, I don't see any issues with your assumptions.
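The "no true sparsity" point can be seen by counting stored entries: if every row populates every position, a sparse encoding stores just as many (index, value) pairs as a dense array stores values, plus index overhead. A small illustrative sketch:

```python
row = [11, 22, 33, 44, 55]

dense = row                                 # 5 stored values
sparse = {i: v for i, v in enumerate(row)}  # 5 index->value pairs + index overhead

# Sparse representations only pay off when most entries are absent/zero:
mostly_zero = [0, 0, 7, 0, 0]
sparse2 = {i: v for i, v in enumerate(mostly_zero) if v != 0}

assert len(sparse) == len(dense)  # no savings on a fully populated row
assert len(sparse2) == 1          # real savings only with real sparsity
```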
> Please confirm if my understanding is correct.
>
> Thanks,
> Vijay
>
>
> On Wed, Mar 19, 2014 at 11:02 PM, Dmitriy Lyubimov <[email protected]
>> wrote:
>
>> I am not sure if we have direct CSV converters to do that; CSV is not that
>> expressive anyway. But it is not difficult to write up such a converter on
>> your own, I suppose.
>>
>> The steps you need to do are these:
>>
>> (1) prepare a set of data points in the form of (unique vector key, n-vector)
>> tuples. The vector key can be anything that can be adapted into a
>> WritableComparable, notably Long or String. The vector key also has to be
>> unique to make sense for you.
>> (2) save the above tuples into a set of sequence files so that sequence
>> file key is unique vector key, and sequence file value is
>> o.a.m.math.VectorWritable.
>> (3) decide how many dimensions there will be in the reduced space. The point
>> is reduction, i.e. you don't need too many. Say 50.
>> (4) run mahout ssvd --pca true --us true --v false -k <k> ... . The
>> reduced-dimensionality output will be in the folder USigma. The output will
>> have the same keys bound to vectors in the reduced space of k dimensions.
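A sketch of steps (1)-(2) in plain Python: a real converter would write Hadoop SequenceFiles with o.a.m.math.VectorWritable values, so the list of tuples below just stands in for that output; row numbers serve as the unique keys.

```python
import csv
import io

def prepare_tuples(csv_text):
    """Step (1): build (unique vector key, n-vector) tuples from CSV text.

    Keys here are 1-based row numbers; in the real converter any unique key
    adaptable to WritableComparable (Long, String, ...) would do.
    """
    reader = csv.reader(io.StringIO(csv_text))
    tuples = [(row_num, [float(x) for x in row])
              for row_num, row in enumerate(reader, start=1)]
    # Step (2) would then write each tuple to a sequence file:
    #   key   -> the unique vector key
    #   value -> VectorWritable wrapping a DenseVector of the row
    return tuples

tuples = prepare_tuples("11,22,33,44,55\n13,23,34,45,56\n")
assert all(len(v) == 5 for _, v in tuples)  # same cardinality everywhere
```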
>>
>>
>> On Wed, Mar 19, 2014 at 9:45 AM, Vijay B <[email protected]> wrote:
>>
>>> Hi All,
>>> I have a CSV file on which I have to perform dimensionality reduction. I'm
>>> new to Mahout; on doing some searching I understood that SSVD can be used for
>>> performing dimensionality reduction. I'm not sure of the steps that have to
>>> be executed before SSVD; please help me.
>>>
>>> Thanks,
>>> Vijay
>>>
>>
>