Vckay,

People don't typically take a raw text file that has no keys and build a DistributedRowMatrix from it. You typically have something you want to key on (file name, GUID from a database, embedded timestamp, etc.). If you don't have any ids for your rows, you'll need to generate some.
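
To make "generate some" concrete, here is a minimal sketch in plain Python. It is not Mahout code, and the function name assign_row_ids and the sample file names are made up for illustration. It assumes you already have (key, vector) pairs and walks them once, in a single process, handing out consecutive integer row ids and remembering which original key each id came from.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def assign_row_ids(
    keyed_rows: Iterable[Tuple[str, Vector]],
) -> Tuple[Dict[int, str], Dict[int, Vector]]:
    """Assign consecutive int ids to (key, vector) pairs in one serial pass.

    Returns two mappings:
      - dictionary: row id -> original key (e.g. a file name)
      - matrix:     row id -> vector (the int-keyed rows of the matrix)
    """
    dictionary: Dict[int, str] = {}
    matrix: Dict[int, Vector] = {}
    for row_id, (key, vector) in enumerate(keyed_rows):
        dictionary[row_id] = key
        matrix[row_id] = vector
    return dictionary, matrix


# Hypothetical input: three documents that are already vectorized.
docs = [
    ("a.txt", [1.0, 0.0]),
    ("b.txt", [0.0, 2.0]),
    ("c.txt", [3.0, 3.0]),
]
doc_index, rows = assign_row_ids(docs)
```

The ids depend on the order in which the rows are visited, which is why a pass like this is naturally serial. Keep the id -> key mapping around so you can translate results back to your original keys.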
If you look at what we do in RowIdJob, it maps over a SequenceFile of Text -> VectorWritable (the kind of output you get from the seqdirectory and seq2sparse scripts: filename -> vector), and turns this into a pair of sequence files, Int -> Text and Int -> VectorWritable. The first is a "dictionary" of which int (docId) maps to which filename, and the latter is a true DistributedRowMatrix, ready for working with transpose, svd, etc.

Note that RowIdJob is not truly scalable: it iterates over your entire input directly, in a single process, so it does not use any parallelism.

  -jake

On Thu, May 5, 2011 at 6:57 AM, Vckay <[email protected]> wrote:

> OK. I do plan to use SVD and transpose. Assuming you are correct, I am
> curious then: How are people solving this problem? (Surely not all data
> has row tags in it.) A solution I had in mind was to use a single reducer
> (have one key coming in from the mapper) so that the single reducer is
> able to put in a row number. However, this is not a clean solution since
> it appears to have to do it serially.
>
> On Thu, May 5, 2011 at 12:49 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > The interpretation of the key in sequence files is subject to the
> > restrictions of a particular algorithm. We held a discussion on this
> > recently, and I think the consensus was that we don't want to lock DRM
> > as a format to a particular interpretation of keys in the file -- it is
> > left to the client's code to interpret those for the ultimate goal of
> > vectorization.
> >
> > However, different algorithms may interpret it differently. E.g.
> > stochastic SVD is agnostic of both the key and its class and just
> > copies it into the keys of the left eigenvector matrix, whereas Lanczos
> > SVD (I think) requires them to be IntWritable (and may also require
> > them to be unique -- I am not 100% sure). Similarly, matrix transpose
> > (I think) would also require them to be IntWritable and, on top of
> > that, interpret them as row numbers for the sake of transposition.
> > (I might be wrong about that last one.)
> >
> > I am not sure about the KMeans code.
> >
> > On Wed, May 4, 2011 at 8:54 PM, Vckay <[email protected]> wrote:
> > > Hello all,
> > > I am trying to create a distributed row matrix of my data, which is
> > > currently available as text input, with each line supposed to become
> > > a row of the distributed row matrix. I am using the Spectral KMeans
> > > code as a way of understanding how DistributedRowMatrix works and I
> > > am sort of confused. Specifically: Does DistributedRowMatrix require
> > > that the SequenceFiles have the row ID as the "Key"?
> > > (The Spectral KMeans code implements that, which is easy because the
> > > first word of their input has that information. However, since as far
> > > as I can see TextInputFormat just renders a unique byte offset (not
> > > necessarily the line number), I can't recover the line number from my
> > > data. Furthermore, suppose I do change my data to, say, a bunch of
> > > images living in a flat directory; I am thinking of having the "key"
> > > be some combination of the file number and this byte offset.)
> > >
> > > Thanks
