Re: Question Regarding Distributed Row Matrix

Jake Mannix Thu, 05 May 2011 10:20:03 -0700

Vckay,

  People don't typically take a raw Text file which has no keys, and build
a DistributedRowMatrix from it.  You typically have something you want
to key on (file name, guid from a database, embedded timestamp, etc).
If you don't have any ids for your rows, you'll need to generate some.


  If you look at what we do in RowIdJob, it maps over a SequenceFile
of Text -> VectorWritable (which is what you get after running
seqdirectory and then seq2sparse: filename -> vector), and turns this
into a pair of sequence files, Int -> Text and Int -> VectorWritable.
The first is a "dictionary" of which int (docId) maps to which filename,
and the second is a true DistributedRowMatrix, ready for working with
transpose, svd, etc.
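
  As a rough illustration of that idea (plain Python, not Mahout code;
the function and variable names here are made up for the example), the
job boils down to walking the (filename, vector) pairs once and handing
out consecutive ints:

```python
from typing import Dict, Iterable, List, Tuple


def assign_row_ids(
    named_vectors: Iterable[Tuple[str, List[float]]],
) -> Tuple[Dict[int, str], Dict[int, List[float]]]:
    """Turn (name, vector) pairs into (docId -> name, docId -> vector).

    Hypothetical helper: in-memory dicts stand in for the two sequence files.
    """
    doc_index: Dict[int, str] = {}        # the "dictionary": docId -> filename
    matrix: Dict[int, List[float]] = {}   # the matrix rows: docId -> vector
    for doc_id, (name, vector) in enumerate(named_vectors):
        doc_index[doc_id] = name
        matrix[doc_id] = vector
    return doc_index, matrix


docs = [("a.txt", [1.0, 0.0]), ("b.txt", [0.0, 2.0])]
doc_index, matrix = assign_row_ids(docs)
```

  The ids are just the order in which the rows are encountered, which is
why you keep the dictionary around to get back to the original names.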

  Note that RowIdJob is not truly scalable: it iterates over the entire
input sequentially in a single process, so it gets no benefit from
parallelism.
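
  If you need the numbering itself to be parallel, one general approach
(this is not what RowIdJob does, just a sketch under the assumption that
you can count the rows in each input split first) is two passes: count
rows per split, turn the counts into starting offsets, then let each
split number its own rows independently. All names below are
hypothetical:

```python
from typing import List, Tuple


def split_offsets(split_sizes: List[int]) -> List[int]:
    """Starting row number for each split (running total of earlier sizes)."""
    offsets: List[int] = []
    total = 0
    for size in split_sizes:
        offsets.append(total)
        total += size
    return offsets


def number_split(rows: List[str], offset: int) -> List[Tuple[int, str]]:
    """Number one split's rows; needs only that split and its offset."""
    return [(offset + i, row) for i, row in enumerate(rows)]


# Pass 1: each split reports its row count (done in parallel in practice).
splits = [["r0", "r1"], ["r2"], ["r3", "r4", "r5"]]
offsets = split_offsets([len(s) for s in splits])

# Pass 2: each split numbers its rows from its own offset.
numbered = [
    pair
    for rows, offset in zip(splits, offsets)
    for pair in number_split(rows, offset)
]
```

  Only the tiny list of counts has to pass through a single place; the
rows themselves never do.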

  -jake


On Thu, May 5, 2011 at 6:57 AM, Vckay <[email protected]> wrote:

> OK. I do plan to use SVD and transpose. Assuming you are correct, I am
> curious then: How are people solving this problem? (Surely not all data has
> row tags in it). A solution I had in mind was to use a single reducer (have
> one key coming in from mapper) so that the single reducer is able to put in
> a row number. However, this is not a clean solution since it appears to
> have to do it serially.
>
> On Thu, May 5, 2011 at 12:49 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > How the keys in sequence files are interpreted depends on the
> > restrictions of a particular algorithm. We held a discussion on this
> > recently, and I think the consensus was that we don't want to lock DRM
> > as a format to a particular interpretation of the keys in the file --
> > it is left to the client's code to interpret them for the ultimate
> > goal of vectorization.
> >
> > However, different algorithms may interpret it differently. E.g.
> > stochastic SVD is agnostic of both the key and its class and just
> > copies it into the keys of the left eigenvector matrix, whereas
> > Lanczos SVD (I think) requires them to be IntWritable (and may also
> > require them to be unique -- I am not 100% sure). Similarly, matrix
> > transpose (I think) would also require them to be IntWritable and, on
> > top of that, interpret them as row numbers for the sake of
> > transposition. (I might be wrong about that last one).
> >
> > I am not sure about KMeans code.
> >
> > On Wed, May 4, 2011 at 8:54 PM, Vckay <[email protected]> wrote:
> > > Hello all,
> > >  I am trying to create a distributed row matrix of my data, which is
> > > currently available as text input, with each line supposed to become a
> > > row of the distributed row matrix. I am using the Spectral KMeans code
> > > as a way of understanding how DistributedRowMatrix works and I am sort
> > > of confused. Specifically: Does DistributedRowMatrix require that the
> > > SequenceFiles have the row ID as the "Key"?
> > > (The Spectral KMeans code implements that, which is easy because their
> > > input's first word has that information. However, since as far as I can
> > > see TextInputFormat just renders a unique byte offset (not necessarily
> > > the line number), I can't recover the line number from my data.
> > > Furthermore, suppose I do change my data to, say, a bunch of images
> > > living in a flat directory: I am thinking of having the "key" be some
> > > combination of the file number and this byte offset.)
> > >
> > > Thanks
> > >
> >
>
