I should have said "don't forget which row is which".
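
A minimal sketch of that point, in plain Python rather than the Hadoop API. It assumes each part file can be modeled as a list of (row index, vector) pairs; the file names, values, and the helper rows_by_index are hypothetical. The row identity comes from the key, never from file order or record order.

```python
# Hypothetical stand-in for a DRM directory: each "part file" is a list of
# (row_index, vector) pairs, in whatever order the reducers wrote them.
parts = {
    "part-00001": [(3, [0.0, 1.0]), (1, [2.0, 0.0])],
    "part-00000": [(2, [5.0, 5.0]), (0, [1.0, 1.0])],
}


def rows_by_index(parts):
    """Collect rows keyed by row index, ignoring file and record order."""
    rows = {}
    for records in parts.values():
        for row_index, vector in records:
            if row_index in rows:
                # Two records claiming the same row means the matrix is ambiguous.
                raise ValueError(f"duplicate row index {row_index}")
            rows[row_index] = vector
    return rows


rows = rows_by_index(parts)
# Row order is recovered from the keys alone.
ordered = [rows[i] for i in sorted(rows)]
```

If the keys were dropped, the four vectors above could not be put back into row order from the part files alone.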

On Mon, Nov 14, 2011 at 12:06 AM, Jake Mannix <[email protected]> wrote:

> The ordering *can* be chosen to be that.  But nothing in our api
> documentation implies we will always do this, and in fact it completely
> depends on whether the MR job used to create the matrix had reducer
> outputs creating row numbers sequentially.
>
>  -jake
>
> On Sun, Nov 13, 2011 at 11:28 PM, Lance Norskog <[email protected]> wrote:
>
> > So, a DRM is a set of one or more files, where each SequenceFile
> > int/vector pair is a row number and a fully wide vector? Then ordering
> > is in the IntWritable keys.
> >
> > On Sun, Nov 13, 2011 at 10:56 PM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > I don't think we currently make any guarantees about sort-order of the
> > > parts themselves, or among the various part-files, as they may be
> > > created by any number of map-reduce jobs, and are then consumed by
> > > map-reduce jobs which have no inter-process communication.
> > >
> > > What would ordering even *mean* among map-inputs?  Or are you just
> > > referring to in each chunk itself?  Or for non-MR use of the files?
> > >
> > >  -jake
> > >
> > > On Sun, Nov 13, 2011 at 10:38 PM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > Make sure that the files can be ordered, of course.  Losing the
> > > > ordering can be really bad.
> > > >
> > > > On Sun, Nov 13, 2011 at 10:34 PM, Jake Mannix <[email protected]>
> > > > wrote:
> > > >
> > > > > Yeah, in particular, DistributedRowMatrix "is" simply a
> > > > > SequenceFile<IntWritable,VectorWritable>, when in its serialized
> > > > > form.  As such, this "file" can be (and typically is) a series of
> > > > > part-* files in a directory (typically on HDFS).
> > > > >
> > > > >  -jake
> > > > >
> > > > > On Sun, Nov 13, 2011 at 10:23 PM, Dmitriy Lyubimov <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > It's my understanding drm can be multifile. In fact, stuff like
> > > > > > seq2sparse will produce multifile output, being a MR job itself.
> > > > > > On Nov 12, 2011 3:23 PM, "Lance Norskog" <[email protected]> wrote:
> > > > > >
> > > > > > > Is there a convention for multi-file matrices? For example, the
> > > > > > > DistributedRowMatrix?
> > > > > > >
> > > > > > > --
> > > > > > > Lance Norskog
> > > > > > > [email protected]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > [email protected]
> >
>
