So, a DRM is a set of one or more files, where each SequenceFile IntWritable/VectorWritable pair is a row number and a full-width row vector? Then the ordering is in the IntWritable keys.
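
To make the question concrete, here is a minimal sketch of that reading in plain Java. It uses no Hadoop or Mahout classes: the in-memory lists stand in for part-* files, and the names DrmRowOrder, row, sampleParts and rowsInKeyOrder are made up for illustration. The assumption is that row order can be recovered from the integer keys alone, whatever order the parts and their records arrive in.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java model of a multi-file DRM. Nothing here is Hadoop or Mahout API;
// every name is hypothetical and exists only for this illustration.
class DrmRowOrder {

    // One (row number, row vector) pair, standing in for an
    // IntWritable/VectorWritable record.
    static Map.Entry<Integer, double[]> row(int index, double... values) {
        return new AbstractMap.SimpleEntry<>(index, values);
    }

    // Two in-memory "part files" whose records are in no particular order.
    static List<List<Map.Entry<Integer, double[]>>> sampleParts() {
        List<Map.Entry<Integer, double[]>> part0 =
            Arrays.asList(row(3, 3.0, 0.0), row(0, 0.0, 1.0));
        List<Map.Entry<Integer, double[]>> part1 =
            Arrays.asList(row(2, 2.0, 2.0), row(1, 1.0, 0.0));
        return Arrays.asList(part0, part1);
    }

    // Rebuild the matrix rows in row-number order using only the keys.
    // A repeated row number is treated as an error in this sketch.
    static SortedMap<Integer, double[]> rowsInKeyOrder(
            List<List<Map.Entry<Integer, double[]>>> parts) {
        SortedMap<Integer, double[]> rows = new TreeMap<>();
        for (List<Map.Entry<Integer, double[]>> part : parts) {
            for (Map.Entry<Integer, double[]> pair : part) {
                if (rows.put(pair.getKey(), pair.getValue()) != null) {
                    throw new IllegalStateException("duplicate row " + pair.getKey());
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Iterating a TreeMap visits keys in ascending order.
        for (Map.Entry<Integer, double[]> e : rowsInKeyOrder(sampleParts()).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

If that is right, the order of the part files and of the records inside them would not matter to a consumer that only needs rows by index. Holding every row in one map is just for the sketch and would not scale to a real matrix.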

On Sun, Nov 13, 2011 at 10:56 PM, Jake Mannix <[email protected]> wrote:

> I don't think we currently make any guarantees about sort-order of the parts
> themselves, or among the various part-files, as they may be created by any
> number of map-reduce jobs, and are then consumed by map-reduce jobs
> which have no inter-process communication.
>
> What would ordering even *mean* among map-inputs? Or are you just
> referring to in each chunk itself? Or for non-MR use of the files?
>
> -jake
>
> On Sun, Nov 13, 2011 at 10:38 PM, Ted Dunning <[email protected]> wrote:
>
> > Make sure that the files can be ordered, of course. Losing the ordering
> > can be really bad.
> >
> > On Sun, Nov 13, 2011 at 10:34 PM, Jake Mannix <[email protected]> wrote:
> >
> > > Yeah, in particular, DistributedRowMatrix "is" simply a
> > > SequenceFile<IntWritable,VectorWritable>, when in its serialized form.
> > > As such, this "file" can be (and typically is) a series of part-* files
> > > in a directory (typically on HDFS).
> > >
> > > -jake
> > >
> > > On Sun, Nov 13, 2011 at 10:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > > It's my understanding drm can be multifile. In fact, stuff like
> > > > seq2sparse will produce multifile output, being a MR job itself.
> > > >
> > > > On Nov 12, 2011 3:23 PM, "Lance Norskog" <[email protected]> wrote:
> > > >
> > > > > Is there a convention for multi-file matrices? For example, the
> > > > > DistributedRowMatrix?
> > > > >
> > > > > --
> > > > > Lance Norskog
> > > > > [email protected]

--
Lance Norskog
[email protected]
