Yeah, I realized later on that I had been thinking of sequence files
containing text/vector pairs as matrices.  I haven't yet played with the
matrix manipulation utilities (well, not successfully), and that confusion
may explain why.  I suppose I can't just treat a file of named vectors
as a matrix -- instead I imagine I need another utility to convert that
vector file into a matrix file (with a dictionary file translating row
numbers to vector names).
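
As a rough sketch of that conversion, using plain Java collections rather
than Mahout's own classes: the class name NamedVectorIndexer, its fields,
and the map-based input are all hypothetical stand-ins, not an existing
utility.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: assigns row numbers to named sparse vectors. */
final class NamedVectorIndexer {
  /** Row dictionary: the position in this list is the row number. */
  final List<String> rowNames = new ArrayList<>();
  /** One "row,column,value" line per non-zero element. */
  final List<String> triples = new ArrayList<>();

  /**
   * Input maps vector name -> (column index -> value).
   * The map's iteration order becomes the row order, so pass an
   * ordered map (e.g. LinkedHashMap) if the order matters.
   */
  static NamedVectorIndexer index(Map<String, Map<Integer, Double>> namedVectors) {
    NamedVectorIndexer result = new NamedVectorIndexer();
    for (Map.Entry<String, Map<Integer, Double>> vector : namedVectors.entrySet()) {
      int row = result.rowNames.size();      // next free row number
      result.rowNames.add(vector.getKey());  // remember which name it stands for
      for (Map.Entry<Integer, Double> element : vector.getValue().entrySet()) {
        if (element.getValue() != 0.0) {     // keep the output sparse
          result.triples.add(row + "," + element.getKey() + "," + element.getValue());
        }
      }
    }
    return result;
  }
}
```

Writing rowNames out one name per line would give the row dictionary file,
and triples would be the body of the matrix file.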

Right now I'm picturing the files that come out of the seq2sparse tool
intended for clustering purposes.  You get a dictionary that translates
columns to words, but you still have document names as rows.  I suppose in
the end it would be good to support two dictionary files -- one for rows and
one for columns (although if you ever transposed the matrix that could get
confusing, so maybe items/features would be better generic names, or
better yet a command line option for explicitly naming the row/column dict
files).

So I suppose now I'm proposing two utilities -- one for converting vectors
into matrices (assuming one doesn't already exist) and another for spitting
out matrices in various formats (dense/sparse, horizontal rows/vertical
triples, specifiable delimiter/style).
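
To make the output side concrete, here is a minimal sketch of the two
layouts with a caller-supplied delimiter.  It works on a plain double[][]
instead of a Mahout Matrix, and the class and method names
(MatrixFormatter, denseRows, sparseTriples) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical formatter for the dense-row and sparse-triple layouts. */
final class MatrixFormatter {
  /** Dense layout: one line per row, every value joined by the delimiter. */
  static List<String> denseRows(double[][] m, String delimiter) {
    List<String> lines = new ArrayList<>();
    for (double[] row : m) {
      StringBuilder line = new StringBuilder();
      for (int j = 0; j < row.length; j++) {
        if (j > 0) {
          line.append(delimiter);
        }
        line.append(row[j]);
      }
      lines.add(line.toString());
    }
    return lines;
  }

  /** Sparse layout: one "row<delim>column<delim>value" line per non-zero. */
  static List<String> sparseTriples(double[][] m, String delimiter) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m[i].length; j++) {
        if (m[i][j] != 0.0) {
          lines.add(i + delimiter + j + delimiter + m[i][j]);
        }
      }
    }
    return lines;
  }
}
```

The indices here are 0-based, whereas R's sparseMatrix expects 1-based i
and j by default, so triples like these would need shifting by one before
loading them there.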

One thing that would be really nice is a quick reference of standard input
and output file types for all the various utility functions.  A lot of them
are standard transforms and it would be nice to know what they're
transforming from and to at a glance (as well as what's available).

I'm getting pulled onto another project right now, but I have the house to
myself this weekend, so I should be able to work on something then.  I just
submitted my first jira the other day; what's the standard protocol for
something like this?  Submit the code as an svn patch file in a jira?

On Wed, Aug 17, 2011 at 6:28 PM, Ted Dunning <[email protected]> wrote:

> Sounds like a good idea in general.
>
> Here's a tiny bit of code to get you rolling.  Adding this to the existing
> VectorDumper is better than using a standalone class as I have here.  Your
> thought about long strings is also very pertinent.
>
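> import java.io.PrintWriter;
> import java.util.Iterator;
>
> import org.apache.mahout.math.Matrix;
> import org.apache.mahout.math.MatrixSlice;
> import org.apache.mahout.math.Vector;
>
> public class DumpTriples {
>   // Writes one "row,column,value" line per non-zero element of m.
>   public static void dump(PrintWriter out, Matrix m) {
>     for (MatrixSlice row : m) {
>       Iterator<Vector.Element> i = row.vector().iterateNonZero();
>       while (i.hasNext()) {
>         Vector.Element element = i.next();
>         out.printf("%d,%d,%f\n", row.index(), element.index(), element.get());
>       }
>     }
>   }
> }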
>
>
> On Wed, Aug 17, 2011 at 3:27 PM, Jeff Hansen <[email protected]> wrote:
>
> > Does anybody happen to know if there's already a utility out there for
> > dumping a sequence file of vectors to a csv file with
> > vector,element,value?
> >
> > I was hoping to shift some of my results over to R and found a comment by
> > Ted a while back suggesting that the easiest method is to spit out sparse
> > csv triples and load them with
> >
> > sparseMatrix(x=c(1,1,1,1), i=c(1,2,3,3), j=c(1,1,2,1))
> >
> > from the Matrix library.
> >
> > This wouldn't be that complicated to write, but I imagine I'm not the
> > first person to look for it.  If a utility like this doesn't already
> > exist, does anybody think it would be a worthwhile enhancement to add an
> > option onto the VectorDump utility to output to this format?  If so I'd be
> > happy to offer up a patch (although I might want to refactor the
> > VectorHelper class to emit straight out to the writer -- I'm not too fond
> > of generating huge strings)
> >
>
