Yeah, I realized late in the game that I had been thinking of sequence files containing text/vector pairs as if they were matrices. I haven't yet played with the matrix manipulation utilities (well, not successfully), and that misconception may explain why. I suppose I can't just treat a file of named vectors as a matrix -- instead, I imagine I need another utility to convert that vector file into a matrix file (with a dictionary file translating row numbers to vector names).
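
To make that concrete, here is a rough sketch of the conversion I have in mind. It is plain Java with no Mahout classes at all: `NamedRows`, `toMatrix`, and the map-of-arrays input are hypothetical stand-ins for a sequence file of named vectors, not anything that exists in the code base today.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns named vectors into indexed rows plus a row dictionary. */
public class NamedRows {
    /** rows.get(i) is row i of the matrix. */
    public final List<double[]> rows = new ArrayList<>();
    /** names.get(i) is the name of the vector that became row i (the row dictionary). */
    public final List<String> names = new ArrayList<>();

    /** Assigns row numbers in the iteration order of the input map. */
    public static NamedRows toMatrix(Map<String, double[]> namedVectors) {
        NamedRows result = new NamedRows();
        for (Map.Entry<String, double[]> entry : namedVectors.entrySet()) {
            result.names.add(entry.getKey());
            result.rows.add(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so row numbers are predictable.
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("doc-a", new double[] {1.0, 0.0, 2.0});
        docs.put("doc-b", new double[] {0.0, 3.0, 0.0});

        NamedRows matrix = toMatrix(docs);
        // Print the row dictionary as "rowNumber<TAB>vectorName".
        for (int i = 0; i < matrix.names.size(); i++) {
            System.out.println(i + "\t" + matrix.names.get(i));
        }
    }
}
```

The point is only that the names move out of the vectors and into a separate lookup, so everything downstream can work with plain row numbers.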
Right now I'm picturing the files that come out of the seq2sparse tool intended for clustering purposes. You get a dictionary that translates columns to words, but you still have document names as rows. I suppose in the end it would be good to support two dictionary files -- one for rows and one for columns (although if you ever transposed the matrix that could get confusing, so maybe items/features would be a better generic name, or better yet a command line option for explicitly naming the row/column dict files).

So I suppose now I'm proposing two utilities -- one for converting vectors into matrices (assuming one doesn't already exist) and another for spitting out matrices in various formats (dense/sparse, horizontal rows/vertical triples, specifiable delimiter/style).

One thing that would be really nice is a quick reference of the standard input and output file types for all the various utility functions. A lot of them are standard transforms, and it would be nice to know at a glance what they're transforming from and to (as well as what's available).

I'm getting pulled onto another project right now, but I have the house to myself this weekend, so I should be able to work on something then. I just submitted my first JIRA the other day; what's the standard protocol for something like this? Submit the code as an svn patch file attached to a JIRA?

On Wed, Aug 17, 2011 at 6:28 PM, Ted Dunning <[email protected]> wrote:

> Sounds like a good idea in general.
>
> Here is a tiny bit of code to get you rolling. Adding this to the existing
> VectorDumper is better than using a standalone class as I have here. Your
> thought about long strings is also very pertinent.
>
> public class DumpTriples {
>   public static void dump(PrintWriter out, Matrix m) {
>     for (MatrixSlice row : m) {
>       Iterator<Vector.Element> i = row.vector().iterateNonZero();
>       while (i.hasNext()) {
>         Vector.Element element = i.next();
>         out.printf("%d,%d,%f\n", row.index(), element.index(), element.get());
>       }
>     }
>   }
> }
>
> On Wed, Aug 17, 2011 at 3:27 PM, Jeff Hansen <[email protected]> wrote:
>
> > Does anybody happen to know if there's already a utility out there for
> > dumping a sequence file of vectors to a csv file with vector,element,value?
> >
> > I was hoping to shift some of my results over to R and found a comment by
> > Ted a while back suggesting that the easiest method is to spit out sparse
> > csv triples and load them with
> >
> > sparseMatrix(x=c(1,1,1,1), i=c(1,2,3,3), j=c(1,1,2,1))
> >
> > from the Matrix library.
> >
> > This wouldn't be that complicated to write, but I imagine I'm not the first
> > person to look for it. If a utility like this doesn't already exist, does
> > anybody think it would be a worthwhile enhancement to add an option onto the
> > VectorDump utility to output to this format? If so I'd be happy to offer up
> > a patch (although I might want to refactor the VectorHelper class to emit
> > straight out to the writer -- I'm not too fond of generating huge strings)
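
A note on the triple format discussed in this thread, with a sketch to go with it. R's `sparseMatrix` treats `i` and `j` as 1-based by default, so a dump that prints 0-based indices needs either an offset on output or `index1 = FALSE` on the R side. The sketch below is plain Java over a `double[][]` rather than Mahout's `Matrix`, and `TripleWriter`, `write`, `toTriples`, and the `indexBase` parameter are all hypothetical names chosen for illustration. It writes straight to the writer instead of building one large string, and it takes the delimiter as an argument.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Locale;

/** Hypothetical sketch: writes non-zero cells as row,column,value triples. */
public class TripleWriter {

    /**
     * Emits one line per non-zero cell directly to the writer.
     * indexBase is added to every row and column index (1 for R's default, 0 for raw indices).
     */
    public static void write(PrintWriter out, double[][] m, String delimiter, int indexBase) {
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                if (m[r][c] != 0.0) {
                    // Locale.ROOT keeps '.' as the decimal separator, so a ',' delimiter stays unambiguous.
                    out.printf(Locale.ROOT, "%d%s%d%s%f\n",
                            r + indexBase, delimiter, c + indexBase, delimiter, m[r][c]);
                }
            }
        }
    }

    /** Convenience wrapper for small matrices and tests; large dumps should call write directly. */
    public static String toTriples(double[][] m, String delimiter, int indexBase) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        write(out, m, delimiter, indexBase);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        double[][] m = {
            {1.0, 0.0},
            {0.0, 2.5}
        };
        // 1-based, comma-delimited triples suitable for loading into R.
        PrintWriter stdout = new PrintWriter(System.out);
        write(stdout, m, ",", 1);
        stdout.flush();
    }
}
```

Passing the locale explicitly matters more than it looks: with a platform locale that uses a decimal comma, `%f` would otherwise produce values that collide with the comma delimiter.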
