Hi, I'm looking for a good way to populate and serialize a large sparse matrix using Mahout or related libraries. I first considered serializing the Java object directly to a file, but noticed that SparseMatrix is not serializable. I then experimented with writing the matrix out to a sequence file: the matrix has ~3 million rows (sparse, ~20 columns per row on average), and after about 500k rows the sequence file was already taking about 115 GB of disk space. Lucene is another idea, but it has similar disk space demands.
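For a sense of scale, here is a rough sketch of how much space a plain compact row encoding would need. It is only a back-of-the-envelope check under stated assumptions: the layout (an int count followed by int column / double value pairs) is hypothetical and is not the format Mahout or a sequence file actually writes, the class and method names are made up for illustration, and "~20 col" is taken to mean about 20 non-zero entries per row.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Mahout: encodes one sparse row compactly.
class SparseRowSketch {

    // Assumed layout: int count, then (int column, double value) per non-zero entry.
    static void writeRow(DataOutputStream out, int[] cols, double[] vals) throws IOException {
        out.writeInt(cols.length);
        for (int i = 0; i < cols.length; i++) {
            out.writeInt(cols[i]);
            out.writeDouble(vals[i]);
        }
    }

    // Encodes a row into memory and returns the number of bytes it occupies.
    static int encodedSize(int[] cols, double[] vals) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeRow(out, cols, vals);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Size implied by the layout: 4 bytes for the count, 12 bytes per entry.
    static long bytesPerRow(int nonZeros) {
        return 4L + 12L * nonZeros;
    }

    public static void main(String[] args) {
        int[] cols = new int[20];
        double[] vals = new double[20];
        for (int i = 0; i < 20; i++) {
            cols[i] = i * 1000; // arbitrary column indices
            vals[i] = 1.0;      // arbitrary values
        }
        System.out.println("bytes for one 20-entry row: " + encodedSize(cols, vals));
        System.out.println("estimate for 3,000,000 rows: "
                + bytesPerRow(20) * 3_000_000L + " bytes");
    }
}
```

Under these assumptions a 20-entry row takes 244 bytes, so 3 million rows come to roughly 0.7 GB before any per-record overhead. That suggests the 115 GB seen after 500k rows is probably not inherent to the data itself, though the estimate says nothing about where the extra space goes.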
Are there more efficient ways of serializing a matrix to disk? Is there something akin to Python's ndarray (which I have noticed handles populating and serializing quite large sparse matrices well)? DistributedRowMatrix was mentioned to me, but 1) does it suit my use case? Its constructor takes a sequence file as an argument, and generating that file is exactly what I am having trouble with. 2) It has no method for accessing a row at a given index, which I would need. Thanks in advance for any suggestions. Best, Patrice
