Hi,

I'm looking for a good way to populate and serialize a large sparse
matrix using Mahout or related libraries. When I considered writing
the Java object to a file, I noticed that SparseMatrix is not
serializable. I then tried writing the matrix (~3 million rows,
averaging ~20 columns per row, sparse) out to a sequence file, but
after about 500k rows the file was already taking about 115 GB of
disk space. Lucene is another idea, but it has similar demands on
disk space.
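
To make the size concern concrete, here is a minimal sketch in plain
Java of a compact row encoding that stores only the non-zero entries.
It assumes "~20 col" means about 20 non-zero entries per row, with int
column indices and double values. SparseRowCodec and its methods are
hypothetical names for illustration and are not part of Mahout.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper (not a Mahout class): encodes one sparse row as
// [nonZeroCount][colIndex, value][colIndex, value]...
final class SparseRowCodec {

    // Bytes for one encoded row: 4 for the count, then 4 (int index)
    // plus 8 (double value) for each non-zero entry.
    static long encodedSize(int nonZeros) {
        return 4L + 12L * nonZeros;
    }

    // Writes one row to the stream, storing only the non-zero entries.
    static void writeRow(DataOutputStream out, int[] cols, double[] vals)
            throws IOException {
        if (cols.length != vals.length) {
            throw new IllegalArgumentException("cols and vals differ in length");
        }
        out.writeInt(cols.length);
        for (int i = 0; i < cols.length; i++) {
            out.writeInt(cols[i]);
            out.writeDouble(vals[i]);
        }
    }

    // Convenience wrapper that encodes a single row into a byte array.
    static byte[] encodeRow(int[] cols, double[] vals) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeRow(out, cols, vals);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Under those assumptions each row takes 4 + 12 * 20 = 244 bytes, so 3
million rows come to about 732 MB. By comparison, 115 GB over 500k
rows is roughly 230 KB per row, which may indicate that rows are being
written in a dense form, though that is only a guess.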

Are there more efficient ways of serializing a matrix to disk? Is
there something akin to Python's ndarray (which, I have noticed,
handles populating and serializing quite large sparse matrices well)?

DistributedRowMatrix was mentioned to me, but 1) does it suit my use
case? Its constructor takes a sequence file as an argument (the
generation of which is exactly what I am having trouble with), and
2) it has no method for accessing a row by its index, which I would
need.
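
The row-access requirement can be sketched as follows: keep one byte
offset per row next to the encoded data, so that any row can be read
directly. This is a minimal in-memory illustration in plain Java, not
a Mahout API; IndexedSparseRows is a hypothetical name. It assumes int
column indices, double values, and less than 2 GB of data. On disk the
same idea would need long offsets and something like
RandomAccessFile.seek.

```java
import java.nio.ByteBuffer;

// Hypothetical container (not a Mahout class): sparse rows packed into
// one buffer, plus an offset table for direct access to any row.
final class IndexedSparseRows {
    private final ByteBuffer data;
    private final int[] offsets; // byte position where each row starts

    IndexedSparseRows(int[][] cols, double[][] vals) {
        // First pass: record where each row will start.
        offsets = new int[cols.length];
        int total = 0;
        for (int r = 0; r < cols.length; r++) {
            offsets[r] = total;
            total += 4 + 12 * cols[r].length;
        }
        // Second pass: write [count][colIndex, value]... for each row.
        data = ByteBuffer.allocate(total);
        for (int r = 0; r < cols.length; r++) {
            data.putInt(cols[r].length);
            for (int i = 0; i < cols[r].length; i++) {
                data.putInt(cols[r][i]);
                data.putDouble(vals[r][i]);
            }
        }
    }

    // Returns the stored value at (row, col), or 0.0 if it is absent.
    double get(int row, int col) {
        int pos = offsets[row];
        int nonZeros = data.getInt(pos);
        pos += 4;
        for (int i = 0; i < nonZeros; i++, pos += 12) {
            if (data.getInt(pos) == col) {
                return data.getDouble(pos + 4);
            }
        }
        return 0.0;
    }
}
```

The offset table costs one entry per row, so for ~3 million rows it
stays small next to the row data itself.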

Thanks in advance for any suggestions,
Best,
Patrice
