I guess you are talking about the DRM format (sequence file). The currently recommended way is to use Mahout Samsara with e.g. Spark (no MapReduce support there). Translating an in-core matrix (sparse, for example) means converting it to a distributed matrix (DRM) first by means of drmParallelize [1] and then saving it to HDFS by means of dfsWrite [2] (the method name for saving a matrix is a bit outdated in the doc there).
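Roughly, a minimal sketch (untested, assuming the 0.10.x Scala Spark bindings; the local master URL, matrix dimensions, and output path are just placeholders):

import org.apache.mahout.math.SparseRowMatrix
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

// Spark-backed distributed context (local mode here just for illustration)
implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "drm-save")

// Build a small in-core sparse matrix: 3 rows x 4 columns
val a = new SparseRowMatrix(3, 4)
a.setQuick(0, 1, 1.0)
a.setQuick(2, 3, 5.0)

// Convert the in-core matrix to a DRM, then write it to (H)DFS as a sequence file
val drmA = drmParallelize(a, numPartitions = 2)
drmA.dfsWrite("hdfs:///tmp/drm-a")

// It can later be read back with drmDfsRead("hdfs:///tmp/drm-a")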
It does imply a Spark cluster (although you can always run it in local mode, which is just as good as a completely in-core save).

[1] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pfe
[2] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pff

On Thu, Jun 11, 2015 at 1:53 PM, Patrice Seyed <[email protected]> wrote:
> Hi,
>
> I'm looking for a good solution to populate and serialize a large
> sparse matrix using Mahout or related libraries. I noticed
> SparseMatrix is not serializable when I considered serializing this
> Java object to file. In an experiment to serialize out to a sequence
> file, my ~3mil row matrix (avg ~20 col, sparse), after about 500k rows
> the sequence file was taking about 115 GB of space. Lucene is another
> idea but has a similar demand on disk space.
>
> Are there more efficient ways of serializing a matrix to disk? Is
> there something akin to Python's ndarray? (Which I have noticed
> handles population/serialization of quite large sparse matrices well.)
>
> The object DistributedRowMatrix was mentioned to me, but 1) does it
> suit my use case? The constructor takes a sequence file as an argument
> (the generation of which I am having the issue with), and 2) there is
> no method for accessing a row at an index, which I would need.
>
> Thanks in advance for any suggestions,
> Best,
> Patrice
>
