Correction: dfsWrite, not dfwWrite (typo below). A rough sketch of the drmParallelize/dfsWrite flow is at the bottom of this message.

On Thu, Jun 11, 2015 at 3:53 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I guess you are talking about the DRM format (sequence file).
>
> The current recommended way is to use mahout-samsara with e.g. Spark (no
> MapReduce support there). Translating an in-core matrix (sparse, for
> example) would mean converting it to a distributed matrix (DRM) first by
> means of drmParallelize [1] and then saving it to HDFS by means of
> dfwWrite [2] (the doc's method name for saving a matrix is a bit outdated
> there).
>
> It does imply a Spark cluster (although you can always run it in local
> mode, which is just as good as a completely in-core save).
>
> [1] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pfe
> [2] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pff
>
> On Thu, Jun 11, 2015 at 1:53 PM, Patrice Seyed <[email protected]> wrote:
>
>> Hi,
>>
>> I'm looking for a good solution to populate and serialize a large
>> sparse matrix using Mahout or related libraries. I noticed that
>> SparseMatrix is not serializable when I considered serializing the
>> Java object to a file. In an experiment serializing it to a sequence
>> file, my ~3-million-row matrix (avg. ~20 columns, sparse) was already
>> taking about 115 GB of disk space after roughly 500k rows. Lucene is
>> another idea, but it has a similar demand on disk space.
>>
>> Are there more efficient ways of serializing a matrix to disk? Is
>> there something akin to Python's ndarray? (Which, I have noticed,
>> handles populating and serializing quite large sparse matrices well.)
>>
>> DistributedRowMatrix was mentioned to me, but 1) does it suit my use
>> case? The constructor takes a sequence file as an argument (the
>> generation of which is what I'm having trouble with), and 2) there is
>> no method for accessing a row at an index, which I would need.
>>
>> Thanks in advance for any suggestions,
>> Best,
>> Patrice
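
For reference, a rough, untested sketch of the drmParallelize + dfsWrite flow described above. It assumes the standard Samsara imports and a Spark-backed Mahout context; the matrix sizes and the output path are just placeholders, and the exact API is in the Scala Spark Bindings doc linked above:

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

object SaveSparseMatrixSketch {
  def main(args: Array[String]): Unit = {
    // Spark-backed Mahout distributed context; local[*] is fine for an
    // effectively in-core save.
    implicit val ctx = mahoutSparkContext(masterUrl = "local[*]",
                                          appName = "save-sparse-drm")

    // Build (or receive) the in-core sparse matrix.
    val m: Matrix = new SparseMatrix(3000000, 100)
    m(0, 5) = 1.0
    m(1, 42) = 2.5

    // Convert the in-core matrix to a DRM, then write it out as a DRM
    // sequence file on (H)DFS. The path below is a placeholder.
    val drm = drmParallelize(m, numPartitions = 8)
    drm.dfsWrite("hdfs:///path/to/output/drm")

    ctx.close()
  }
}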
