I guess you are talking about the DRM format (sequence file). The currently recommended way is to use Mahout Samsara with e.g. Spark (no MapReduce support there). Translating an in-core matrix (sparse, for example) means converting it to a distributed matrix (DRM) first by means of drmParallelize [1] and then saving it to HDFS by means of dfsWrite [2] (the method name for saving a matrix is a bit outdated in the doc there).
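Roughly, a minimal sketch (untested, assuming the 0.10.x Scala Spark bindings; the local master URL, matrix dimensions, and output path are just placeholders):

import org.apache.mahout.math.SparseRowMatrix
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

// Spark-backed distributed context (local mode here just for illustration)
implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "drm-save")

// Build a small in-core sparse matrix: 3 rows x 4 columns
val a = new SparseRowMatrix(3, 4)
a.setQuick(0, 1, 1.0)
a.setQuick(2, 3, 5.0)

// Convert the in-core matrix to a DRM, then write it to (H)DFS as a sequence file
val drmA = drmParallelize(a, numPartitions = 2)
drmA.dfsWrite("hdfs:///tmp/drm-a")

// It can later be read back with drmDfsRead("hdfs:///tmp/drm-a")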
It does imply a Spark cluster (although you can always run it in local mode, which is just as good as a completely in-core save).

[1] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pfe
[2] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pff

On Thu, Jun 11, 2015 at 1:53 PM, Patrice Seyed <[email protected]> wrote:
> Hi,
>
> I'm looking for a good solution to populate and serialize a large
> sparse matrix using Mahout or related libraries. I noticed
> SparseMatrix is not serializable when I considered serializing this
> Java object to file. In an experiment to serialize out to a sequence
> file, my ~3mil row matrix (avg ~20 col, sparse), after about 500k rows
> the sequence file was taking about 115 GB of space. Lucene is another
> idea but has a similar demand on disk space.
>
> Are there more efficient ways of serializing a matrix to disk? Is
> there something akin to Python's ndarray? (Which I have noticed
> handles population/serialization of quite large sparse matrices well.)
>
> The object DistributedRowMatrix was mentioned to me, but 1) does it
> suit my use case? The constructor takes a sequence file as an argument
> (the generation of which I am having the issue with), and 2) there is
> no method for accessing a row at an index, which I would need.
>
> Thanks in advance for any suggestions,
> Best,
> Patrice
>
