correction: dfsWrite (typo)
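
For reference, here is a minimal sketch of the drmParallelize/dfsWrite workflow described below. It assumes the Spark bindings; the matrix dimensions, partition count, and output path are placeholders, not values from the original thread:

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Spark-backed Mahout context; local mode is fine for an in-core save
implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "sparse-save")

// in-core sparse matrix (placeholder dimensions)
val a = new SparseRowMatrix(1000, 100)
a.setQuick(0, 5, 1.0) // ... populate the nonzero cells ...

// wrap it as a DRM and save it to (H)DFS as a DRM sequence file
val drmA = drmParallelize(a, numPartitions = 8)
drmA.dfsWrite("hdfs:///tmp/sparse-matrix")

ctx.close()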

On Thu, Jun 11, 2015 at 3:53 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I guess you are talking DRM format (sequence file).
>
> The currently recommended way is to use mahout-samsara with e.g. Spark (no
> mapreduce support there). Translating an in-core matrix (a sparse one, for
> example) means converting it to a distributed matrix (DRM) first by
> means of drmParallelize [1] and then saving it to hdfs via dfwWrite
> [2] (the doc's method name for saving a matrix is a bit outdated there).
>
> It does imply a Spark cluster (although you can always run it in local mode,
> which is just as good as a completely in-core save).
>
> [1] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pfe
> [2] http://apache.github.io/mahout/doc/ScalaSparkBindings.html#pff
>
> On Thu, Jun 11, 2015 at 1:53 PM, Patrice Seyed <[email protected]> wrote:
>
>> Hi,
>>
>> I'm looking for a good solution to populate and serialize a large
>> sparse matrix using Mahout or related libraries. I noticed
>> SparseMatrix is not serializable when I considered serializing this
>> Java object to a file. In an experiment serializing out to a sequence
>> file, my ~3mil row matrix (avg ~20 col, sparse) was already taking
>> about 115 GB of space after about 500k rows. Lucene is another idea
>> but has a similar demand on disk space.
>>
>> Are there more efficient ways of serializing a matrix to disk? Is
>> there something akin to Python's ndarray? (Which I have noticed
>> handles population/serialization of quite large sparse matrices well.)
>>
>> The DistributedRowMatrix class was mentioned to me, but 1) does it
>> suit my use case? The constructor takes a sequence file as an argument
>> (the generation of which is exactly what I am having trouble with),
>> and 2) there is no method for accessing a row at an index, which I
>> would need.
>>
>> Thanks in advance for any suggestions,
>> Best,
>> Patrice
>>
>
>
