You may need a cluster with more memory. The current ALS
implementation constructs all subproblems in memory: each user and
each product gets a rank x rank normal-equation matrix, of which only
the upper triangle is stored. With rank = 10, that means
(6.5M + 2.5M) * 10^2 / 2 * 8 bytes ≈ 3.5 GB. The ratings themselves
need about 2 GB, not counting JVM overhead. ALS also creates in/out
blocks to optimize the computation, and these take about twice as
much space as the original dataset. Note that on a single machine
this optimization becomes pure "overhead". All of these factors
contribute to the OOM error.
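
For reference, here is the back-of-the-envelope arithmetic behind
those two numbers as a small Scala snippet (a sketch; the 16 bytes
per rating -- two Ints plus a Double, ignoring JVM object overhead --
is my assumption):

    // One rank x rank normal-equation matrix per user/product,
    // upper triangle only, 8 bytes per double.
    val numUsers    = 6.5e6
    val numProducts = 2.5e6
    val rank        = 10
    val subproblemBytes = (numUsers + numProducts) * rank * rank / 2 * 8
    println(f"subproblems: ${subproblemBytes / 1e9}%.1f GB") // prints 3.6 -- the ~3.5 GB above

    // Each rating is roughly (user: Int, product: Int, rating: Double)
    // = 16 bytes of payload, before any object overhead.
    val ratingBytes = 120e6 * 16
    println(f"ratings: ${ratingBytes / 1e9}%.1f GB") // ~1.9 GB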

You can try StorageLevel.DISK_ONLY with spark.rdd.compress set to
true. In Spark 1.1, we added an option to set the storage level of
the in/out blocks (ALS.setIntermediateRDDStorageLevel), which you can
use to keep those blocks on disk. That being said, I still recommend
running this dataset on a cluster with more memory.
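
In code, the configuration I mean looks roughly like this (a minimal
sketch; the input path, the CSV parsing, and the rank/iteration
values are placeholders, not part of your setup):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.storage.StorageLevel

    // Compress RDD partitions when they are serialized to disk.
    val conf = new SparkConf()
      .setAppName("ALSDiskOnly")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    // Hypothetical input: one "user,product,rating" triple per line.
    val ratings = sc.textFile("hdfs:///path/to/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val model = new ALS()
      .setRank(10)
      .setIterations(10)
      // Spark 1.1+: keep the intermediate in/out blocks on disk.
      .setIntermediateRDDStorageLevel(StorageLevel.DISK_ONLY)
      .run(ratings)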

Best,
Xiangrui

On Tue, Sep 30, 2014 at 10:44 AM, Alex T <chiorts...@gmail.com> wrote:
> Hi,
> I'm trying to use matrix factorization on a dataset with about 6.5M users,
> 2.5M products, and 120M ratings over those products. The test is run in
> standalone mode with a single worker (quad-core, 16 GB RAM).
>
> The program runs out of memory, and I think this happens because flatMap
> holds data in memory.
> (I tried the MovieLens dataset, which has 65k users, 11k movies, and 100M
> ratings, and the test completes without any problem.)
>
> Is there any way to make ALS keep the data on disk instead of in memory?
>
> While I was trying the MovieLens dataset, I noticed that after all the jobs
> finished, the program still held some residual RDDs in memory. Why is that?
>
> And a last, more general question: why, when I persist an RDD with
> StorageLevel.DISK_ONLY, does the Unix system monitor show Apache Spark using
> the same amount of RAM as when I persist it in memory?
>
> Thanks in advance. I hope this is understandable, since English is not my
> first language.
