You may need a cluster with more memory. The current ALS implementation constructs all subproblems in memory. With rank = 10, that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes ≈ 3.5 GB. The ratings themselves need about 2 GB, not counting overhead. ALS also creates in/out blocks to optimize the computation, which take about twice as much space as the original dataset. Note that on a single machine this optimization becomes pure overhead. All of these factors contribute to the OOM error.
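For concreteness, here is a back-of-envelope check of the figures above (plain Scala; the object and method names are just for illustration, the numbers are from this thread):

```scala
// Each ALS subproblem stores a rank x rank normal-equation block per
// user/item; only the upper triangle is kept, so roughly
// rank^2 / 2 doubles (8 bytes each) per user/item.
object AlsMemoryEstimate {
  def subproblemBytes(numUsers: Double, numItems: Double, rank: Int): Double =
    (numUsers + numItems) * rank * rank / 2 * 8

  def main(args: Array[String]): Unit = {
    // 6.5M users + 2.5M products, rank 10, as in this thread
    val bytes = subproblemBytes(6.5e6, 2.5e6, rank = 10)
    println(f"subproblems: ${bytes / 1e9}%.1f GB") // ~3.6e9 bytes, i.e. the ~3.5 GB above
  }
}
```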
You can try DISK_ONLY with spark.rdd.compress set to true. In Spark 1.1, we added an option to set the storage level for the in/out blocks (ALS.setIntermediateRDDStorageLevel), which you can use to store them on disk. That said, I still recommend running this dataset on a cluster with more memory.

Best,
Xiangrui

On Tue, Sep 30, 2014 at 10:44 AM, Alex T <chiorts...@gmail.com> wrote:
> Hi,
> I'm trying to use matrix factorization over a dataset with about 6.5M users,
> 2.5M products, and 120M ratings over products. The test is done in
> standalone mode, with a single worker (quad-core, 16 GB RAM).
>
> The program runs out of memory, and I think this happens because flatMap
> holds data in memory.
> (I tried the MovieLens dataset, which has 65k users, 11k movies, and 100M
> ratings, and the test completes without any problem.)
>
> Is there any way to make ALS hold the data on disk instead of in memory?
>
> When I was trying the MovieLens dataset, I noticed that after all the jobs
> finished, the program held some residual RDDs in memory. Why is that?
>
> And a last (general) question: why, when I persist an RDD with
> StorageLevel.DISK_ONLY, does the Unix system monitor show that Apache Spark
> uses the same amount of RAM as when I persist it in memory?
>
> Thanks in advance. I hope that is understandable, since English is not my
> main language.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-question-tp15420.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------
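[For readers finding this thread later: a minimal sketch of the setup Xiangrui describes, using the Spark 1.1 mllib API. The application name, input path, and parameter values are placeholders, not from the thread; it needs a Spark deployment to run.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// Compress serialized/disk-persisted blocks, as suggested above.
val conf = new SparkConf()
  .setAppName("ALSOnDisk") // placeholder name
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

// Parse "user,product,rating" lines; the path is a placeholder.
val ratings = sc.textFile("hdfs://.../ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val model = new ALS()
  .setRank(10)
  .setIterations(10)
  // Spark 1.1+: spill the intermediate in/out blocks to disk
  // instead of keeping them in memory.
  .setIntermediateRDDStorageLevel(StorageLevel.DISK_ONLY)
  .run(ratings)
```

Note that DISK_ONLY trades memory pressure for I/O; as the reply says, a cluster with more memory remains the better fix.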