Hi, I'm trying to use matrix factorization (MLlib ALS) on a dataset with about 6.5M users, 2.5M products, and 120M ratings of products. The test is done in standalone mode with a single worker (quad-core, 16 GB RAM).
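For context, here is roughly how I'm running ALS (a minimal sketch; the input path and the rank/iterations/lambda values are placeholders, not my exact settings):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val sc = new SparkContext(new SparkConf().setAppName("ALSTest"))

    // Each input line: userId,productId,rating (the path is a placeholder)
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // rank = 10, iterations = 10, lambda = 0.01 (placeholder values)
    val model = ALS.train(ratings, 10, 10, 0.01)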
The program runs out of memory, and I think this happens because flatMap holds its data in memory. (I tried the MovieLens dataset, which has 65k users, 11k movies, and 100M ratings, and that test completes without any problem.) Is there any way to make ALS keep its data on disk instead of in memory?

While testing with the MovieLens dataset, I also noticed that after all the jobs had finished, the program still held some residual RDDs in memory. Why is that?

And one last (general) question: when I persist an RDD with StorageLevel.DISK_ONLY, why does the Unix system monitor show Apache Spark using the same amount of RAM as when I persist it in memory? (See the sketch at the end of this message for what I mean.)

Thanks in advance. I hope this is understandable, since English is not my first language.
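P.S. To make the last question concrete, this is the kind of persist call I mean (a sketch, reusing the ratings RDD from the snippet above):

    import org.apache.spark.storage.StorageLevel

    // Persist partitions to local disk only; I expected this to keep
    // the cached data out of RAM, but the system monitor suggests otherwise.
    ratings.persist(StorageLevel.DISK_ONLY)
    ratings.count()  // an action, to force materialization of the cache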