cache() is lazy. The data is stored into memory after the first time it gets materialized. So the first time you call `predict` after you load the model back from HDFS, it still takes time to load the actual data. The second time will be much faster. Or you can call `userJavaRDD.count()` and `productJavaRDD.count()` explicitly to load both into memory before you create the model. -Xiangrui
On Sun, Mar 8, 2015 at 9:43 AM, Yuichiro Sakamoto <ks...@muc.biglobe.ne.jp> wrote: > Hello. > > I create program, collaborative filtering using Spark, > but I have trouble with calculating speed. > > I want to implement recommendation program using ALS (MLlib), > which is another process from Spark. > But access speed of MatrixFactorizationModel object on HDFS is slow, > so I want to cache it, but I can't. > > There are 2 processes: > > process A: > > 1. Create MatrixFactorizationModel by ALS > > 2. Save following objects to HDFS > - MatrixFactorizationModel (on RDD) > - MatrixFactorizationModel#userFeatures(RDD) > - MatrixFactorizationModel#productFeatures(RDD) > > process B: > > 1. Load model information saved by process A. > # In process B, Master of SparkContext is set to "local" > ========== > // Read Model > JavaRDD<MatrixFactorizationModel> modelRDD = > sparkContext.objectFile("<HDFS path>"); > MatrixFactorizationModel preModel = modelData.first(); > // Read Model's RDD > JavaRDD<Tuple2<Object, double[]>> productJavaRDD = > sparkContext.objectFile("<HDFS path>"); > JavaRDD<Tuple2<Object, double[]>> userJavaRDD = > sparkContext.objectFile("<HDFS path>"); > // Create Model > MatrixFactorizationModel model = new > MatrixFactorizationModel(preModel.rank(), > JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD)); > ========== > > 2. Call "predict" method of above MatrixFactorizationModel object. > > > At number 2 of process B, it is slow speed because objects are read from > HDFS every time. > # I confirmed that the result of recommendation is correct. > > So, I tried to cache "productJavaRDD" and "userJavaRDD" as following, > but there was no response from "predict" method. > ========== > // Read Model > JavaRDD<MatrixFactorizationModel> modelRDD = sparkContext.objectFile("<HDFS > path>"); > MatrixFactorizationModel preModel = modelData.first(); > // Read Model's RDD > JavaRDD<Tuple2<Object, double[]>> productJavaRDD = > sparkContext.objectFile("<HDFS path>"); > JavaRDD<Tuple2<Object, double[]>> userJavaRDD = > sparkContext.objectFile("<HDFS path>"); > // Cache > productJavaRDD.cache(); > userJavaRDD.cache(); > // Create Model > MatrixFactorizationModel model = new > MatrixFactorizationModel(preModel.rank(), > JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD)); > ========== > > I could not understand why "predict" method was frozen. > Could you please help me how to cache object ? > > Thank you. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org