Re: What is the most efficient and scalable way to get all the recommendation results from an ALS model?
Could anyone give me some advice on the usual ways to do this? I am trying to get all (probably the top 100) product recommendations for each user from a model (MatrixFactorizationModel), but I haven't yet figured out how to do it efficiently.

So far, calling the predict method (predictAll in PySpark) with the full user-product matrix uses too much memory and fails to complete, while calling predict per user (or per batch of about 100 users) takes too much time to get all the recommendations.

I am using Spark 1.4.1 on a 5-node cluster with 8 GB RAM each. I have only used a small data set so far: about 5 users and 5000 products with only about 10 ratings.

Thanks.

On Sat, Mar 19, 2016 at 7:58 PM, Hiroyuki Yamada wrote:
> Hi,
>
> I'm testing Collaborative Filtering with MLlib.
> Making a model with ALS.trainImplicit (or train) seems scalable as far as I have tested,
> but I'm wondering how I can get all the recommendation results efficiently.
>
> The predictAll method can get all the results,
> but it needs the whole user-product matrix in memory as input.
> So if there are 1 million users and 1 million products, the number of
> elements is too large (1 million x 1 million),
> and the amount of memory needed to hold them is about 4 TB even when each
> element is only 4 B,
> which is not a realistic amount of memory even now.
>
> # (1,000,000 * 1,000,000) * 4 / 1000 / 1000 / 1000 / 1000 => approximately 4 TB
>
> We can, of course, use the predict method per user,
> but, as far as I tried, it is very slow to get 1 million users' results.
>
> Am I missing something?
> Are there any better ways to get all the recommendation results in a
> scalable and efficient way?
>
> Best regards,
> Hiro
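[Editor's note] One common workaround is to score users against products in blocks using the learned factor vectors (model.userFeatures / model.productFeatures), keeping only the top N per user so the full user x product matrix is never materialized. The sketch below is plain Python with toy factor vectors, not the actual MLlib API; all names and values are illustrative.

```python
# Sketch: compute top-N recommendations from ALS factor vectors,
# keeping only the N best scores per user (O(N) memory per user,
# not O(#products)). Toy data stands in for the model's factors.
import heapq

user_factors = {            # user_id -> latent factor vector
    1: [0.9, 0.1],
    2: [0.2, 0.8],
}
product_factors = {         # product_id -> latent factor vector
    10: [1.0, 0.0],
    20: [0.0, 1.0],
    30: [0.5, 0.5],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_n(user_vec, n):
    # Score every product for one user, keep only the n best.
    scores = ((dot(user_vec, pv), pid) for pid, pv in product_factors.items())
    return [pid for _, pid in heapq.nlargest(n, scores)]

recs = {uid: top_n(uv, 2) for uid, uv in user_factors.items()}
```

In Spark the same idea would be to broadcast the (smaller) product factor matrix and map over partitions of user factors; with a BLAS-backed library, each user block against the product matrix becomes a single matrix multiply.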
What is the most efficient and scalable way to get all the recommendation results from an ALS model?
Hi,

I'm testing Collaborative Filtering with MLlib. Making a model with ALS.trainImplicit (or train) seems scalable as far as I have tested, but I'm wondering how I can get all the recommendation results efficiently.

The predictAll method can get all the results, but it needs the whole user-product matrix in memory as input. So if there are 1 million users and 1 million products, the number of elements is too large (1 million x 1 million), and the amount of memory needed to hold them is about 4 TB even when each element is only 4 B, which is not a realistic amount of memory even now.

# (1,000,000 * 1,000,000) * 4 / 1000 / 1000 / 1000 / 1000 => approximately 4 TB

We can, of course, use the predict method per user, but, as far as I tried, it is very slow to get 1 million users' results.

Am I missing something? Are there any better ways to get all the recommendation results in a scalable and efficient way?

Best regards,
Hiro
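[Editor's note] The back-of-the-envelope memory estimate above checks out directly:

```python
# Memory needed to hold the full user x product prediction matrix
# at 4 bytes per element, for 1 million users and 1 million products.
users = 1_000_000
products = 1_000_000
bytes_per_element = 4

total_bytes = users * products * bytes_per_element
total_tb = total_bytes / 1000**4   # decimal terabytes

print(total_tb)  # 4.0
```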
spark-submit with cluster deploy mode fails with ClassNotFoundException (jars are not passed around properly?)
Hi,

I am trying to use spark-submit with cluster deploy mode on a single node, but I keep getting a ClassNotFoundException as shown below (in this case, snakeyaml.jar is not found by the Spark cluster).

===
16/03/12 14:19:12 INFO Remoting: Starting remoting
16/03/12 14:19:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://Driver@192.168.1.2:52993]
16/03/12 14:19:12 INFO util.Utils: Successfully started service 'Driver' on port 52993.
16/03/12 14:19:12 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@192.168.1.2:52985/user/Worker
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
        at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/yaml/snakeyaml/Yaml
        at com.analytics.config.YamlConfigLoader.loadConfig(YamlConfigLoader.java:30)
        at com.analytics.api.DeclarativeAnalyticsFactory.create(DeclarativeAnalyticsFactory.java:21)
        at com.analytics.program.QueryExecutor.main(QueryExecutor.java:12)
        ... 6 more
Caused by: java.lang.ClassNotFoundException: org.yaml.snakeyaml.Yaml
        at java.lang.ClassLoader.findClass(ClassLoader.java:530)
        at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
        at org.apache.spark.util.ChildFirstURLClassLoader.liftedTree1$1(MutableURLClassLoader.scala:75)
        at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(MutableURLClassLoader.scala:71)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ...
9 more
16/03/12 14:19:12 INFO util.Utils: Shutdown hook called
===

I can submit a job successfully in client mode, but not in cluster mode, so it seems to be a matter of the jars (snakeyaml) not being passed to the cluster properly.

The actual command I tried is:

$ spark-submit --master spark://192.168.1.2:6066 --deploy-mode cluster --jars all-the-jars(with comma separated) --class com.analytics.program.QueryExecutor analytics.jar

(Of course, snakeyaml.jar is specified after --jars.)

I tried spark.executor.extraClassPath and spark.driver.extraClassPath in spark-defaults.conf to specify snakeyaml.jar, but neither of those worked.

I also found a couple of similar issues posted on the mailing list and other sites, but they were either not answered properly or the suggestions did not work for me.
< https://mail-archives.apache.org/mod_mbox/spark-user/201505.mbox/%3CCAGSyEuApEkfO_2-iiiuyS2eeg=w_jkf83vcceguns4douod...@mail.gmail.com%3E >
< http://stackoverflow.com/questions/34272426/how-to-give-dependent-jars-to-spark-submit-in-cluster-mode >
< https://support.datastax.com/hc/en-us/articles/207442243-Spark-submit-fails-with-class-not-found-when-deploying-in-cluster-mode >

Could anyone help me with this?

Best regards,
Hiro
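[Editor's note] Two approaches commonly suggested for this class of failure are sketched below. The jar names and paths are hypothetical; in cluster deploy mode the driver runs on a worker node, so any path given to --jars must be resolvable from that node, not just from the machine running spark-submit.

```shell
# Sketch only; jar names and paths are illustrative.

# 1) Bundle all dependencies (including snakeyaml) into one assembly
#    ("uber") jar, e.g. with sbt-assembly or the Maven shade plugin,
#    so nothing has to be shipped separately:
spark-submit --master spark://192.168.1.2:6066 \
  --deploy-mode cluster \
  --class com.analytics.program.QueryExecutor \
  analytics-assembly.jar

# 2) Keep --jars, but point it at a location reachable from every
#    node (e.g. HDFS or a shared filesystem), since the driver may
#    be launched on any worker:
spark-submit --master spark://192.168.1.2:6066 \
  --deploy-mode cluster \
  --jars hdfs:///libs/snakeyaml.jar \
  --class com.analytics.program.QueryExecutor \
  analytics.jar
```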
Re: which is a more appropriate form of ratings?
Thanks very much, Nick and Sabarish. That helps me a lot.

Regards,
Hiro

On Thu, Feb 25, 2016 at 8:52 PM, Nick Pentreath wrote:
> Yes, ALS requires the aggregated version (A). You can use decimal or whole
> numbers for the rating, depending on your application; for implicit data
> they are not "ratings" but rather "weights".
>
> A common approach is to apply different weightings to different user
> events (such as 1.0 for a page view, 5.0 for a purchase, 2.0 for a like,
> etc.). That allows all user event data to be aggregated together in a fairly
> principled manner. The weights, however, need to be specified upfront in
> order to do that aggregation (they could be selected via cross-validation,
> domain knowledge, or the relative frequency of each event within a dataset,
> for example).
>
> On Thu, 25 Feb 2016 at 13:26 Sabarish Sasidharan wrote:
>> I believe the ALS algo expects the ratings to be aggregated (A). I don't
>> see why you have to use decimals for the rating.
>>
>> Regards,
>> Sab
>>
>> On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote:
>>> Hello.
>>>
>>> I just started working on CF in MLlib.
>>> I am using trainImplicit because I only have implicit ratings like page views.
>>>
>>> I am wondering which is the more appropriate form of ratings.
>>> Let's assume that the view count is regarded as a rating, and
>>> user 1 sees page 1 three times, sees page 2 twice, and so on.
>>>
>>> In this case, I think the ratings can be formatted like the following two
>>> cases. (Of course, it is actually an RDD.)
>>>
>>> A:
>>> user_id,page_id,rating(page view)
>>> 1,1,0.3
>>> 1,2,0.2
>>> ...
>>>
>>> B:
>>> user_id,page_id,rating(page view)
>>> 1,1,0.1
>>> 1,1,0.1
>>> 1,1,0.1
>>> 1,2,0.1
>>> 1,2,0.1
>>> ...
>>>
>>> Is it allowed to be like B?
>>> If it is, which is better? (Is there any difference between them?)
>>>
>>> Best,
>>> Hiro
which is a more appropriate form of ratings?
Hello.

I just started working on CF in MLlib. I am using trainImplicit because I only have implicit ratings like page views.

I am wondering which is the more appropriate form of ratings. Let's assume that the view count is regarded as a rating, and user 1 sees page 1 three times, sees page 2 twice, and so on.

In this case, I think the ratings can be formatted like the following two cases. (Of course, it is actually an RDD.)

A:
user_id,page_id,rating(page view)
1,1,0.3
1,2,0.2
...

B:
user_id,page_id,rating(page view)
1,1,0.1
1,1,0.1
1,1,0.1
1,2,0.1
1,2,0.1
...

Is it allowed to be like B? If it is, which is better? (Is there any difference between them?)

Best,
Hiro
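[Editor's note] Collapsing per-event rows (form B) into one aggregated rating per (user, page) pair (form A) is a simple sum per key. A plain-Python sketch with the example weights from the question:

```python
# Sketch: aggregate per-event rows (form B) into one rating per
# (user_id, page_id) pair (form A) by summing the event weights.
from collections import defaultdict

# (user_id, page_id, weight) -- form B, one row per event
events = [
    (1, 1, 0.1),
    (1, 1, 0.1),
    (1, 1, 0.1),
    (1, 2, 0.1),
    (1, 2, 0.1),
]

aggregated = defaultdict(float)
for user, page, weight in events:
    aggregated[(user, page)] += weight

# round() only to tidy floating-point summation noise
ratings = [(u, p, round(w, 10)) for (u, p), w in sorted(aggregated.items())]
# -> [(1, 1, 0.3), (1, 2, 0.2)]   (form A)
```

In Spark the same aggregation would be a reduceByKey (or a DataFrame groupBy/sum) keyed on the (user, page) pair before feeding Rating objects to ALS.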
Re: What is the point of the alpha value in Collaborative Filtering in MLlib?
Hello Sean,

Thank you very much for the quick response. That helps me a lot to understand it better!

Best regards,
Hiro

On Thu, Feb 25, 2016 at 6:59 PM, Sean Owen wrote:
> This isn't specific to Spark; it's from the original paper.
>
> alpha doesn't do a whole lot, and it is a global hyperparam. It
> controls the relative weight of observed versus unobserved
> user-product interactions in the factorization. Higher alpha means
> it's much more important to faithfully reproduce the interactions that
> *did* happen as a "1" than to reproduce the interactions that *didn't*
> happen as a "0".
>
> I don't think there's a good rule of thumb about what value to pick;
> it can't be less than 0 (less than 1 doesn't make much sense either),
> and you might just try values between 1 and 100 to see what gives the
> best result.
>
> I think that generally sparser input needs higher alpha, and maybe
> someone will tell me that really alpha should be a function of the
> sparsity, but I've never seen that done.
>
> On Thu, Feb 25, 2016 at 6:33 AM, Hiroyuki Yamada wrote:
>> Hi, I've been doing a POC for CF in MLlib.
>> In my environment, ratings are all implicit, so I am trying to use the
>> trainImplicit method (in Python).
>>
>> The trainImplicit method takes alpha as one of its arguments to specify a
>> confidence for the ratings, as described in
>> <http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html>,
>> but the alpha value is global for all the ratings, so I am not sure why we
>> need it. (If it were per rating, it would make sense to me, though.)
>>
>> What is the difference in setting different alpha values for exactly the
>> same data set?
>>
>> I would appreciate it if someone could give me a reasonable explanation for
>> this.
>>
>> Best regards,
>> Hiro
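[Editor's note] The behavior Sean describes comes from the confidence weighting in the implicit-feedback ALS formulation (Hu, Koren, and Volinsky), where each raw rating r is mapped to a confidence c = 1 + alpha * r. A quick numeric sketch (the rating values are illustrative):

```python
# Confidence weighting in implicit ALS: c = 1 + alpha * r.
# Unobserved entries keep r = 0, so their confidence stays at 1
# regardless of alpha -- which is why a larger (global) alpha
# pushes the factorization to fit observed interactions harder
# relative to the unobserved zeros.

def confidence(r, alpha):
    return 1.0 + alpha * r

observed_r = 3.0      # e.g. three page views
unobserved_r = 0.0    # no interaction

low = confidence(observed_r, alpha=1.0)      # 4.0
high = confidence(observed_r, alpha=40.0)    # 121.0
base = confidence(unobserved_r, alpha=40.0)  # 1.0 (unchanged by alpha)
```

This also shows why alpha being global still matters: it rescales the gap between every observed entry and the constant confidence of the zeros, even on the same data set.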
What is the point of the alpha value in Collaborative Filtering in MLlib?
Hi,

I've been doing a POC for CF in MLlib. In my environment, ratings are all implicit, so I am trying to use the trainImplicit method (in Python).

The trainImplicit method takes alpha as one of its arguments to specify a confidence for the ratings, as described in <http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html>, but the alpha value is global for all the ratings, so I am not sure why we need it. (If it were per rating, it would make sense to me, though.)

What is the difference in setting different alpha values for exactly the same data set?

I would appreciate it if someone could give me a reasonable explanation for this.

Best regards,
Hiro