Yes, ALS requires the aggregated version (A), with one row per (user, page) pair. The values can be decimals or whole numbers, depending on your application: for implicit data they are not "ratings" but rather "weights".
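
To illustrate turning form (B) into form (A), here is a minimal sketch in plain Python rather than Spark. The helper name `aggregate_views` and the per-view weight of 1.0 are assumptions made for the illustration (the original example used 0.1 per view). On an RDD, the same summing could presumably be done by keying on (user_id, page_id) and calling `reduceByKey`.

```python
from collections import defaultdict

def aggregate_views(rows):
    """Collapse repeated (user_id, page_id, weight) rows into one row per pair by summing."""
    totals = defaultdict(float)
    for user_id, page_id, weight in rows:
        totals[(user_id, page_id)] += weight
    # Sort only to make the output order deterministic for the example.
    return sorted((u, p, w) for (u, p), w in totals.items())

# Form (B): one row per page view (hypothetical weight of 1.0 per view).
form_b = [(1, 1, 1.0), (1, 1, 1.0), (1, 1, 1.0), (1, 2, 1.0), (1, 2, 1.0)]

# Form (A): one row per (user, page) pair, with the weights summed.
form_a = aggregate_views(form_b)
print(form_a)
```

The resulting triples have one entry per (user, page) pair, which is the shape described as (A).
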
A common approach is to apply different weightings to different user events (such as 1.0 for a page view, 5.0 for a purchase, 2.0 for a like, etc.). That allows all user event data to be aggregated together in a fairly principled manner. The weights, however, need to be specified upfront in order to do that aggregation (they could be selected via cross-validation, domain knowledge, or the relative frequency of each event within a dataset, for example).

On Thu, 25 Feb 2016 at 13:26 Sabarish Sasidharan <sabarish....@gmail.com> wrote:

> I believe the ALS algo expects the ratings to be aggregated (A). I don't
> see why you have to use decimals for rating.
>
> Regards
> Sab
>
> On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada <mogwa...@gmail.com>
> wrote:
>
>> Hello.
>>
>> I just started working on CF in MLlib.
>> I am using trainImplicit because I only have implicit ratings like page
>> views.
>>
>> I am wondering which is a more appropriate form of ratings.
>> Let's assume that view count is regarded as a rating and
>> user 1 sees page 1 3 times and sees page 2 twice and so on.
>>
>> In this case, I think ratings can be formatted like the following 2
>> cases. (of course it is a RDD actually)
>>
>> A:
>> user_id,page_id,rating(page view)
>> 1,1,0.3
>> 1,2,0.2
>> ...
>>
>> B:
>> user_id,page_id,rating(page view)
>> 1,1,0.1
>> 1,1,0.1
>> 1,1,0.1
>> 1,2,0.1
>> 1,2,0.1
>> ...
>>
>> It is allowed to have like B ?
>> If it is, which is better ? ( is there any difference between them ?)
>>
>> Best,
>> Hiro
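
For the event-weighting approach described in the reply above, a minimal sketch in plain Python might look like the following. The `EVENT_WEIGHTS` table reuses the example figures from the message (1.0 for a view, 2.0 for a like, 5.0 for a purchase). The event names, the `aggregate_events` helper, and the sample data are hypothetical. In practice the weights would be chosen by cross-validation or domain knowledge, as noted above.

```python
from collections import defaultdict

# Example weights from the message; the event names are hypothetical.
EVENT_WEIGHTS = {"view": 1.0, "like": 2.0, "purchase": 5.0}

def aggregate_events(events, weights=EVENT_WEIGHTS):
    """Sum per-event weights into a single weight per (user_id, item_id) pair."""
    totals = defaultdict(float)
    for user_id, item_id, event_type in events:
        totals[(user_id, item_id)] += weights[event_type]
    # Sort only to make the output order deterministic for the example.
    return sorted((u, i, w) for (u, i), w in totals.items())

# Hypothetical raw event log: (user_id, item_id, event_type).
events = [
    (1, 10, "view"),
    (1, 10, "view"),
    (1, 10, "purchase"),
    (1, 20, "like"),
    (2, 10, "view"),
]

weighted = aggregate_events(events)
print(weighted)
```

Each output triple carries one aggregated weight per (user, item) pair, which is the form that would then be handed to trainImplicit as the "rating".
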