My broad advice is to use a non-distributed recommender until it's clear your data won't fit in memory.
How many user-item associations do you have? My rule of thumb to date is that 100M or fewer fit and run OK on a reasonably modern server machine (i.e. 4GB or so of heap). You can go bigger with a big machine. But if you're at 100M now, you may be at 200M and then 500M soon, and you don't want to get stuck. However, just because you have 500M data points doesn't mean you need to process 500M data points. Sampling, removing noise, removing old data, etc. can let you deal with more data in practice by actually processing less. So, again, I'd say you can think non-distributed first.

I think your data model has preference values: you have ratings, and you have other behavior that you will translate into some implicit rating. So you are not looking at "boolean prefs", as the framework calls it.

If you want to do content-based recommendation, what you want to do is create an ItemSimilarity implementation that returns whatever notion of similarity you like. The framework doesn't help you with this question at all, but it provides a neat place to plug in your black-box similarity metric. Then, sure, you can write another ItemSimilarity implementation which blends, or delegates to, other ItemSimilarity metrics -- yours, and perhaps some standard rating-based metric like PearsonCorrelationSimilarity. That's another bit of code you'd have to write, to your taste, but it is quite simple. It's also a fine way of dealing with the initial sparseness -- the content-based similarity is always there, even when there aren't enough ratings for collaborative filtering-based similarity metrics to do anything.

Put that together with GenericItemBasedRecommender and a DataModel implementation, and that's your recommender. Where are your ratings? Database, file, other? That will inform how you pick the DataModel. It's probably FileDataModel, or else MySQLJDBCDataModel + ReloadFromJDBCDataModel to cache it in memory.

Real-time recommendations shouldn't be a big deal.
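To make the blend-and-fall-back idea concrete, here's a minimal sketch. It uses a simplified stand-in interface rather than Mahout's real org.apache.mahout.cf.taste.similarity.ItemSimilarity (which has more methods and whose rating-based implementations take a DataModel); the 0.7 weight and the stub metrics are purely illustrative:

```java
// Stand-in for Mahout's ItemSimilarity: one method returning a
// similarity in [-1, 1], or NaN if it can't be computed.
interface SimpleItemSimilarity {
    double itemSimilarity(long itemID1, long itemID2);
}

// Blends a content-based metric with a rating-based one, falling back
// to content alone where the rating-based metric has too little data.
class BlendingItemSimilarity implements SimpleItemSimilarity {
    private final SimpleItemSimilarity content;  // your black-box content metric
    private final SimpleItemSimilarity rating;   // e.g. a Pearson-style metric
    private final double ratingWeight;           // how much to trust rating data

    BlendingItemSimilarity(SimpleItemSimilarity content,
                           SimpleItemSimilarity rating,
                           double ratingWeight) {
        this.content = content;
        this.rating = rating;
        this.ratingWeight = ratingWeight;
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) {
        double c = content.itemSimilarity(itemID1, itemID2);
        double r = rating.itemSimilarity(itemID1, itemID2);
        // Rating-based metrics return NaN when items share too few ratings;
        // the content-based similarity is always available, so use it alone.
        if (Double.isNaN(r)) {
            return c;
        }
        return ratingWeight * r + (1.0 - ratingWeight) * c;
    }
}

public class BlendDemo {
    public static void main(String[] args) {
        SimpleItemSimilarity content = (a, b) -> 0.5;             // stub content metric
        SimpleItemSimilarity sparseRating = (a, b) -> Double.NaN; // too few co-ratings
        SimpleItemSimilarity blend =
            new BlendingItemSimilarity(content, sparseRating, 0.7);
        System.out.println(blend.itemSimilarity(1L, 2L)); // prints 0.5
    }
}
```

In real code you'd implement Mahout's actual interface and hand the blended instance to GenericItemBasedRecommender; as ratings accumulate, the rating-based term kicks in automatically, item pair by item pair.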
It may take hundreds of milliseconds to compute, depending on your data size. But it's not a minute, for sure.

The hard part is incorporating new information quickly. In theory, one new bit of information changes the result of many calculations. The way the non-distributed code generally deals with this is a compromise: the default approach is to periodically re-load and re-compute results, rather than clear caches and recompute loads of results on every new bit of info. For example, you might be writing updates into a database table continuously and re-loading its contents every 10 minutes or so, at which point they take effect. That's a fine approach in many situations, but not all. You may need very real-time updates. This is possible with certain algorithms (slope-one), or with a little bit of hacking. It's a long subject, so we can get into that as and when you get there. But -- while you sure could do batch updates every hour or night, I don't think it's necessary or buys much. Real-ish time is easy even with updates.

I'd describe the above as a simple approach to CF. It certainly gets more complex and sophisticated if you like. But the above is quite easy as these things go, and will probably solve most of the problem in most kinds of situations. I'd start here and make it more complicated only as you need to.

On Wed, Dec 29, 2010 at 11:49 AM, Andy Parsons <[email protected]> wrote:
>>> - Do you have explicit ratings from the users or are you working with
>>> implicit data?
> [ASP] We will have both, in the form of ratings, views/purchases, and
> "recommend to a friend"
>>>
>>> - What do you exactly mean by hybrid recommendations? Do you mean a
>>> combination of content based and collaborative filtering techniques?
> [ASP] Yes, precisely.
>>>
>>> - How fast do you need the recommendations? Would it be ok to have them
>>> precomputed on a daily basis e.g. or do you need them in realtime?
> [ASP] Either *could* work, with a preference for realtime.
>>>
>>> - How often do new users and new items enter your dataset? How sparse is
>>> your rating data?
> [ASP] New users are added in the hundreds on a daily basis. Rating data will
> be very sparse in the initial months the application is live, so we are
> looking at options for priming the system. Given the quantity of items,
> however, we'll have fairly sparse rating/item coverage in general.
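P.S. Coming back to the periodic re-load idea above, here's a minimal sketch of the scheduling side. The SimpleRefreshable interface is a stand-in for Mahout's Refreshable; in real code you'd call recommender.refresh(null) on a recommender backed by ReloadFromJDBCDataModel, which re-pulls the table into memory. The period is whatever staleness you can tolerate:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Stand-in for Mahout's Refreshable; the real call would be
// recommender.refresh(null), which cascades down to the DataModel.
interface SimpleRefreshable {
    void refresh();
}

public class PeriodicRefresh {
    // Calls recommender.refresh() every periodMillis milliseconds.
    // New ratings written to the database become visible in
    // recommendations after at most one period.
    public static ScheduledExecutorService start(SimpleRefreshable recommender,
                                                 long periodMillis) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(recommender::refresh,
                                      periodMillis, periodMillis,
                                      TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo with a fast period; in a real app this runs for the
        // lifetime of the server with something like a 10-minute period.
        ScheduledExecutorService s =
            start(() -> System.out.println("re-loading data model..."), 100);
        Thread.sleep(350);
        s.shutdown();
    }
}
```

That's the whole compromise in a dozen lines: writes go to the database continuously, and the in-memory model catches up on a schedule rather than per-write.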
