On Thu, Dec 16, 2010 at 4:28 PM, Niels Basjes <[email protected]> wrote:
> The dataset size really huge. I'm currently looking up against >5M
> items, >2M users and several millions of "item views" per day.
> All of those dimensions are growing.
> Can the non-distributed way handle that kind of volume?
"It depends", but, my rule of thumb is that you can probably fit 100M user-item associations on one reasonable modern server machine. After that you are needing 8, 16, 32 GB of RAM to keep scaling. It's possible but gets expensive. Distributed is the answer after about 100M associations. The other big piece of advice to give at this stage is: can you prune? Even if you have 1B associations, it's possible 90% of that data can be usefully ignored. How you choose that is another question unto itself. > > > For Hadoop there is an item-based recommender with pluggable similarity > > metrics. > > For non-distributed there's much more. > > Is there an overview? ... or is that in your book? > Overview of the non-distributed bit? The book does cover it yes but I think you can easily see what's out there by looking up implementations of Recommender from the IDE. Skimming the javadoc ought to do a lot to summarize what's out there. > > > For example a user-based or item-based > > recommender with log-likelihood similarity, or an SVD-based recommender, > > doesn't suffer as much from these issues. > > Ok .... you just lost this newbie ... > Hehe maybe the book would be useful. > > > The distributed version is necessarily batch -- it's Hadoop after all. > > The non-distributed version is all real-time, incremental updates. > > Can the incremental versions handle the volume I mentioned? > Non-distributed? Right now it sounds like you have < 100M associations so yeah you can get away with an in-memory instance and those can be updated in real time. Again the algorithms vary a lot in terms of how fast they can update. I've seen that in some cases of "large volume processing" it pays to > do part of the processing per "day" of input data and aggregate over > the whole period. As I have almost no understanding of the kind of > algorithms used here this remark of mine could very well be > meaningless here. > > For recommenders no it doesn't quite work like that. 
But at large scale, in general, you'll be doing big batch computations
every hour or day or whatever, and then using other means to incorporate
new information instantly, if imperfectly, into the recommendations.
That is, you'll find you have to build both a batch piece and a
real-time piece and integrate them to really do this properly at scale
-- unless it's OK that new data takes a while to be incorporated, rather
than showing up more or less instantly. This is harder, and it's
something I'm working on separately myself.
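As a rough illustration of that batch-plus-real-time split (plain Java, not a Mahout API; the class and the "suppress just-viewed items" rule are hypothetical), the serving layer might overlay fresh events on top of stale batch scores:

```java
import java.util.*;

// Illustrative sketch: batch scores are recomputed periodically (e.g. by a
// nightly Hadoop job), while new item views are folded in immediately.
public class HybridScorer {
    private final Map<Long, Double> batchScores;   // itemId -> batch score
    private final Set<Long> recentlyViewed = new HashSet<>();

    public HybridScorer(Map<Long, Double> batchScores) {
        this.batchScores = batchScores;
    }

    // Record a new view instantly, without waiting for the next batch run.
    public void recordView(long itemId) {
        recentlyViewed.add(itemId);
    }

    // Serve the stale-but-thorough batch score, but suppress items the user
    // just viewed -- a crude freshness rule; a real system would instead
    // boost items similar to the recent views.
    public double score(long itemId) {
        double base = batchScores.getOrDefault(itemId, 0.0);
        return recentlyViewed.contains(itemId) ? 0.0 : base;
    }
}
```

The point is only the shape of the integration: the expensive model is rebuilt on a schedule, while a cheap in-memory structure absorbs new data between rebuilds.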
