On Thu, Dec 16, 2010 at 4:28 PM, Niels Basjes <[email protected]> wrote:
> The dataset size really huge. I'm currently looking up against >5M
> items, >2M users and several millions of "item views" per day.
> All of those dimensions are growing.
> Can the non-distributed way handle that kind of volume?
"It depends", but, my rule of thumb is that you can probably fit 100M user-item associations on one reasonable modern server machine. After that you are needing 8, 16, 32 GB of RAM to keep scaling. It's possible but gets expensive. Distributed is the answer after about 100M associations. The other big piece of advice to give at this stage is: can you prune? Even if you have 1B associations, it's possible 90% of that data can be usefully ignored. How you choose that is another question unto itself. > > > For Hadoop there is an item-based recommender with pluggable similarity > > metrics. > > For non-distributed there's much more. > > Is there an overview? ... or is that in your book? > Overview of the non-distributed bit? The book does cover it yes but I think you can easily see what's out there by looking up implementations of Recommender from the IDE. Skimming the javadoc ought to do a lot to summarize what's out there. > > > For example a user-based or item-based > > recommender with log-likelihood similarity, or an SVD-based recommender, > > doesn't suffer as much from these issues. > > Ok .... you just lost this newbie ... > Hehe maybe the book would be useful. > > > The distributed version is necessarily batch -- it's Hadoop after all. > > The non-distributed version is all real-time, incremental updates. > > Can the incremental versions handle the volume I mentioned? > Non-distributed? Right now it sounds like you have < 100M associations so yeah you can get away with an in-memory instance and those can be updated in real time. Again the algorithms vary a lot in terms of how fast they can update. I've seen that in some cases of "large volume processing" it pays to > do part of the processing per "day" of input data and aggregate over > the whole period. As I have almost no understanding of the kind of > algorithms used here this remark of mine could very well be > meaningless here. > > For recommenders no it doesn't quite work like that. 
But at large scale, in general, you'll be doing big batch computations
every hour or day or whatever, and then using other means to incorporate
new information instantly, if imperfectly, into the recommendations.
That is, you'll find you have to build both a batch piece and a
real-time piece and integrate them to really do this properly at scale
-- unless it's OK that new data takes a while to be incorporated, rather
than showing up more or less instantly. This is harder, and it's
something I'm working on separately myself.
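As a rough illustration of that batch-plus-real-time split (plain Java, not a Mahout API; the class and the "suppress just-viewed items" rule are hypothetical), the serving layer might overlay fresh events on top of stale batch scores:

```java
import java.util.*;

// Illustrative sketch: batch scores are recomputed periodically (e.g. by a
// nightly Hadoop job), while new item views are folded in immediately.
public class HybridScorer {
    private final Map<Long, Double> batchScores;   // itemId -> batch score
    private final Set<Long> recentlyViewed = new HashSet<>();

    public HybridScorer(Map<Long, Double> batchScores) {
        this.batchScores = batchScores;
    }

    // Record a new view instantly, without waiting for the next batch run.
    public void recordView(long itemId) {
        recentlyViewed.add(itemId);
    }

    // Serve the stale-but-thorough batch score, but suppress items the user
    // just viewed -- a crude freshness rule; a real system would instead
    // boost items similar to the recent views.
    public double score(long itemId) {
        double base = batchScores.getOrDefault(itemId, 0.0);
        return recentlyViewed.contains(itemId) ? 0.0 : base;
    }
}
```

The point is only the shape of the integration: the expensive model is rebuilt on a schedule, while a cheap in-memory structure absorbs new data between rebuilds.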
