My broad advice is to use a non-distributed recommender until it's clear your data won't fit in memory.
How many user-item associations do you have? My rule of thumb to date is that 100M or fewer fit and run OK on a reasonably modern server machine (i.e. 4GB or so of heap). You can go bigger with a big machine. But if you're at 100M now, you may be at 200M and then 500M soon, and you don't want to get stuck. However, just because you have 500M data points doesn't mean you need to process 500M data points. Sampling, removing noise, removing old data, etc. can let you deal with more data in practice by actually processing less. So, again, I'd say you can think non-distributed first.

I think your data model has preference values: you have ratings, and you have other behavior that you will translate into some implicit rating. So you are not looking at "boolean prefs", as the framework calls it.

If you want to do content-based recommendation, what you want to do is create an ItemSimilarity implementation that returns whatever notion of similarity you like. The framework doesn't help you with this question at all, but it provides a neat place to plug in your black-box similarity metric. Then, sure, you can write another ItemSimilarity implementation which blends, or delegates to, other ItemSimilarity metrics -- yours, and perhaps some standard rating-based metric like PearsonCorrelationSimilarity. That's another bit of code you'd have to write, to your taste, but it is quite simple. It's also a fine way of dealing with the initial sparseness -- the content-based similarity is always there, even when there aren't enough ratings for collaborative filtering-based similarity metrics to do anything.

Put that together with GenericItemBasedRecommender and a DataModel implementation, and that's your recommender. Where are your ratings? Database, file, other? That will inform how you pick the DataModel. It's probably FileDataModel, or else MySQLJDBCDataModel + ReloadFromJDBCDataModel to cache it in memory.

Real-time recommendations shouldn't be a big deal.
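To make the blend-and-fall-back idea concrete, here's a minimal sketch. It uses a simplified stand-in interface rather than Mahout's real org.apache.mahout.cf.taste.similarity.ItemSimilarity (which has more methods and whose rating-based implementations take a DataModel); the 0.7 weight and the stub metrics are purely illustrative:

```java
// Stand-in for Mahout's ItemSimilarity: one method returning a
// similarity in [-1, 1], or NaN if it can't be computed.
interface SimpleItemSimilarity {
    double itemSimilarity(long itemID1, long itemID2);
}

// Blends a content-based metric with a rating-based one, falling back
// to content alone where the rating-based metric has too little data.
class BlendingItemSimilarity implements SimpleItemSimilarity {
    private final SimpleItemSimilarity content;  // your black-box content metric
    private final SimpleItemSimilarity rating;   // e.g. a Pearson-style metric
    private final double ratingWeight;           // how much to trust rating data

    BlendingItemSimilarity(SimpleItemSimilarity content,
                           SimpleItemSimilarity rating,
                           double ratingWeight) {
        this.content = content;
        this.rating = rating;
        this.ratingWeight = ratingWeight;
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) {
        double c = content.itemSimilarity(itemID1, itemID2);
        double r = rating.itemSimilarity(itemID1, itemID2);
        // Rating-based metrics return NaN when items share too few ratings;
        // the content-based similarity is always available, so use it alone.
        if (Double.isNaN(r)) {
            return c;
        }
        return ratingWeight * r + (1.0 - ratingWeight) * c;
    }
}

public class BlendDemo {
    public static void main(String[] args) {
        SimpleItemSimilarity content = (a, b) -> 0.5;             // stub content metric
        SimpleItemSimilarity sparseRating = (a, b) -> Double.NaN; // too few co-ratings
        SimpleItemSimilarity blend =
            new BlendingItemSimilarity(content, sparseRating, 0.7);
        System.out.println(blend.itemSimilarity(1L, 2L)); // prints 0.5
    }
}
```

In real code you'd implement Mahout's actual interface and hand the blended instance to GenericItemBasedRecommender; as ratings accumulate, the rating-based term kicks in automatically, item pair by item pair.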
It may take hundreds of milliseconds to compute, depending on your data size. But it's not a minute, for sure.

The hard part is incorporating new information quickly. In theory, one new bit of information changes the result of many calculations. The way the non-distributed code generally deals with this is a compromise: the default approach is to periodically re-load and re-compute results, rather than clear caches and recompute loads of results on every new bit of info. For example, you might be writing updates into a database table continuously and re-loading its contents every 10 minutes or so, at which point they take effect. That's a fine approach in many situations, but not all. You may need very real-time updates. This is possible with certain algorithms (slope-one), or with a little bit of hacking. It's a long subject, so we can get into that as and when you get there. But -- while you sure could do batch updates every hour or night, I don't think it's necessary or buys much. Real-ish time is easy even with updates.

I'd describe the above as a simple approach to CF. It certainly gets more complex and sophisticated if you like. But the above is quite easy as these things go, and will probably solve most of the problem in most kinds of situations. I'd start here and make it more complicated only as you need to.

On Wed, Dec 29, 2010 at 11:49 AM, Andy Parsons <[email protected]> wrote:
>>> - Do you have explicit ratings from the users or are you working with
>>> implicit data?
> [ASP] We will have both, in the form of ratings, views/purchases, and
> "recommend to a friend"
>>>
>>> - What do you exactly mean by hybrid recommendations? Do you mean a
>>> combination of content based and collaborative filtering techniques?
> [ASP] Yes, precisely.
>>>
>>> - How fast do you need the recommendations? Would it be ok to have them
>>> precomputed on a daily basis e.g. or do you need them in realtime?
> [ASP] Either *could* work, with a preference for realtime.
>>>
>>> - How often do new users and new items enter your dataset? How sparse is
>>> your rating data?
> [ASP] New users are added in the hundreds on a daily basis. Rating data will
> be very sparse in the initial months the application is live, so we are
> looking at options for priming the system. Given the quantity of items,
> however, we'll have fairly sparse rating/item coverage in general.
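P.S. Coming back to the periodic re-load idea above, here's a minimal sketch of the scheduling side. The SimpleRefreshable interface is a stand-in for Mahout's Refreshable; in real code you'd call recommender.refresh(null) on a recommender backed by ReloadFromJDBCDataModel, which re-pulls the table into memory. The period is whatever staleness you can tolerate:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Stand-in for Mahout's Refreshable; the real call would be
// recommender.refresh(null), which cascades down to the DataModel.
interface SimpleRefreshable {
    void refresh();
}

public class PeriodicRefresh {
    // Calls recommender.refresh() every periodMillis milliseconds.
    // New ratings written to the database become visible in
    // recommendations after at most one period.
    public static ScheduledExecutorService start(SimpleRefreshable recommender,
                                                 long periodMillis) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(recommender::refresh,
                                      periodMillis, periodMillis,
                                      TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo with a fast period; in a real app this runs for the
        // lifetime of the server with something like a 10-minute period.
        ScheduledExecutorService s =
            start(() -> System.out.println("re-loading data model..."), 100);
        Thread.sleep(350);
        s.shutdown();
    }
}
```

That's the whole compromise in a dozen lines: writes go to the database continuously, and the in-memory model catches up on a schedule rather than per-write.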
