Recommenders need user preference data. The more the better, right? Well, yes 
and no…

Assume you have a catalog where things are added often but older items 
also remain in stock for some time. Training on user preference data gathered 
over a fairly long time period will likely be a good thing. But this history 
of everything a user has done may not be the best query to use for returning 
recs.

Using an offline precision metric (MAP@n) and real ecommerce data, we built 
Mahout recommender models on 3, 6, 9, and 12 months of data. We held out the 
most recent 10% for testing the recommender’s predictions. As one would expect, 
the more data the better. But I think there is a hidden problem in this.
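
For anyone who wants the metric spelled out, here is a minimal Python sketch of 
MAP@n. It uses one common definition (average precision truncated at n, 
normalized by min(number of held-out items, n)); other variants exist, and the 
function names and dict-based inputs are my own illustration, not the code 
behind the numbers above.

```python
def average_precision_at_n(recommended, held_out, n):
    """AP@n for one user: recommended is a ranked list without duplicates,
    held_out is the collection of items the user actually chose."""
    relevant = set(held_out)
    if not relevant or n <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), n)


def map_at_n(recs_by_user, held_out_by_user, n):
    """Mean of AP@n over users that have at least one held-out item."""
    users = [u for u, items in held_out_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_n(recs_by_user.get(u, []), held_out_by_user[u], n)
        for u in users
    )
    return total / len(users)
```

A user with no recs at all counts as zero here, which penalizes models that 
cannot score cold users; that is a design choice, not the only option.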

Using a user’s entire history may not lead to the best recs for today. The 
intuition is that the most recent n actions should be used for making recs, 
thereby capturing the user’s current intent.

Unfortunately Mahout’s recommenders use the same data to build the “indicator 
matrix” as they do to make the query for returning recs.

Current Mahout:
B = history of all preferences by all users
Mahout calculates recs by doing 
[B'B]B' = R
where [B'B] is actually the output of the RowSimilarityJob (RSJ) and so is an 
“indicator matrix”, not just a cooccurrence matrix. I always use the 
log-likelihood ratio (LLR) as the similarity measure in the RSJ, so [B'B] 
should be read as shorthand for that.
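
To show what the LLR score measures, here is a minimal Python sketch of the 
log-likelihood ratio on a 2x2 contingency table of cooccurrence counts. It is 
the standard G-squared statistic written in entropy form; the function names 
are mine and this is an illustration, not the RSJ code itself.

```python
from math import log


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(x)


def unnormalized_entropy(*counts):
    """N*ln(N) minus the sum of k*ln(k) over the given counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """k11: users with both items, k12: item A only,
    k21: item B only, k22: neither item."""
    row = unnormalized_entropy(k11 + k12, k21 + k22)
    col = unnormalized_entropy(k11 + k21, k12 + k22)
    mat = unnormalized_entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point rounding.
    return max(0.0, 2.0 * (row + col - mat))
```

Two items that cooccur exactly as often as chance predicts score about zero, 
while items that nearly always appear together score high, so thresholding on 
LLR keeps only the anomalous cooccurrences as indicators.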

The problem with this approach is that B is the only input and is therefore 
used for the query as well as for the training.

The Solr+Mahout recommender, where the query runs in realtime and the 
training occurs periodically in the background, solves this problem nicely. The 
indicator matrix is produced from as much data as possible, but there is no 
requirement that all of that data be used in the query. For the Solr+Mahout 
recommender I’d rather say:
[B'B]h = R
where h is a single user's history going back as far as you think is useful 
and B is as much data as makes sense for your catalog. h is probably best 
picked by taking the most recent n actions/prefs rather than using a 
point-in-time cutoff, because some people are more active than others.
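
Here is a rough Python sketch of the [B'B]h idea, just to make the separation 
of training data and query concrete. Assumptions to note: it uses plain 
cooccurrence counts as a stand-in for the LLR-based indicator matrix, it 
scores in memory rather than through a Solr query, and all names and the toy 
data are made up for illustration.

```python
from collections import defaultdict


def cooccurrence(histories):
    """Item-item cooccurrence counts from all users' full histories.
    A plain stand-in for [B'B]; no LLR filtering is applied here."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        unique = set(items)
        for a in unique:
            for b in unique:
                if a != b:
                    counts[a][b] += 1
    return counts


def recent_history(actions, n):
    """h: the n most recent distinct items from (timestamp, item) pairs."""
    recent = []
    for _, item in sorted(actions, key=lambda a: a[0], reverse=True):
        if len(recent) >= n:
            break
        if item not in recent:
            recent.append(item)
    return recent


def recommend(indicators, h, top_k):
    """Sum the indicator rows for items in h, skipping items already in h."""
    scores = defaultdict(float)
    for item in h:
        for other, weight in indicators.get(item, {}).items():
            if other not in h:
                scores[other] += weight
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_k]


# Training uses everything; the query uses only the user's latest action.
histories = {
    "u1": ["tent", "stove", "lantern"],
    "u2": ["tent", "stove"],
    "u3": ["phone", "case"],
    "u4": ["phone", "case", "charger"],
}
indicators = cooccurrence(histories)
actions = [(1, "tent"), (2, "stove"), (3, "phone")]
recs_recent = recommend(indicators, recent_history(actions, 1), 2)
recs_full = recommend(indicators, recent_history(actions, 3), 2)
```

In this toy data the n = 1 query returns only phone accessories, while the 
full-history query mixes camping gear back in, which is the effect I am 
worried about.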

I think this points to an improvement that could be made to Mahout’s 
recommender. Either B and h could be supplied separately, or we could leave 
the query to Solr.

Anyone have an opinion?
