It is possible to do but not implemented anywhere as far as I know. Streaming and online/incremental model calcs are different things. Plain streaming recalcs the model over a moving time window, but does so very often; online/incremental treats the model as a mutable thing and modifies it in place. As you can imagine, they require very different methods. Ted's reference points out that the internal LLR-weighted cooccurrence calc can be done online, for two reasons: there is a cutoff on the number of cooccurrences, which means many new interactions will not affect the model at all, and LLR is a very simple calc that does not involve the entire row or column vectors, only their non-zero element counts, which are easy to keep in memory (one vector each).
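To make the "simple calc on non-zero counts" point concrete, here is a minimal sketch of Dunning's log-likelihood ratio over a 2x2 contingency table. The inputs are exactly the cheap counts mentioned above: per-item interaction counts, the cooccurrence count, and the total interaction count. Function names and the sample numbers are illustrative, not Mahout's API:

```python
from math import log

def x_log_x(x):
    # x * ln(x), with 0 * ln(0) defined as 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # un-normalized entropy term used in Dunning's LLR
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio from a 2x2 contingency table:
    k11 = interactions containing both items A and B
    k12 = interactions containing A but not B
    k21 = interactions containing B but not A
    k22 = interactions containing neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# The table is derived from bookkeeping counts only, no row/column
# vectors needed (illustrative numbers):
n_a, n_b, n_ab, n_total = 1000, 500, 100, 100000
score = llr(n_ab, n_a - n_ab, n_b - n_ab, n_total - n_a - n_b + n_ab)
```

Because the score depends only on these four counts, updating one cooccurrence pair after a new interaction is O(1), which is what makes the online variant plausible.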
It’s relatively simple to set up Mahout’s item and row similarity to take streams and recalc at rapid intervals. I’ve done this with Kafka feeding Spark Streaming. This uses an entire time window’s worth of data and so is not incremental, but since the calc is fast and O(n), it scales with the size of the Spark cluster. The cooccurrence and cross-cooccurrence calc can be done on the public epinions data on my laptop in 12 minutes, though that is a smallish dataset.

But may I ask why you want online/incremental? There are only a few edge cases that benefit from it, and as Ted points out, very few interactions will modify the model at all. The reasons to update a model are:

1) New items are added. Actually, only when new items have some number of interactions. How often is your item collection changing? If you have a very popular newspaper and the items change by the minute, this might be a case where very rapid model updates would benefit you.

2) The characteristics of interactions change very rapidly. This is where users are changing preferences very often. I have never personally run into this case but imagine there are examples in social media.

The Multimodal recommender can handle new users that have some usage history but were not used in the model calc, so new users are not a case where you need incremental model updates.

On Jun 19, 2015, at 3:46 PM, Ted Dunning <[email protected]> wrote:

The standard approach is to re-run the off-line learning.

It is possible, though not yet supported in Mahout tools, to do real-time updates. See here for some details:
https://www.mapr.com/resources/videos/fully-real-time-recommendation-%E2%80%93-ted-dunning-sf-data-mining

On Fri, Jun 19, 2015 at 2:35 AM, James Donnelly <[email protected]> wrote:

> Hi,
>
> First of all, a big thanks to Ted and Pat, and all the authors and
> developers around Mahout.
>
> I'm putting together an eCommerce recommendation framework, and have a
> couple of questions from using the latest tools in Mahout 1.0.
>
> I've seen it hinted by Pat that real-time updates (incremental learning)
> are made possible with the latest Mahout tools here:
>
> http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
>
> But once I have gone through the first phase of data processing, I'm not
> clear on the basic direction for maintaining the generated data, e.g. with
> added products and incremental user behaviour data.
>
> The only way I can see is to update my input data, then re-run the entire
> process of generating the similarity matrices using the itemSimilarity and
> rowSimilarity jobs. Is there a better way?
>
> James
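For what it's worth, the "plain streaming" approach described above (a moving time window, full recompute per interval) can be sketched in a few lines of plain Python. The class and method names here are illustrative only; the real pipeline would be Kafka into Spark Streaming with Mahout doing the cooccurrence calc:

```python
from collections import Counter, deque
from itertools import combinations

class WindowedCooccurrence:
    """Sketch of windowed (non-incremental) streaming: keep a moving
    time window of (timestamp, user, item) interactions and recompute
    cooccurrence counts from scratch over the whole window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, user, item), oldest first

    def add(self, ts, user, item):
        self.events.append((ts, user, item))
        # drop interactions that fell out of the time window
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()

    def recalc(self):
        # full recompute over the current window: O(n) in window size,
        # which is why it parallelizes easily across a Spark cluster
        by_user = {}
        for _, user, item in self.events:
            by_user.setdefault(user, set()).add(item)
        cooc = Counter()
        for items in by_user.values():
            for a, b in combinations(sorted(items), 2):
                cooc[(a, b)] += 1
        return cooc

w = WindowedCooccurrence(window_seconds=60)
for ts, user, item in [(0, "u1", "a"), (1, "u1", "b"),
                       (2, "u2", "a"), (3, "u2", "b")]:
    w.add(ts, user, item)
counts = w.recalc()  # Counter({('a', 'b'): 2})
```

The contrast with online/incremental is visible in `recalc`: nothing is mutated in place, the model is simply rebuilt from the window's data each interval.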
