It is possible to do but not implemented anywhere as far as I know. Streaming and online/incremental model calcs are different things. Plain streaming recalcs the model over a moving time window, but does so very often; online/incremental treats the model as a mutable thing and modifies it in place. As you can imagine, they require very different methods. Ted's reference points out that the internal LLR-weighted cooccurrence calc can be done online, for two reasons: there is a cutoff on the number of cooccurrences, which means many new interactions will not affect the model at all, and LLR is a very simple calc that does not involve the entire row or column vectors, only their non-zero element counts, which are easy to keep in memory (one vector each).
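To make the "simple calc on non-zero counts" point concrete, here is a minimal sketch of Dunning's log-likelihood ratio over a 2x2 contingency table. The inputs are exactly the cheap counts mentioned above: per-item interaction counts, the cooccurrence count, and the total interaction count. Function names and the sample numbers are illustrative, not Mahout's API:

```python
from math import log

def x_log_x(x):
    # x * ln(x), with 0 * ln(0) defined as 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # un-normalized entropy term used in Dunning's LLR
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio from a 2x2 contingency table:
    k11 = interactions containing both items A and B
    k12 = interactions containing A but not B
    k21 = interactions containing B but not A
    k22 = interactions containing neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# The table is derived from bookkeeping counts only, no row/column
# vectors needed (illustrative numbers):
n_a, n_b, n_ab, n_total = 1000, 500, 100, 100000
score = llr(n_ab, n_a - n_ab, n_b - n_ab, n_total - n_a - n_b + n_ab)
```

Because the score depends only on these four counts, updating one cooccurrence pair after a new interaction is O(1), which is what makes the online variant plausible.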
It’s relatively simple to set up Mahout’s item and row similarity to take streams and recalc at rapid intervals. I’ve done this with Kafka feeding Spark Streaming. This uses an entire time window’s worth of data and so is not incremental, but since the calc is fast and O(n), it scales with the size of the Spark cluster. The cooccurrence and cross-cooccurrence calc can be done on the public epinions data on my laptop in 12 minutes, though that is a smallish dataset.

But may I ask why you want online/incremental? There are only a few edge cases that benefit from it, and as Ted points out, very few interactions will modify the model at all. The reasons to update a model are:

1) New items are added. Actually, only when new items have some number of interactions. How often is your item collection changing? If you have a very popular newspaper and the items change by the minute, this might be a case where very rapid model updates would benefit you.

2) The characteristics of interactions change very rapidly. This is where users are changing preferences very often. I have never personally run into this case but imagine there are examples in social media.

The Multimodal recommender can handle new users that have some usage history but were not used in the model calc, so new users are not a case where you need incremental model updates.

On Jun 19, 2015, at 3:46 PM, Ted Dunning <[email protected]> wrote:

The standard approach is to re-run the off-line learning.

It is possible, though not yet supported in Mahout tools, to do real-time updates. See here for some details:
https://www.mapr.com/resources/videos/fully-real-time-recommendation-%E2%80%93-ted-dunning-sf-data-mining

On Fri, Jun 19, 2015 at 2:35 AM, James Donnelly <[email protected]> wrote:

> Hi,
>
> First of all, a big thanks to Ted and Pat, and all the authors and
> developers around Mahout.
>
> I'm putting together an eCommerce recommendation framework, and have a
> couple of questions from using the latest tools in Mahout 1.0.
>
> I've seen it hinted by Pat that real-time updates (incremental learning)
> are made possible with the latest Mahout tools here:
>
> http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
>
> But once I have gone through the first phase of data processing, I'm not
> clear on the basic direction for maintaining the generated data, e.g. with
> added products and incremental user behaviour data.
>
> The only way I can see is to update my input data, then re-run the entire
> process of generating the similarity matrices using the itemSimilarity and
> rowSimilarity jobs. Is there a better way?
>
> James
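For what it's worth, the "plain streaming" approach described above (a moving time window, full recompute per interval) can be sketched in a few lines of plain Python. The class and method names here are illustrative only; the real pipeline would be Kafka into Spark Streaming with Mahout doing the cooccurrence calc:

```python
from collections import Counter, deque
from itertools import combinations

class WindowedCooccurrence:
    """Sketch of windowed (non-incremental) streaming: keep a moving
    time window of (timestamp, user, item) interactions and recompute
    cooccurrence counts from scratch over the whole window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, user, item), oldest first

    def add(self, ts, user, item):
        self.events.append((ts, user, item))
        # drop interactions that fell out of the time window
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()

    def recalc(self):
        # full recompute over the current window: O(n) in window size,
        # which is why it parallelizes easily across a Spark cluster
        by_user = {}
        for _, user, item in self.events:
            by_user.setdefault(user, set()).add(item)
        cooc = Counter()
        for items in by_user.values():
            for a, b in combinations(sorted(items), 2):
                cooc[(a, b)] += 1
        return cooc

w = WindowedCooccurrence(window_seconds=60)
for ts, user, item in [(0, "u1", "a"), (1, "u1", "b"),
                       (2, "u2", "a"), (3, "u2", "b")]:
    w.add(ts, user, item)
counts = w.recalc()  # Counter({('a', 'b'): 2})
```

The contrast with online/incremental is visible in `recalc`: nothing is mutated in place, the model is simply rebuilt from the window's data each interval.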
