Yes. The batch training data should be updated as needed, but for some length of time the RowSimilarityJob model will remain valid and useful, even for brand-new queries made from articles that are not in the model. Remember, however, that the only items you will get back in results are ones in the training data, so that should give you an indication of how often to update it.
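One way to turn that indication into a concrete refresh policy is to measure how many recently queried article IDs are missing from the batch model. A minimal sketch follows; the IDs, function names, and the 90% threshold are illustrative, not from the thread:

```python
# Sketch: decide when to refresh the batch RowSimilarityJob model by
# checking how many recently queried article IDs it actually contains.
# Items outside the model can never appear in results, so low coverage
# means it is time to retrain. Threshold and IDs are illustrative.

def model_coverage(model_item_ids, recent_query_item_ids):
    """Fraction of recently queried items that the batch model knows about."""
    model = set(model_item_ids)
    recent = set(recent_query_item_ids)
    if not recent:
        return 1.0  # nothing queried yet, nothing missing
    return len(recent & model) / len(recent)

def needs_retrain(model_item_ids, recent_query_item_ids, min_coverage=0.9):
    """True when too many queried items fall outside the trained model."""
    return model_coverage(model_item_ids, recent_query_item_ids) < min_coverage

model_items = ["article-1", "article-2", "article-3"]
recent_items = ["article-2", "article-3", "article-9", "article-10"]

print(model_coverage(model_items, recent_items))  # 0.5
print(needs_retrain(model_items, recent_items))   # True
```

Running such a check periodically, rather than retraining on a fixed schedule, keeps the batch job frequency tied to how fast new articles actually arrive.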
For a content-based recommender you should look at Solr. The rest of the thread is missing, but I think I also suggested that you could use Solr as the similarity engine, especially if you need immediacy of model updates. In that case you simply maintain an up-to-date Solr index of all articles and their metadata; the index can be maintained in realtime, or very close to it. Once the data is in a form that Solr can index, you have a very flexible content-based recommender. For instance, you can create a query from the articles a user has read, along with their metadata, like category, location, etc. Or you may know something from the user's profile, past usage, or browsing context that allows you to boost results using this metadata.

The collaborative filtering recommender that uses Solr + Mahout can seamlessly include metadata (content-based data) when calculating recs. For instance, on the demo site we have Videos with genre data. When a user is looking at a Video that has genre tags, these can be included in the query. A simple CF query would be the list of Videos the user preferred, run against the model created by RowSimilarityJob (RSJ). With Solr we can add multiple fields to the query, so by also querying the current Video's genre tags against other Videos' genres you get genre-boosted CF recs. You should be able to use the same technique with a purely content-based recommender.

On Feb 15, 2014, at 1:37 PM, Juanjo Ramos <[email protected]> wrote:

Hi Pat,

Thanks for your comment, I found it quite helpful. I'm also trying to build a content-based recommender. One question though: how can I use RowSimilarityJob for online data? I mean, I have a dataset, and the approach you describe works pretty well to precompute the similarity matrix. However, when I get new content in my dataset (it is a dataset of news), how can I compute the similarity of only that new item against the rest without computing the whole matrix again?

Many thanks.
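The genre-boosted CF query described above can be sketched as a simple query-string builder. This is a minimal illustration, not the demo site's code: the field names ("indicators", "genres"), IDs, and boost value are all hypothetical and would need to match your own Solr schema.

```python
# Sketch of a genre-boosted CF query against a Solr index built from
# RowSimilarityJob output. The user's preferred Videos are queried against
# a hypothetical "indicators" field, and the current Video's genre tags are
# queried, with a boost, against a hypothetical "genres" field.

def build_rec_query(preferred_video_ids, current_video_genres, genre_boost=2.0):
    """Build a Solr q string combining CF history with boosted genre terms."""
    history_clause = " ".join(f"indicators:{vid}" for vid in preferred_video_ids)
    genre_clause = " ".join(f"genres:{g}^{genre_boost}" for g in current_video_genres)
    return f"{history_clause} {genre_clause}".strip()

q = build_rec_query(["video-12", "video-99"], ["comedy", "drama"])
print(q)
# indicators:video-12 indicators:video-99 genres:comedy^2.0 genres:drama^2.0
```

The resulting string would be sent as the q parameter of a normal Solr search; dropping the genre clause gives you the plain CF query, and dropping the history clause gives you a purely content-based one.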
