In the simple case I’m not sure a collaborative filtering recommender is going to work here. The items change too quickly to gather significant preference data. Articles are your items; what is their lifetime? To do CF you need relatively long-lived items and enough user preference data about those items.
There are other ways to tackle this. Let’s take Google Alerts as an example. They start with search text. I created one with the text “machine learning” and got some silly alerts: http://occamsmachete.com/ml/2012/03/16/fun-with-google-alerts/

But what they do is track every time you follow a link from their recs email. Then they train a classifier with all of the text you read. The start is pretty awful but they get better very quickly. I’m sure they do some things to make this more scalable but that’s a longer story. There is a CF angle with enough technology (read on).

Can you do the same thing? If you can tell what articles people read, you can use this collection as a content exemplar and recommend new news items based on similarity to this collection. To use the Google Alerts template:

1) use Solr to recommend articles from a user’s tweets (they may be awful at first)
2) track what they read and keep it as an example of the type of thing they like
3) when new articles come in, find the people who like that sort of thing and make them aware of it

You do step 3 by comparing the new article with each user’s collection of past reads. You can do this with Solr for ease and simplicity, but batch classification will probably give better results.

Some have used named entities in news and tweets to make CF-based recs. If you knew one named entity in an article was ‘Putin’ you could treat it as an item and gather CF data from people who read about him. With enough history like that you could build a CF-type recommender. It wouldn’t surprise me if Google is doing something like this in a lot of their search products, like Alerts.

On Feb 16, 2014, at 11:51 AM, Juanjo Ramos <[email protected]> wrote:

As per your question, we have not built anything yet, so we are still dealing with that problem: how to let the tweets drive the recommendation of the news to be viewed.
The original idea was to find item-item similarity between the user’s tweets and the news, in order to deal with the cold-start problem and infer some initial preferences of the users for the news from that similarity. This is where my original idea of using RowSimilarityJob to compute the matrix of similarities came into play. Later, as the user accesses different news items, those preferences will be tuned as in a regular item-based recommender.

Since the system has not been built yet, our first goal is to design the architecture of the system and how it should respond after new tweets are produced, even if the performance is not the best in this first version. Then we will focus on the particular problem of using tweets to recommend news, for which the links you posted will be extremely helpful.

I am new to Mahout. I have just finished reading 'Mahout in Action' and that is why I tried to use only Mahout for the implementation, but the approach you suggest with Solr seems more reasonable for making the system respond and adapt quickly when new tweets are produced.

Thanks again.
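
As a rough illustration of steps 2 and 3 in the reply above (keep what each user has read as a content exemplar, then match incoming articles against it), here is a minimal sketch in plain Python. It uses TF-IDF weighting with cosine similarity as a stand-in for what Solr or a batch classifier would do; it is not how Google Alerts, Solr, or Mahout actually implement this. The function names, the sample data, and the 0.2 threshold are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n = len(docs)
    # Smoothed IDF, so a term that occurs in every document keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def users_to_notify(new_article, reads_by_user, threshold=0.2):
    """Return (user, score) pairs whose reading history resembles new_article.

    reads_by_user maps a user id to the list of article texts that user has
    read. The threshold is an arbitrary value chosen for this sketch.
    """
    users = list(reads_by_user)
    # Treat each user's past reads as one "content exemplar" document.
    profiles = [" ".join(reads_by_user[u]) for u in users]
    vectors = tf_idf_vectors([new_article] + profiles)
    article_vec, profile_vecs = vectors[0], vectors[1:]
    scored = [(u, cosine(article_vec, v)) for u, v in zip(users, profile_vecs)]
    matches = [(u, s) for u, s in scored if s >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data: two users with different reading histories.
sample_reads = {
    "alice": ["Machine learning models for text classification",
              "Training a classifier on news text"],
    "bob": ["Football league results and transfer news",
            "Cup final match report"],
}
print(users_to_notify("New classifier improves machine learning on text",
                      sample_reads))
```

For the cold-start case, the same comparison could be seeded with a user's tweets in place of past reads until enough reading history has accumulated. Recomputing the vectors on every call is fine for a sketch, but a real system would index the profiles once, which is the part Solr would take over.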
