The solution you mention doesn’t sound right. You would usually not need to create a new ItemSimilarity class unless you have a new way to measure similarity.
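To answer the concrete question at the end of your mail, though: given the vectorized representations of two items, cosine similarity/distance is a one-liner with Mahout's math API. A minimal sketch (the vector values here are made up; in practice they'd be your TF-IDF vectors for news and tweets):

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class CosineExample {
  public static void main(String[] args) {
    // Made-up vectors; in practice these are your vectorized items.
    Vector a = new DenseVector(new double[] {1.0, 0.0, 2.0});
    Vector b = new DenseVector(new double[] {0.5, 1.0, 1.0});

    // cosine similarity = dot(a, b) / (||a|| * ||b||)
    double similarity = a.dot(b) / (a.norm(2) * b.norm(2));

    // CosineDistanceMeasure returns 1 - similarity.
    double distance = new CosineDistanceMeasure().distance(a, b);

    System.out.println("similarity = " + similarity + ", distance = " + distance);
  }
}

With the item vectors cached in memory, computing a handful of these on the fly when a new tweet arrives is cheap; it's the all-pairs precomputation that gets expensive.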
Let's see if I have this right:

1) you want to recommend news
2) recs are based on a user's tweets
3) you have little metadata about either the input or the recommended items

You mention that you have previous tweets. Do you know which tweets led to which news being viewed? Are you collecting links in tweets? You can augment tweet text with text from the pages linked to.

There are many difficulties in using tweets to recommend news, so I'd do some research before you start. A quick search turned up this article, which references others: http://nlp.cs.rpi.edu/paper/tweetnews.pdf Also, Ken Krugler wrote a series of articles on techniques used to improve text-to-text similarity; make sure to read both parts: http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

Can't predict where this will end up, but an easy thing to do as a trial is to index the news in Solr and use scrubbed tweets as queries (a rough SolrJ sketch is at the bottom of this mail). You could probably set this up in an hour or so and try it with your own tweets to see how well it does. I suspect this won't be your ultimate solution, but it's easy to do while you get your mind around the research.

On Feb 16, 2014, at 5:54 AM, Juanjo Ramos <[email protected]> wrote:

> Hi Pat,
>
> Thanks so much for your detailed response. At the moment we do not have any metadata about the articles, just their title and body.
>
> In addition, the dataset contains tweets from the user which will never be in the output of the recommender (we never want to recommend that a user see a particular tweet), but we will use them to tune the users' preferences for different pieces of news, based on the similarity between the tweets they have produced and the news that we have. Would the approach you suggest with Solr still be valid in this particular scenario?
>
> We would need the user preferences to be updated as soon as a user produces a new tweet, hence my urge to recompute item similarities as soon as a new tweet is produced. As you mentioned, we do not need to recompute the matrix of similarities whenever a piece of news is produced.
>
> I do not know if the approach I am about to suggest even makes sense, but my idea was to precompute the similarities between items (news + tweets) and store them along with the vectorized representation of every item. Then, implement my own ItemSimilarity class which would return the similarity for every pair of items (from the matrix if available) or calculate it on the fly if not found. My main problem here is that I do not know how to calculate in Mahout the cosine distance between the vectorized representation of 2 particular items. Does this approach make sense in the first place?
>
> Many thanks.
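Here is the SolrJ sketch I mentioned above for the quick trial. The core name ("news") and the field names ("id", "title", "body") are assumptions; adjust them to your schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class TweetToNewsTrial {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/news");

    // Index a news article; "id", "title", and "body" are assumed schema fields.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "news-1");
    doc.addField("title", "Some headline");
    doc.addField("body", "Full article text ...");
    solr.add(doc);
    solr.commit();

    // Use a scrubbed tweet (@mentions, URLs, stopwords removed) as the query.
    String scrubbedTweet = "mahout recommender news similarity";
    SolrQuery query = new SolrQuery(scrubbedTweet);
    query.set("defType", "edismax"); // match against both fields
    query.set("qf", "title body");
    query.setRows(10);

    QueryResponse response = solr.query(query);
    for (SolrDocument result : response.getResults()) {
      System.out.println(result.getFieldValue("id") + " is a candidate rec");
    }
  }
}

The result list is effectively your candidate recommendations, ranked by textual similarity between the tweet and the articles.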
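And if, despite the above, you do end up going the custom route you describe, the precompute-with-fallback idea would look roughly like this. This is only a sketch, assuming your item vectors and precomputed pairs fit in memory; the class and field names are hypothetical:

import java.util.Collection;
import java.util.Map;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.math.Vector;

public class PrecomputedCosineSimilarity implements ItemSimilarity {

  private final Map<Long, Vector> itemVectors;      // vectorized news + tweets
  private final Map<Long, Map<Long, Double>> cache; // precomputed pairs

  public PrecomputedCosineSimilarity(Map<Long, Vector> itemVectors,
                                     Map<Long, Map<Long, Double>> cache) {
    this.itemVectors = itemVectors;
    this.cache = cache;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    // Return the precomputed similarity if we have it.
    Map<Long, Double> row = cache.get(itemID1);
    if (row != null && row.containsKey(itemID2)) {
      return row.get(itemID2);
    }
    // Otherwise compute cosine similarity on the fly from the item vectors.
    Vector a = itemVectors.get(itemID1);
    Vector b = itemVectors.get(itemID2);
    if (a == null || b == null) {
      throw new TasteException("No vector for item " + itemID1 + " or " + itemID2);
    }
    return a.dot(b) / (a.norm(2) * b.norm(2));
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public long[] allSimilarItemIDs(long itemID) {
    return new long[0]; // not needed for this sketch
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // nothing to refresh in this in-memory sketch
  }
}

But again, try the plain Solr trial first; you may find you never need this class at all.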
