The solution you mention doesn’t sound right. You would usually not need to 
create a new ItemSimilarity class unless you have a new way to measure 
similarity.

Let's see if I have this right:

1) you want to recommend news
2) recs are based on a user’s tweets
3) you have little metadata about either input or recommended items

You mention that you have previous tweets. Do you know which tweets led to 
which news being viewed? Are you collecting links in tweets? You can augment 
the tweet text with text from the pages they link to.
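If you are collecting links, a first step might be pulling the URLs out of the tweet text so the linked pages can be fetched. A rough sketch (names and regex are my own, and deliberately simplistic):

```python
import re

# Pull http(s) links out of a tweet so the linked pages can be fetched
# and their text appended to the tweet's text before vectorizing.
URL_RE = re.compile(r"https?://\S+")

def extract_links(tweet):
    return URL_RE.findall(tweet)
```

You'd still need to follow t.co redirects and strip boilerplate from the fetched pages before the text is useful.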

There are many difficulties in using tweets to recommend news; I'd do some 
research before you start. A quick search turned up this article, 
http://nlp.cs.rpi.edu/paper/tweetnews.pdf, which references others.

Also, Ken Krugler wrote a two-part series of articles on techniques used to 
improve text-to-text similarity; make sure to read both parts. 
http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

I can't predict where this will end up, but an easy thing to do as a trial is 
to index the news in Solr and use scrubbed tweets as queries. You could 
probably set this up in an hour or so and try it with your own tweets to see 
how well it does. I suspect this won't be your ultimate solution, but it's 
easy to do while you get your mind around the research.
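By "scrubbed" I mean something like the following (an illustrative Python sketch, my own names; the resulting string would then be sent as a free-text query to your Solr news index):

```python
import re

def scrub_tweet(tweet):
    """Strip links, the RT marker, @mentions, and the '#' of hashtags
    so the remaining words can be used as a free-text query against
    the news index. Tune the rules to your data."""
    t = re.sub(r"https?://\S+", " ", tweet)  # drop links
    t = re.sub(r"(?i)\brt\b", " ", t)        # drop retweet marker
    t = re.sub(r"@\w+", " ", t)              # drop @mentions
    t = t.replace("#", " ")                  # keep the hashtag word, drop '#'
    return " ".join(re.findall(r"[A-Za-z0-9']+", t)).lower()

# scrub_tweet("RT @bob: Fed raises rates http://t.co/x #economy")
# -> "fed raises rates economy"
```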

On Feb 16, 2014, at 5:54 AM, Juanjo Ramos <[email protected]> wrote:

Hi Pat,
Thanks so much for your detailed response.

At the moment we do not have any metadata about the articles, just their 
title and body. In addition, the dataset contains tweets from the user which 
will never be in the output of the recommender (we never want to recommend 
that a user see a particular tweet), but we will use them to tune the users' 
preferences for different pieces of news, based on the similarity between the 
tweets they have produced and the news that we have.

Would the approach you suggest with Solr still be valid in this particular 
scenario? We would need the user preferences to be updated as soon as they 
produce a new tweet, hence my urge to recompute item similarities as soon as 
a new tweet is produced. As you rightly mentioned, we do not need to 
recompute the matrix of similarities whenever a piece of news is produced.

I do not know if the approach I am about to suggest even makes sense, but my 
idea was to precompute the similarities between items (news + tweets) and 
store them along with the vectorized representation of every item. Then I 
would implement my own ItemSimilarity class, which would return the 
similarity for every pair of items (from the matrix if available) or 
calculate it on the fly if not found. My main problem here is that I do not 
know how to calculate in Mahout the cosine distance between the vectorized 
representations of 2 particular items. Does this approach make sense in the 
first place?
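[For reference, the quantity asked about is just the cosine between the two term-weight vectors; if memory serves, Mahout's CosineDistanceMeasure (in org.apache.mahout.common.distance) returns one minus this value as a distance. A language-neutral sketch in Python, with a toy term-frequency vectorizer standing in for whatever weighting the items were vectorized with:]

```python
import math
from collections import Counter

def vectorize(text):
    # toy term-frequency vector; in practice use the same (e.g. TF-IDF)
    # weights the items were originally vectorized with
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # a, b are sparse {term: weight} mappings
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```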

Many thanks.

