Hello Sean,
I am trying to cluster not only based on the sales data in a data model, but
also based on content. Below is the function that does that:
@Override
public double userSimilarity(long itemID1, long itemID2) throws TasteException {
  // Convert the item IDs from long to String.
  String itemOneID = String.valueOf(itemID1);
  String itemTwoID = String.valueOf(itemID2);
  // Look up the IDs in the hash map of product attributes.
  String itemOneValue = productIdAttributesMap.get(itemOneID);
  String itemTwoValue = productIdAttributesMap.get(itemTwoID);
  // Load the TF-IDF model with the full document set. Re-handling the whole
  // corpus on every call would double-count terms, so it is guarded here with
  // a new boolean field (it could also be moved to the constructor).
  if (!tfIdfLoaded) {
    for (String s : productIdAttributesMap.values()) {
      tfIdf.handle(s);
    }
    tfIdfLoaded = true;
  }
  // Compute the content-based distance.
  double proximity = 0.0;
  if (itemOneValue != null && itemTwoValue != null) {
    proximity = tfIdf.proximity(itemOneValue, itemTwoValue);
  }
  // Now compute similarity between items from the sales data.
  DataModel dataModel = getDataModel();
  FastIDSet prefs1 = dataModel.getItemIDsFromUser(itemID1);
  FastIDSet prefs2 = dataModel.getItemIDsFromUser(itemID2);
  long prefs1Size = prefs1.size();
  long prefs2Size = prefs2.size();
  long intersectionSize = prefs1Size < prefs2Size
      ? prefs2.intersectionSize(prefs1)
      : prefs1.intersectionSize(prefs2);
  if (intersectionSize == 0) {
    return Double.NaN;
  }
  long numItems = dataModel.getNumItems();
  double logLikelihood = LogLikelihood.logLikelihoodRatio(
      intersectionSize,
      prefs2Size - intersectionSize,
      prefs1Size - intersectionSize,
      numItems - prefs1Size - prefs2Size + intersectionSize);
  // Blend the content distance and the log-likelihood similarity.
  return ExperimentParams.LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood))
      + ExperimentParams.PROXIMITY_WEIGHT * proximity;
}
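To make the blend easier to test in isolation, here is a self-contained sketch of the same computation. Everything in it is illustrative rather than taken from the project: the weights stand in for ExperimentParams, and logLikelihoodRatio is a from-scratch G-squared computation over the 2x2 contingency table, written out here on the assumption that this is what the Mahout LogLikelihood class computes.

```java
// Illustrative stand-in for ExperimentParams + Mahout's LogLikelihood.
public class BlendedSimilaritySketch {

    static final double LOGLIKELIHOOD_WEIGHT = 0.6; // assumed weight
    static final double PROXIMITY_WEIGHT = 0.4;     // assumed weight

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(c).
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // G^2 log-likelihood ratio of the 2x2 contingency table
    // [ k11 k12 ; k21 k22 ] -- the same four counts the method in the
    // email derives from the two preference sets.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against floating-point rounding
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // Squashes the unbounded LLR into [0, 1) and blends it with a
    // content proximity assumed to already lie in [0, 1].
    static double blend(double logLikelihood, double proximity) {
        double llrSimilarity = 1.0 - 1.0 / (1.0 + logLikelihood);
        return LOGLIKELIHOOD_WEIGHT * llrSimilarity
                + PROXIMITY_WEIGHT * proximity;
    }

    public static void main(String[] args) {
        double llr = logLikelihoodRatio(10, 5, 5, 80);
        System.out.println("LLR = " + llr);
        System.out.println("blended = " + blend(llr, 0.5));
    }
}
```

One property worth noting: when the table is exactly independent (e.g. all four counts equal), the LLR is 0, so the log-likelihood term contributes nothing and the score reduces to PROXIMITY_WEIGHT * proximity.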
Please let me know if this is clearer now.
Thanks very much,
-Ahmed
On Thu, Mar 22, 2012 at 5:26 PM, Sean Owen <[email protected]> wrote:
> What do you mean that you have a user-item association from a
> log-likelihood metric?
>
> Combining two values is easy in the sense that you can average them or
> something, but only if they are in the same "units". Log likelihood
> may be viewed as a probability. The distance function you derive from
> it -- and your own TFIDF distance -- it's not clear if these are
> comparable.
>
> Rather than get into this, I wonder whether you need any of this at
> all, since I'm not sure what the user-item value is to begin with.
> That's your output, not an input.
>
> On Thu, Mar 22, 2012 at 9:18 PM, Ahmed Abdeen Hamed
> <[email protected]> wrote:
> > Hello,
> >
> > I developed a recommender that computes the distance between two items
> > based on contents. However, I also need to include the user-item
> > association. But when I do that, I end up having a similarity score from
> > the item-item content-based approach and also another similarity score
> > based on the item-user association (log-likelihood). I am now designing
> > some experiments to consider different weights for each approach before
> > I add them together. Here is the mathematical model I have in mind:
> >
> > LOGLIKELIHOOD_WEIGHT*(1.0 - 1.0 / (1.0 + logLikelihood)) +
> > (CONTENT_WEIGHT* content-proximity) such that
> >
> > [1] LOGLIKELIHOOD_WEIGHT (weight between 0, 1 e.g., 0.6)
> >
> > [2] CONTENT_WEIGHT (weight between 0, 1 e.g., 0.4)
> >
> > [3] logLikelihood is a variable that gets populated by a logLikelihood
> > similarity metric based on the user-item association
> >
> > [4] content-proximity is a variable that gets populated by
> > a content-based similarity algorithm (TF-IDF).
> >
> > My question now is: does this mathematical model make sense? Can we add
> > the two different scores the way I did above even though they come from
> > two different distributions, or will the outcome be skewed?
> >
> > Please let me know if you have an answer for me.
> >
> > Thanks very much,
> >
> > -Ahmed
>