Hello Sean,

I am trying to cluster not only based on the sales data in a data model, but
also based on content. Below is the function that does that:

 @Override
 public double userSimilarity(long itemID1, long itemID2) throws TasteException {

   // convert the item IDs from long to String
   String itemOneID = String.valueOf(itemID1);
   String itemTwoID = String.valueOf(itemID2);

   // look up the attribute strings for each item in the map
   String itemOneValue = productIdAttributesMap.get(itemOneID);
   String itemTwoValue = productIdAttributesMap.get(itemTwoID);

   // load the TF-IDF object with all documents
   // (note: doing this on every call repeats the same work; it could be
   // moved to the constructor so the corpus is processed only once)
   for (String s : productIdAttributesMap.values()) {
     tfIdf.handle(s);
   }

   // compute the content-based distance
   double proximity = 0.0;
   if (itemOneValue != null && itemTwoValue != null) {
     proximity = tfIdf.proximity(itemOneValue, itemTwoValue);
   }

   // now compute the similarity between items from sales data
   DataModel dataModel = getDataModel();
   FastIDSet prefs1 = dataModel.getItemIDsFromUser(itemID1);
   FastIDSet prefs2 = dataModel.getItemIDsFromUser(itemID2);

   long prefs1Size = prefs1.size();
   long prefs2Size = prefs2.size();
   long intersectionSize = prefs1Size < prefs2Size
       ? prefs2.intersectionSize(prefs1)
       : prefs1.intersectionSize(prefs2);
   if (intersectionSize == 0) {
     return Double.NaN;
   }

   long numItems = dataModel.getNumItems();
   double logLikelihood = LogLikelihood.logLikelihoodRatio(
       intersectionSize,
       prefs2Size - intersectionSize,
       prefs1Size - intersectionSize,
       numItems - prefs1Size - prefs2Size + intersectionSize);

   // merge the content distance and the log-likelihood similarity
   return ExperimentParams.LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood))
       + ExperimentParams.PROXIMITY_WEIGHT * proximity;
 }
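To make the blending step concrete, here is a minimal standalone sketch of just
the final formula. The weights and sample values are assumptions for
illustration only (ExperimentParams in my real code holds the actual weights);
the point is that 1.0 - 1.0 / (1.0 + logLikelihood) squashes the unbounded
log-likelihood ratio into [0, 1), so it can be weighted against a TF-IDF
proximity that is already in [0, 1]:

```java
public class BlendDemo {

    // assumed example weights; in the real code these live in ExperimentParams
    static final double LOGLIKELIHOOD_WEIGHT = 0.6;
    static final double PROXIMITY_WEIGHT = 0.4;

    // squash the unbounded log-likelihood ratio into [0, 1) and
    // blend it with a content proximity that is already in [0, 1]
    static double blend(double logLikelihood, double proximity) {
        return LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood))
             + PROXIMITY_WEIGHT * proximity;
    }

    public static void main(String[] args) {
        // 1 - 1/(1+9) = 0.9, so the result is 0.6*0.9 + 0.4*0.5, about 0.74
        System.out.println(blend(9.0, 0.5));
    }
}
```

With weights that sum to 1.0, the blended score stays in [0, 1), which keeps
the two components on comparable scales even though they come from different
distributions.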

Please let me know if this is clearer now.

Thanks very much,

-Ahmed


On Thu, Mar 22, 2012 at 5:26 PM, Sean Owen <[email protected]> wrote:

> What do you mean that you have a user-item association from a
> log-likelihood metric?
>
> Combining two values is easy in the sense that you can average them or
> something, but only if they are in the same "units". Log likelihood
> may be viewed as a probability. The distance function you derive from
> it -- and your own TFIDF distance -- it's not clear if these are
> comparable.
>
> Rather than get into this, I wonder whether you need any of this at
> all, since I'm not sure what the user-item value is to begin with.
> That's your output, not an input.
>
> On Thu, Mar 22, 2012 at 9:18 PM, Ahmed Abdeen Hamed
> <[email protected]> wrote:
> > Hello,
> >
> > I developed a recommender that computes the distance between two items
> > based on contents. However, I also need to include the association
> between
> > the user-item. But, when I do that, I end up having a similarity score
> from
> > the item-item content based and also another similarity score based on
> the
> > item-user association (loglikelihood). I am now designing some
> experiments
> > to consider different weights for each approach before I add them
> together.
> > Here is the mathematical model what I have in mind:
> >
> > LOGLIKELIHOOD_WEIGHT*(1.0 - 1.0 / (1.0 + logLikelihood)) +
> > (CONTENT_WEIGHT* content-proximity) such that
> >
> > [1] LOGLIKELIHOOD_WEIGHT (weight between 0, 1 e.g., 0.6)
> >
> > [2] CONTENT_WEIGHT (weight between 0, 1 e.g., 0.4)
> >
> > [3] logLikelihood is a variable that gets populated by a logLikelihood
> > similarity metric based on the user-item association
> >
> > [4] content-proximity is variable that gets populated by
> > a contents-based similarity algorithm (TFIDF).
> >
> > My question now is: Does this mathematical model make sense? Can we add
> the
> > two different scores even though they are from two different
> distributions
> > the way I did above or the outcome will be skewed?
> >
> > Please let me know if you have an answer for me.
> >
> > Thanks very much,
> >
> > -Ahmed
>
