You're implementing userSimilarity(), but you appear to be computing item-item similarity. Halfway through, you pass the item IDs as user IDs, in the calls to getItemIDsFromUser(). As a result I can't tell what this is intended to do.
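To make that concrete: the sales-data half of an item-item similarity should count *users*, not items. Below is a minimal self-contained sketch of the contingency counts an item-item log-likelihood needs; `contingency` is a hypothetical helper written for illustration, not Mahout API. (In Mahout itself, `DataModel.getNumUsersWithPreferenceFor(itemID1, itemID2)` gives the co-occurrence count directly.)

```java
import java.util.*;

public class ItemCooccurrence {

    // Hypothetical helper: derive the 2x2 contingency counts for two items
    // from a user -> purchased-items map. k11 = users who bought both,
    // k12 = bought only A, k21 = bought only B, k22 = bought neither.
    static long[] contingency(Map<Long, Set<Long>> userPrefs, long itemA, long itemB) {
        long k11 = 0, k12 = 0, k21 = 0, k22 = 0;
        for (Set<Long> items : userPrefs.values()) {
            boolean hasA = items.contains(itemA);
            boolean hasB = items.contains(itemB);
            if (hasA && hasB)  k11++;
            else if (hasA)     k12++;
            else if (hasB)     k21++;
            else               k22++;
        }
        return new long[] {k11, k12, k21, k22};
    }

    public static void main(String[] args) {
        // Tiny made-up dataset: users 1-4, items 100/200/300.
        Map<Long, Set<Long>> prefs = new HashMap<>();
        prefs.put(1L, new HashSet<>(Arrays.asList(100L, 200L)));
        prefs.put(2L, new HashSet<>(Arrays.asList(100L)));
        prefs.put(3L, new HashSet<>(Arrays.asList(200L)));
        prefs.put(4L, new HashSet<>(Arrays.asList(300L)));

        // Note the loop runs over users; the item IDs are never used as user IDs.
        System.out.println(Arrays.toString(contingency(prefs, 100L, 200L)));
        // → [1, 1, 1, 1]
    }
}
```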
On Thu, Mar 22, 2012 at 9:33 PM, Ahmed Abdeen Hamed <[email protected]> wrote:
> Hello Sean,
>
> I am trying to cluster not only based on the sales data in a data model,
> but also based on content. Below is the function that does that:
>
> @Override
> public double userSimilarity(long itemID1, long itemID2) throws TasteException {
>
>   // converting the item IDs from long to String
>   String itemOneID = String.valueOf(itemID1);
>   String itemTwoID = String.valueOf(itemID2);
>
>   // looking up the IDs in the hash map
>   String itemOneValue = productIdAttributesMap.get(itemOneID);
>   String itemTwoValue = productIdAttributesMap.get(itemTwoID);
>
>   // load the TF-IDF object with many documents
>   for (String s : productIdAttributesMap.values())
>     tfIdf.handle(s);
>
>   // compute the distance and return it...
>   double proximity = 0;
>   if (itemOneValue != null && itemTwoValue != null) {
>     proximity = tfIdf.proximity(itemOneValue, itemTwoValue);
>   }
>
>   // now computing similarity between items from sales data
>   DataModel dataModel = getDataModel();
>   FastIDSet prefs1 = dataModel.getItemIDsFromUser(itemID1);
>   FastIDSet prefs2 = dataModel.getItemIDsFromUser(itemID2);
>
>   long prefs1Size = prefs1.size();
>   long prefs2Size = prefs2.size();
>   long intersectionSize = prefs1Size < prefs2Size
>       ? prefs2.intersectionSize(prefs1)
>       : prefs1.intersectionSize(prefs2);
>   if (intersectionSize == 0) {
>     return Double.NaN;
>   }
>   long numItems = dataModel.getNumItems();
>   double logLikelihood = LogLikelihood.logLikelihoodRatio(
>       intersectionSize,
>       prefs2Size - intersectionSize,
>       prefs1Size - intersectionSize,
>       numItems - prefs1Size - prefs2Size + intersectionSize);
>
>   // merging the distance and the log-likelihood similarity
>   return ExperimentParams.LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood))
>       + ExperimentParams.PROXIMITY_WEIGHT * proximity;
> }
>
> Please let me know if this is clearer now.
> Thanks very much,
> -Ahmed
>
> On Thu, Mar 22, 2012 at 5:26 PM, Sean Owen <[email protected]> wrote:
>> What do you mean that you have a user-item association from a
>> log-likelihood metric?
>>
>> Combining two values is easy in the sense that you can average them or
>> something, but only if they are in the same "units". Log likelihood
>> may be viewed as a probability. The distance function you derive from
>> it -- and your own TF-IDF distance -- it's not clear whether these are
>> comparable.
>>
>> Rather than get into this, I wonder whether you need any of this at
>> all, since I'm not sure what the user-item value is to begin with.
>> That's your output, not an input.
>>
>> On Thu, Mar 22, 2012 at 9:18 PM, Ahmed Abdeen Hamed
>> <[email protected]> wrote:
>> > Hello,
>> >
>> > I developed a recommender that computes the distance between two items
>> > based on content. However, I also need to include the user-item
>> > association. When I do that, I end up with one similarity score from
>> > the content-based item-item comparison and another from the user-item
>> > association (log-likelihood). I am now designing experiments to try
>> > different weights for each approach before adding them together.
>> > Here is the mathematical model I have in mind:
>> >
>> > LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood)) +
>> > CONTENT_WEIGHT * content-proximity, such that:
>> >
>> > [1] LOGLIKELIHOOD_WEIGHT is a weight between 0 and 1 (e.g., 0.6)
>> > [2] CONTENT_WEIGHT is a weight between 0 and 1 (e.g., 0.4)
>> > [3] logLikelihood is a variable populated by a log-likelihood
>> > similarity metric based on the user-item association
>> > [4] content-proximity is a variable populated by a content-based
>> > similarity algorithm (TF-IDF)
>> >
>> > My question now is: does this mathematical model make sense?
>> > Can we add the two scores even though they come from two different
>> > distributions, the way I did above, or will the outcome be skewed?
>> >
>> > Please let me know if you have an answer for me.
>> >
>> > Thanks very much,
>> > -Ahmed
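For reference, the proposed blend can be sketched end to end in plain Java. The log-likelihood ratio is re-implemented here from the standard entropy formulation of the 2x2 contingency test (the same formulation Mahout's LogLikelihood class uses), so the example compiles without Mahout; the weights and the TF-IDF proximity value are made-up examples. Note that `1.0 - 1.0 / (1.0 + llr)` maps the unbounded ratio into [0, 1), so the blend stays in [0, 1] only if the two weights sum to 1 and the proximity itself lies in [0, 1] -- which goes to Sean's point about making sure the two terms are in comparable units before weighting them.

```java
public class BlendSketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Entropy of an unnormalized count vector.
    private static double entropy(long... counts) {
        long sum = 0;
        double partial = 0.0;
        for (long c : counts) {
            partial += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - partial;
    }

    // Standard log-likelihood ratio for a 2x2 contingency table
    // (k11 = both, k12 = only A, k21 = only B, k22 = neither).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against tiny negative values from rounding
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // The proposed blend: the unbounded LLR is squashed into [0, 1)
    // before weighting, so the weights control the mix directly.
    static double blend(double llrWeight, double contentWeight,
                        double llr, double proximity) {
        return llrWeight * (1.0 - 1.0 / (1.0 + llr)) + contentWeight * proximity;
    }

    public static void main(String[] args) {
        // Made-up counts: 10 users bought both items, 5 only A, 5 only B, 80 neither.
        double llr = logLikelihoodRatio(10, 5, 5, 80);
        // Example weights (0.6 / 0.4) and a made-up TF-IDF proximity of 0.5.
        System.out.println(blend(0.6, 0.4, llr, 0.5));
    }
}
```

With independent counts (e.g., all four cells equal to 1), the ratio is 0 and the first term vanishes, so only the content term contributes -- a useful sanity check on the formula.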
