Can you tell us what you are trying to do? You have TF-IDF weighted vectors of terms, so you may be trying to calculate content-based recommendations? In other words, you want to show similar documents when a user looks at a document? Thousands of dimensions means thousands of unique terms, so these must be pretty big docs.
You can do this by encoding your vectors into a Mahout DRM (distributed row matrix) and running RowSimilarityJob on it with SIMILARITY_COSINE. You will get a list of similar docs for each doc. There is no need for GenericBooleanPrefItemBasedRecommender or clustering. (A rough sketch of what SIMILARITY_COSINE computes for each pair of documents is appended below the quoted message.)

On Aug 26, 2014, at 3:51 AM, Jozua Sijsling <[email protected]> wrote:

Dear Mahout users,

I am using a GenericBooleanPrefItemBasedRecommender where item similarity is predetermined and based on k-means clustering results. I'm finding it hard to rate the similarity of items even with the cluster data. Given two items in the same cluster, I attempt to rate their similarity with the following calculation:

    double distanceSquared = vector1.getDistanceSquared(vector2);
    double radiusSquared = radius.getLengthSquared();
    double diameterSquared = radiusSquared * 4;
    double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
    double similarity = distanceRate * 2 - 1;
    double result = Math.max(-1, Math.min(similarity, 1));

Note that the clustered points are tf-idf vectors, where each vector has thousands of dimensions. I find that with this simple calculation I am making the false assumption that radius.getLengthSquared() * 4 is a valid way to determine the maximum squared distance between any two points in a cluster. That might have worked if the clusters were n-spheres with the same radius on every axis, but that is not the case. Often distanceSquared is much larger than diameterSquared, so my assumption fails and so does my calculation.

Given that my approach does not work, how should I determine a similarity score based on in-cluster distance? Working with ~40k dimensions, I fear that any optimal solution is still going to be very slow, so I am perfectly okay with a heuristic. But I don't know where to get started. I am not a mathematician or even skilled in algorithms. Any advice on how to approach this is very much appreciated.

Kind regards,
Jozua Sijsling
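
To make the suggestion above concrete, here is a minimal sketch of the cosine similarity that SIMILARITY_COSINE computes for each pair of documents, using only Mahout Vector operations (Vector.dot, Vector.getLengthSquared, and RandomAccessSparseVector are real Mahout classes/methods; the wrapper class and helper method names are just placeholders of my own):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Illustrative helper (not part of Mahout): cosine similarity between two
    // tf-idf vectors. With non-negative tf-idf weights the result falls in
    // [0, 1], so no clamping to [-1, 1] is needed.
    public class CosineSketch {

      static double cosineSimilarity(Vector v1, Vector v2) {
        double dot = v1.dot(v2);                          // sum of termwise products
        double norms = Math.sqrt(v1.getLengthSquared())   // ||v1||
                     * Math.sqrt(v2.getLengthSquared());  // ||v2||
        return norms == 0.0 ? 0.0 : dot / norms;
      }

      public static void main(String[] args) {
        // Two tiny stand-ins for tf-idf document vectors over a 5-term vocabulary.
        Vector doc1 = new RandomAccessSparseVector(5);
        doc1.set(0, 0.5); doc1.set(2, 1.2);
        Vector doc2 = new RandomAccessSparseVector(5);
        doc2.set(0, 0.4); doc2.set(3, 0.9);
        System.out.println(cosineSimilarity(doc1, doc2));
      }
    }

RowSimilarityJob does essentially this computation at scale across all row pairs of the DRM, so you would not normally call anything like this by hand; the sketch is only meant to show that the cluster radius never needs to enter the picture.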
