Dear Mahout users,
I am using a GenericBooleanPrefItemBasedRecommender where item similarity is predetermined and based on k-means clustering results. I'm finding it hard to rate the similarity of items even with the cluster data. Given two items in the same cluster, I attempt to rate their similarity with the following calculation:

    double distanceSquared = vector1.getDistanceSquared(vector2);
    double radiusSquared = radius.getLengthSquared();
    double diameterSquared = radiusSquared * 4;
    double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
    double similarity = distanceRate * 2 - 1;
    double result = Math.max(-1, Math.min(similarity, 1));

Note that the clustered points are tf-idf vectors, each with thousands of dimensions.

With this simple calculation I am making the false assumption that radius.getLengthSquared() * 4 is a valid means of determining the maximum squared distance between any two points in a cluster. This might have worked if the clusters were n-spheres with the same radius on every axis, but that is not the case. Often distanceSquared is much larger than diameterSquared, so my assumption fails and so does my calculation.

Given that my approach does not work, how should I determine a 'similarity score' based on in-cluster distance? Working with ~40k dimensions, I fear that any optimal solution is still going to be very slow, so I am perfectly okay with a heuristic. But I don't know where to get started: I am not a mathematician, nor am I skilled in algorithms. Any advice on how to approach this is very much appreciated.

Kind regards,
Jozua Sijsling
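
P.S. To make the problem easier to reproduce without Mahout, here is a minimal standalone sketch of the calculation above. It is only an illustration: plain double[] arrays stand in for the Mahout vectors, and the class name, the two helper methods and the sample numbers are all made up for this example. It shows how any pair whose squared distance exceeds the assumed diameterSquared ends up clamped to -1.

```java
public class ClusterSimilaritySketch {

    // Hypothetical stand-in for vector1.getDistanceSquared(vector2):
    // squared Euclidean distance between two dense vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Hypothetical stand-in for radius.getLengthSquared():
    // squared Euclidean length of a vector.
    static double squaredLength(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return sum;
    }

    // Same steps as the calculation in the message, on plain arrays.
    static double similarity(double[] v1, double[] v2, double[] radius) {
        double distanceSquared = squaredDistance(v1, v2);
        double radiusSquared = squaredLength(radius);
        double diameterSquared = radiusSquared * 4;
        double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
        double similarity = distanceRate * 2 - 1;
        // Clamp to [-1, 1]; this hides how far outside the range the value was.
        return Math.max(-1, Math.min(similarity, 1));
    }

    public static void main(String[] args) {
        double[] radius = {1.0, 1.0};   // made-up radius, so diameterSquared = 8
        double[] origin = {0.0, 0.0};
        double[] near   = {2.0, 0.0};   // distanceSquared = 4, below the assumed maximum
        double[] far    = {3.0, 0.0};   // distanceSquared = 9, above the assumed maximum

        System.out.println(similarity(origin, origin, radius)); // 1.0
        System.out.println(similarity(origin, near, radius));   // 0.0
        System.out.println(similarity(origin, far, radius));    // -1.0 (raw value was -1.25)
    }
}
```

In my real data most pairs behave like the third case, so nearly every similarity comes out as -1 and the score carries no information.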
