Dear Mahout users,

I am using a GenericBooleanPrefItemBasedRecommender where item similarity
is predetermined and based on k-means clustering results. I'm finding it
hard to rate the similarity of items even with the cluster data.

Given two items in the same cluster, I attempt to rate their similarity
based on the following calculation:

*double* distanceSquared = vector1.getDistanceSquared(vector2);
*double* radiusSquared = radius.getLengthSquared();
*double* diameterSquared = radiusSquared * 4;
*double* distanceRate = (diameterSquared - distanceSquared) /
diameterSquared;
*double* similarity = distanceRate * 2 - 1;
*double* result = Math.max(-1, Math.min(similarity, 1));

Note that the clustered points are tf-idf vectors where each vector will
have thousands of dimensions.

I find that with this simple calculation I am making the false assumption
that *radius.getLengthSquared() * 4* is a valid means of determining the
maximum squared distance for any two points in a cluster. This might have
worked if the clusters were n-spheres having the same radius on every axis.
But that is not the case. Often *distanceSquared* will be much larger than
diameterSquared, so my assumption fails and so does my calculation.

Given that my approach does not work. How should I determine a 'similarity
score' based on in-cluster distance? Working with ~40k dimensions I fear
that any optimal solution is still going to be very slow, so I am perfectly
okay with a heuristic. But I don't know where to get started.

I am not a mathematician or even skilled in algorithms. Any advice on how
to approach this is very much appreciated.


Kind regards,
Jozua Sijsling

Reply via email to