Hello Mahout developers,

I am currently trying to get a deeper understanding of the clustering algorithms: how they should be used and tuned. To that end I decided to learn from the source code of the different implementations, and I have the following questions about the Meanshift algorithm (sorry if they sound naive, I am a novice in the area):
1. I noticed that the implementation differs from the straightforward approach described in the paper (http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf). Later I learned from Jira MAHOUT-15 that this was done to enable parallelism. There I also noticed that T2 should be fixed to 1.0. In fact, it seems to me that T2 should be correlated with the convergence delta parameter (0.5 by default) and should be slightly higher than it. Is my assumption correct?

2. With the current implementation the user can select the desired distance measure, but does not have the flexibility to select a kernel. The current approach results in a hard-coded conical kernel with radius T1, and no points outside T1 are considered in the path calculation of the canopy. Is it possible to slightly modify the algorithm (similar to the modification from kmeans to fuzzy kmeans) so that each point that would touch the canopy is given a weight drawn from the kernel function? For example, the weights could be drawn from a normal distribution. Do you think that being able to select the kernel could improve clustering with meanshift in some cases?

Regards,
Vasil
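
P.S. To make question 2 more concrete, here is a minimal Java sketch of what I have in mind. It is not taken from the Mahout code base: the class KernelMeanShiftSketch and the methods flat, gaussian, euclidean and shift are hypothetical names, the distance is hard-coded to Euclidean for brevity, and the bandwidth h of the normal kernel is an extra parameter I am assuming. The flat kernel is only a simplified model of a hard cutoff at T1, not a description of the actual implementation.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch only; all names here are hypothetical, not Mahout API. */
final class KernelMeanShiftSketch {

    /** Hard cutoff: weight 1 for points within t1 of the center, 0 outside. */
    static DoubleUnaryOperator flat(double t1) {
        return d -> d <= t1 ? 1.0 : 0.0;
    }

    /** Normal (Gaussian) kernel with assumed bandwidth h: weight decays smoothly. */
    static DoubleUnaryOperator gaussian(double h) {
        return d -> Math.exp(-(d * d) / (2.0 * h * h));
    }

    /** Euclidean distance; a pluggable distance measure would go here instead. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** One shift step: move the center to the kernel-weighted mean of the points. */
    static double[] shift(double[] center, double[][] points, DoubleUnaryOperator kernel) {
        double[] weightedSum = new double[center.length];
        double totalWeight = 0.0;
        for (double[] p : points) {
            double w = kernel.applyAsDouble(euclidean(center, p));
            totalWeight += w;
            for (int i = 0; i < p.length; i++) {
                weightedSum[i] += w * p[i];
            }
        }
        if (totalWeight == 0.0) {
            return center.clone(); // no point has any weight: the center stays put
        }
        for (int i = 0; i < weightedSum.length; i++) {
            weightedSum[i] /= totalWeight;
        }
        return weightedSum;
    }
}
```

For example, with the points (0,0), (1,0) and (10,0) and a center at (0,0), flat(2.0) ignores the far point entirely, while gaussian(5.0) lets it pull the center a little further along the x axis. With such a pluggable weight function the choice of kernel would be independent of the choice of distance measure.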
