Hi Jeff, I used the meanshift algorithm to cluster the synthetic control data (I produced slightly different type of vectorization than what was given in the examples) and it gave the best results compared to all other algorithms. So I am really amazed by what it can do!
About (1): One interesting thing that I noted is the influence of T2. It turned out that if you do not select T2 correctly the algorithm may not converge. In my case I had the following situation. At the end (6-th iteration) there were 2 canopies left which were within each-other's T1 boundaries, but outside each-other's T2 boundaries. This led to the fact that they were touching each-other on every iteration and switching their centroids. As the distance between the centroids was larger then the convergence delta the algorithm never converged. However when I increased T2 to merge these 2 clusters the algorithm converged successfully. In this respect for me the following procedure worked: 1. Select T2 as small as possible (greater then convergence delta) 2. Increase it until the algorithm converges With this approach the algorithm determined also very nicely the correct number of clusters About (2): OK. I will try it and let you know about the result. On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman <[email protected]>wrote: > On 1/18/11 9:10 AM, Vasil Vasilev wrote: > >> Hello Mahout developers, >> >> Currently I am trying to get more in depth with the clustering algorithms >> - >> how they should be used and tuned. >> For this purpose I decided to learn from the source code of the different >> implementations. >> In this respect I have the following questions about the Meanshift >> algorithm >> (sorry if it may sound naive, but I am a novice in the area): >> >> 1. I noted that the way it is implemented is different from the >> straightforward approach that is described in the paper ( >> >> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf >> ). >> Later I learned from Jira MAHOUT-15 that this was made to enable >> parallelism. There I also noticed that T2 should be fixed to 1.0. >> In fact for me it seems that T2 should be correlated with the convergence >> delta parameter (which by default is 0.5) and should be slightly higher >> then >> it. Is my assumption correct? >> > I think the optimal values for T1 and T2 depend upon the distance measure > chosen and the nature of the data itself. As this implementation is really > just an iterative application of Canopy, I left both T parameters > specifiable too. This is not exactly the same algorithm as Mean Shift in the > paper but it seems to do amazingly well in some cases. > > 2. With the current implementation the user has the option to select >> desired >> distance measure, but does not have the flexibility to select a kernel. >> The >> current approach results in a hard-coded conical kernel with radius T1 and >> no points outside T1 are considered in the path calculation of the canopy. >> Is it possible to slightly modify the algorithm (similar to the >> modification >> from kmeans to fuzzy kmeans) where weights are associated with a given >> point >> that would touch the canopy and these weights are drown from the kernel >> function. For example they could be drawn from a normal distribution? Do >> you >> think the possibility for kernel selection could impact positively the >> clustering with meanshift in some cases? >> > I don't know but it is intriguing. Why don't you try it? > >> Regards, Vasil >> >> >
