Hi Vasil,

Thanks for the great testimonial. Why don't you add your tuning suggestions to the wiki? How did your vectorization differ from the example code? If your synthetic control results are as good as you say they are, you ought to post those too. The current implementation has some scalability limitations, as does Canopy, but you are not the first to write about its positive results. I'd be interested in collaborating on implementing kernel selection, and it would be nice to compare and contrast the algorithm with the original citation on the other points as well. This might be another page on the wiki, as this implementation is novel AFAICT.

I have a pretty big dataset (18 GB, compressed) and will try out mean shift on it when I get a chance.

On 1/20/11 1:27 AM, Vasil Vasilev wrote:
Hi Jeff,

I used the mean shift algorithm to cluster the synthetic control data (I
produced a slightly different type of vectorization than what was given in
the examples), and it gave the best results compared to all the other
algorithms. So I am really amazed by what it can do!

About (1): One interesting thing I noted is the influence of T2. It
turned out that if you do not select T2 correctly, the algorithm may not
converge. In my case I had the following situation: at the end (6th
iteration) there were 2 canopies left which were within each other's T1
boundaries but outside each other's T2 boundaries. As a result they kept
touching each other on every iteration and swapping their centroids. Since
the distance between the centroids was larger than the convergence delta,
the algorithm never converged. However, when I increased T2 enough to merge
these 2 clusters, the algorithm converged successfully.
In this respect, the following procedure worked for me:
1. Select T2 as small as possible (but greater than the convergence delta)
2. Increase it until the algorithm converges
With this approach the algorithm also determined the correct number of
clusters very nicely.

About (2): OK. I will try it and let you know about the result.


On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman <[email protected]> wrote:

On 1/18/11 9:10 AM, Vasil Vasilev wrote:

Hello Mahout developers,

Currently I am trying to get more in depth with the clustering algorithms -
how they should be used and tuned.
For this purpose I decided to learn from the source code of the different
implementations.
In this respect I have the following questions about the mean shift
algorithm (sorry if they may sound naive, but I am a novice in the area):

1. I noted that the way it is implemented is different from the
straightforward approach described in the paper (
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
).
Later I learned from JIRA MAHOUT-15 that this was done to enable
parallelism. There I also noticed that T2 should be fixed to 1.0.
In fact, it seems to me that T2 should be correlated with the convergence
delta parameter (which by default is 0.5) and should be slightly higher
than it. Is my assumption correct?

I think the optimal values for T1 and T2 depend upon the distance measure
chosen and the nature of the data itself. As this implementation is really
just an iterative application of Canopy, I left both T parameters
specifiable too. This is not exactly the same algorithm as Mean Shift in the
paper but it seems to do amazingly well in some cases.

2. With the current implementation the user has the option to select the
desired distance measure, but does not have the flexibility to select a
kernel. The current approach results in a hard-coded conical kernel with
radius T1, and no points outside T1 are considered in the path calculation
of the canopy. Is it possible to slightly modify the algorithm (similar to
the modification from k-means to fuzzy k-means) so that weights are
associated with each point that touches the canopy, and these weights are
drawn from the kernel function? For example, they could be drawn from a
normal distribution. Do you think the possibility of kernel selection could
positively impact clustering with mean shift in some cases?
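To make the idea concrete, here is a minimal Python sketch of a single mean-shift step with a Gaussian kernel in place of the hard T1 cutoff; the function name and the bandwidth parameter are illustrative assumptions, not part of the Mahout API:

```python
import math

def gaussian_shift(point, data, bandwidth):
    """One kernelized mean-shift step: every data point contributes a
    weight exp(-||x - p||^2 / (2 h^2)) instead of being either fully
    inside (weight 1) or fully outside (weight 0) the T1 radius."""
    num = [0.0] * len(point)
    den = 0.0
    for p in data:
        w = math.exp(-math.dist(point, p) ** 2 / (2 * bandwidth ** 2))
        den += w
        for i, x in enumerate(p):
            num[i] += w * x
    # weighted mean of all points, pulled toward the densest region
    return [n / den for n in num]
```

Iterating this step from every point until the shift falls below the convergence delta would give the kernelized variant; a flat kernel that is 1 inside T1 and 0 outside recovers the current hard-cutoff behavior.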

I don't know but it is intriguing. Why don't you try it?

Regards, Vasil


