I have added it to https://issues.apache.org/jira/browse/MAHOUT-15
I hope this was the correct place to add it!

On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[email protected]> wrote:
> Sure, please. Please include hints about the various cases where it
> works (or does not).
>
> Thanks!
>
> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[email protected]> wrote:
> > Hello,
> >
> > In fact my estimation technique works only for the MeanShift algorithm,
> > because in it the centers of the canopies move around. With pure
> > Canopy clustering I don't think it will work.
> >
> > I spent some time implementing the idea of different kernel profiles
> > in the MeanShift clustering algorithm, and here are the results:
> >
> > 1. I changed the original algorithm slightly. Previously, when one
> > cluster "touched" other clusters, its centroid was computed only from
> > the centroids of the clusters it touched, not from its own centroid.
> > Before calculating the shift I added one more line, which makes the
> > cluster observe its own centroid:
> >
> >   public boolean shiftToMean(MeanShiftCanopy canopy) {
> >     canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> >     canopy.computeConvergence(measure, convergenceDelta);
> >     canopy.computeParameters();
> >     return canopy.isConverged();
> >   }
> >
> > With this modification the convergence problem that I described above
> > also disappeared, although the number of iterations until convergence
> > increased slightly. This change was necessary in order to introduce
> > the other types of "soft" kernels.
> >
> > 2. I introduced an IKernelProfile interface with the methods:
> >
> >   public double calculateValue(double distance, double h);
> >   public double calculateDerivativeValue(double distance, double h);
> >
> > 3. I created two implementations.
> > TriangularKernelProfile, whose derivative value is:
> >
> >   @Override
> >   public double calculateDerivativeValue(double distance, double h) {
> >     return (distance < h) ? 1.0 : 0.0;
> >   }
> >
> > and NormalKernelProfile, whose derivative value is:
> >
> >   @Override
> >   public double calculateDerivativeValue(double distance, double h) {
> >     return UncommonDistributions.dNorm(distance, 0.0, h);
> >   }
> >
> > 4. I modified the code for merging canopies:
> >
> >   public void mergeCanopy(MeanShiftCanopy aCanopy,
> >       Collection<MeanShiftCanopy> canopies) {
> >     MeanShiftCanopy closestCoveringCanopy = null;
> >     double closestNorm = Double.MAX_VALUE;
> >     for (MeanShiftCanopy canopy : canopies) {
> >       double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
> >       double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> >       if (weight > 0.0) {
> >         aCanopy.touch(canopy, weight);
> >       }
> >       if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
> >         closestNorm = norm;
> >         closestCoveringCanopy = canopy;
> >       }
> >     }
> >     if (closestCoveringCanopy == null) {
> >       canopies.add(aCanopy);
> >     } else {
> >       closestCoveringCanopy.merge(aCanopy);
> >     }
> >   }
> >
> > 5. I modified MeanShiftCanopy:
> >
> >   void touch(MeanShiftCanopy canopy, double weight) {
> >     canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
> >     observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
> >   }
> >
> > 6. I modified some other files, which was necessary for the code to
> > compile and for the tests to pass.
> >
> > With the algorithm changed in this way I had the following observations:
> >
> > 1. Using the NormalKernel "blurs" the cluster boundaries, i.e. the
> > cluster content is more intermixed.
> > 2.
> > As there is no convergence problem any more, I found the following
> > procedure for estimating T2 and the convergence delta:
> > - Bind the convergence delta to T2 / 2.
> > - When you decrease T2, the number of iterations increases and the
> >   number of resulting clusters after convergence decreases.
> > - You reach a point where decreasing T2 still increases the number of
> >   iterations, but the number of resulting clusters remains unchanged.
> >   This is the best value for T2.
> > 3. In the case of the Normal kernel, what you give as T1 is in fact
> > the standard deviation of a normal distribution, so the radius of the
> > window will be T1^2.
> >
> > If you are interested I can send the code.
> >
> > Regards, Vasil
> >
> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[email protected]> wrote:
> >
> >> Vasil-
> >>
> >> Would you consider adding your estimation algorithm to this patch?
> >> https://issues.apache.org/jira/browse/MAHOUT-563
> >>
> >> The estimator in there now is stupid: a real one would make the Canopy
> >> algorithms orders of magnitude more useful.
> >>
> >> Lance
> >>
> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]> wrote:
> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[email protected]> wrote:
> >> >
> >> >> dimension 1: Using linear regression with a gradient descent
> >> >> algorithm I find the trend of the line, i.e. whether it is
> >> >> increasing, decreasing, or a straight line.
> >> >> dimension 2: Knowing the approximating line (from the linear
> >> >> regression), I count how many times this line gets crossed by the
> >> >> original signal. This helps in separating the cyclic data from all
> >> >> the rest.
> >> >> dimension 3: What is the biggest increase/decrease of a single
> >> >> signal line.
> >> >> This helps find shifts.
> >> >>
> >> >> So to say, I put semantics on the data that are to be clustered (I
> >> >> don't know if it is correct to do that, but I couldn't think of how
> >> >> an algorithm could cope with the task without such additional
> >> >> semantics).
> >> >
> >> > It is very common for feature extraction like this to be the key for
> >> > data-mining projects. Such features are absolutely critical for most
> >> > time series mining and are highly application dependent.
> >> >
> >> > One key aspect of your features is that they are shift invariant.
> >> >
> >> >> Also I developed a small Swing application which visualizes the
> >> >> clustered signals and which helped me in playing with the algorithms.
> >> >
> >> > Great idea.
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
>
> --
> Lance Norskog
> [email protected]
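The kernel-profile snippets quoted in the thread above can be collected into one compilable sketch. The `dNorm` helper below is a self-contained stand-in for Mahout's `UncommonDistributions.dNorm` (it computes the same normal density), so this is an illustration of the idea rather than the actual Mahout source:

```java
// Sketch of the IKernelProfile idea from the thread. Class and method names
// follow the messages above; dNorm is a stand-in so the file runs alone.
public class KernelProfileDemo {

    interface IKernelProfile {
        double calculateDerivativeValue(double distance, double h);
    }

    // Triangular profile: the derivative weight is flat inside the window
    // and zero outside, i.e. every canopy within T1 counts equally.
    static class TriangularKernelProfile implements IKernelProfile {
        public double calculateDerivativeValue(double distance, double h) {
            return (distance < h) ? 1.0 : 0.0;
        }
    }

    // Normal profile: the weight decays smoothly with distance; h plays the
    // role of the standard deviation, as point 3 in the thread notes.
    static class NormalKernelProfile implements IKernelProfile {
        public double calculateDerivativeValue(double distance, double h) {
            return dNorm(distance, 0.0, h);
        }
    }

    // Normal density N(mean, sd) at x; stand-in for UncommonDistributions.dNorm.
    static double dNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        IKernelProfile tri = new TriangularKernelProfile();
        IKernelProfile norm = new NormalKernelProfile();
        double t1 = 3.0;
        for (double d : new double[] {0.0, 1.5, 3.0, 6.0}) {
            System.out.printf("d=%.1f triangular=%.3f normal=%.4f%n",
                d, tri.calculateDerivativeValue(d, t1),
                norm.calculateDerivativeValue(d, t1));
        }
    }
}
```

Run against `mergeCanopy`, the triangular profile reproduces the original hard-window behavior, while the normal profile is what "blurs" the cluster boundaries in observation 1.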
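Vasil's T2 estimation procedure (bind the convergence delta to T2/2, then shrink T2 until the resulting cluster count stops changing) can be sketched as a simple search loop. `estimateT2` and the toy cluster-count function are illustrative assumptions, not Mahout API; in a real run each probe would execute MeanShift clustering with `convergenceDelta = t2 / 2` and return the number of clusters at convergence:

```java
import java.util.function.DoubleToIntFunction;

// Hypothetical sketch of the T2 search described in the thread. The function
// passed in stands for "run MeanShift with this T2 (and delta = T2/2) and
// report the cluster count"; here it is just a toy.
public class T2Search {

    static double estimateT2(DoubleToIntFunction clusterCountForT2,
                             double initialT2, double shrink, int maxSteps) {
        double t2 = initialT2;
        int prevClusters = clusterCountForT2.applyAsInt(t2);
        for (int i = 0; i < maxSteps; i++) {
            double nextT2 = t2 * shrink;
            int clusters = clusterCountForT2.applyAsInt(nextT2);
            if (clusters == prevClusters) {
                // Shrinking T2 further no longer changes the cluster count:
                // per the thread, this plateau marks the best T2 (we return
                // the larger value, which needs fewer iterations).
                return t2;
            }
            prevClusters = clusters;
            t2 = nextT2;
        }
        return t2;
    }

    public static void main(String[] args) {
        // Toy stand-in: the cluster count shrinks with T2, then plateaus at 4.
        DoubleToIntFunction toy = t2 -> Math.max(4, (int) Math.ceil(t2));
        System.out.println(estimateT2(toy, 8.0, 0.5, 20)); // prints 4.0
    }
}
```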
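The three hand-crafted features Vasil describes at the end of the thread (trend of the approximating line, crossings of that line, biggest single jump) could look roughly like this. The sketch uses the closed-form least-squares slope instead of gradient descent, which yields the same trend; all names are illustrative, not taken from the actual code:

```java
import java.util.Arrays;

// Hedged sketch of the three time-series features from the thread:
// feature 0: slope of the least-squares fit line (trend),
// feature 1: how many times the signal crosses that fit line (cyclicity),
// feature 2: the largest single-step increase/decrease (shifts).
public class SignalFeatures {

    static double[] extract(double[] y) {
        int n = y.length;
        // Closed-form least-squares fit over x = 0..n-1.
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += i; sy += y[i]; sxx += (double) i * i; sxy += i * y[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        // A sign change of the residual means the signal crossed the line.
        int crossings = 0;
        double prev = y[0] - intercept;
        for (int i = 1; i < n; i++) {
            double resid = y[i] - (slope * i + intercept);
            if (resid * prev < 0) crossings++;
            if (resid != 0) prev = resid;
        }
        // Largest single-step jump, keeping its sign.
        double maxJump = 0;
        for (int i = 1; i < n; i++) {
            double d = y[i] - y[i - 1];
            if (Math.abs(d) > Math.abs(maxJump)) maxJump = d;
        }
        return new double[] {slope, crossings, maxJump};
    }

    public static void main(String[] args) {
        double[] cyclic = {0, 1, 0, -1, 0, 1, 0, -1, 0};
        double[] shifted = {0, 0, 0, 0, 5, 5, 5, 5, 5};
        System.out.println(Arrays.toString(extract(cyclic)));
        System.out.println(Arrays.toString(extract(shifted)));
    }
}
```

As Ted's reply notes, features like these are shift invariant, which is exactly what makes them usable as clustering dimensions.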
