I created: https://issues.apache.org/jira/browse/MAHOUT-597
Regards, Vasil

On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <[email protected]> wrote:
> No, please make a new JIRA.
>
> Lance
>
> On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[email protected]> wrote:
> > It is added to https://issues.apache.org/jira/browse/MAHOUT-15
> >
> > I hope this was the correct place to add it!
> >
> > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[email protected]> wrote:
> >> Sure, please. Please include hints about the various cases where it
> >> works (or does not).
> >>
> >> Thanks!
> >>
> >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[email protected]> wrote:
> >> > Hello,
> >> >
> >> > In fact my estimation technique works only for the mean shift
> >> > algorithm, because there the centers of the canopies move around.
> >> > With pure canopy clustering I don't think it will work.
> >> >
> >> > I spent some time implementing the idea of different kernel profiles
> >> > in the mean shift clustering algorithm, and here are the results:
> >> >
> >> > 1. I changed the original algorithm slightly. Previously, when a
> >> > cluster "touched" other clusters, its centroid was computed only from
> >> > the centroids of the clusters it touched, not from its own centroid.
> >> > Now, before calculating the shift, I added one more line which makes
> >> > the cluster observe its own centroid:
> >> >
> >> > public boolean shiftToMean(MeanShiftCanopy canopy) {
> >> >   canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> >> >   canopy.computeConvergence(measure, convergenceDelta);
> >> >   canopy.computeParameters();
> >> >   return canopy.isConverged();
> >> > }
> >> >
> >> > With this modification the convergence problem I described above also
> >> > disappeared, although the number of iterations until convergence
> >> > increased slightly. This change was necessary in order to introduce
> >> > the other types of "soft" kernels.
> >> >
> >> > 2. I introduced an IKernelProfile interface which has the methods:
> >> >
> >> > public double calculateValue(double distance, double h);
> >> > public double calculateDerivativeValue(double distance, double h);
> >> >
> >> > 3. I created two implementations:
> >> >
> >> > TriangularKernelProfile, with derivative value:
> >> >
> >> > @Override
> >> > public double calculateDerivativeValue(double distance, double h) {
> >> >   return (distance < h) ? 1.0 : 0.0;
> >> > }
> >> >
> >> > and NormalKernelProfile, with derivative value:
> >> >
> >> > @Override
> >> > public double calculateDerivativeValue(double distance, double h) {
> >> >   return UncommonDistributions.dNorm(distance, 0.0, h);
> >> > }
> >> >
> >> > 4. I modified the code for merging canopies:
> >> >
> >> > public void mergeCanopy(MeanShiftCanopy aCanopy,
> >> >     Collection<MeanShiftCanopy> canopies) {
> >> >   MeanShiftCanopy closestCoveringCanopy = null;
> >> >   double closestNorm = Double.MAX_VALUE;
> >> >   for (MeanShiftCanopy canopy : canopies) {
> >> >     double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
> >> >     double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> >> >     if (weight > 0.0) {
> >> >       aCanopy.touch(canopy, weight);
> >> >     }
> >> >     if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
> >> >       closestNorm = norm;
> >> >       closestCoveringCanopy = canopy;
> >> >     }
> >> >   }
> >> >   if (closestCoveringCanopy == null) {
> >> >     canopies.add(aCanopy);
> >> >   } else {
> >> >     closestCoveringCanopy.merge(aCanopy);
> >> >   }
> >> > }
> >> >
> >> > 5. I modified MeanShiftCanopy:
> >> >
> >> > void touch(MeanShiftCanopy canopy, double weight) {
> >> >   canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
> >> >   observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
> >> > }
> >> >
> >> > 6. I modified some other files which were necessary to compile the
> >> > code and for the tests to pass.
> >> >
> >> > With the algorithm changed this way I had the following observations:
> >> >
> >> > 1. Using the normal kernel "blurs" the cluster boundaries, i.e. the
> >> > cluster content is more intermixed.
> >> > 2. As there is no convergence problem any more, I found the following
> >> > procedure for estimating T2 and the convergence delta:
> >> > - Bind convergence delta = T2 / 2.
> >> > - When you decrease T2, the number of iterations increases and the
> >> > number of resulting clusters after convergence decreases.
> >> > - You come to a point where decreasing T2 further still increases the
> >> > number of iterations, but the number of resulting clusters remains
> >> > unchanged. This is the best value for T2.
> >> > 3. In the case of the normal kernel, what you give as T1 is in fact
> >> > the standard deviation of a normal distribution, so the radius of the
> >> > window will be T1^2.
> >> >
> >> > If you are interested I can send the code.
> >> >
> >> > Regards, Vasil
> >> >
> >> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[email protected]> wrote:
> >> >> Vasil-
> >> >>
> >> >> Would you consider adding your estimation algorithm to this patch?
> >> >> https://issues.apache.org/jira/browse/MAHOUT-563
> >> >>
> >> >> The estimator in there now is stupid - a real one would make the
> >> >> Canopy algorithms orders of magnitude more useful.
> >> >>
> >> >> Lance
> >> >>
> >> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]> wrote:
> >> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[email protected]> wrote:
> >> >> >
> >> >> >> dimension 1: Using linear regression with a gradient descent
> >> >> >> algorithm I find the trend of the line, i.e. whether it is an
> >> >> >> increasing, decreasing or straight line.
> >> >> >> dimension 2: Knowing the approximating line (from the linear
> >> >> >> regression), I count how many times this line gets crossed by the
> >> >> >> original signal. This helps in separating the cyclic data from
> >> >> >> all the rest.
> >> >> >> dimension 3: What is the biggest increase/decrease of a single
> >> >> >> signal line. This helps find shifts.
> >> >> >>
> >> >> >> So to say - I put a semantics on the data that are to be
> >> >> >> clustered. (I don't know if it is correct to do that, but I
> >> >> >> couldn't think of how an algorithm could cope with the task
> >> >> >> without such additional semantics.)
> >> >> >
> >> >> > It is very common for feature extraction like this to be the key
> >> >> > for data-mining projects. Such features are absolutely critical
> >> >> > for most time series mining and are highly application dependent.
> >> >> >
> >> >> > One key aspect of your features is that they are shift invariant.
> >> >> >
> >> >> >> Also I developed a small Swing application which visualizes the
> >> >> >> clustered signals and which helped me in playing with the
> >> >> >> algorithms.
> >> >> >
> >> >> > Great idea.
> >> >>
> >> >> --
> >> >> Lance Norskog
> >> >> [email protected]
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
>
> --
> Lance Norskog
> [email protected]
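The kernel profiles discussed in the thread can be tried in isolation. Below is a minimal, self-contained sketch of the two profiles; since Mahout's UncommonDistributions.dNorm is not available outside Mahout, the normal density is written out directly here (an assumption for the sake of a runnable snippet), and the demo class name is hypothetical:

```java
// Minimal sketch of the kernel profiles from the thread above.
// NormalKernelProfile here inlines the Gaussian density instead of calling
// Mahout's UncommonDistributions.dNorm, so the snippet stands alone.
interface IKernelProfile {
  double calculateDerivativeValue(double distance, double h);
}

class TriangularKernelProfile implements IKernelProfile {
  // Hard window: full weight inside radius h, zero outside.
  public double calculateDerivativeValue(double distance, double h) {
    return (distance < h) ? 1.0 : 0.0;
  }
}

class NormalKernelProfile implements IKernelProfile {
  // "Soft" window: weight decays smoothly with distance; h plays the
  // role of the standard deviation, matching observation 3 in the thread.
  public double calculateDerivativeValue(double distance, double h) {
    return Math.exp(-(distance * distance) / (2.0 * h * h))
        / (h * Math.sqrt(2.0 * Math.PI));
  }
}

public class KernelProfileDemo {
  public static void main(String[] args) {
    IKernelProfile hard = new TriangularKernelProfile();
    IKernelProfile soft = new NormalKernelProfile();
    double t1 = 3.0;
    // Inside the window the hard kernel gives weight 1.0 and the soft kernel
    // a positive, distance-dependent weight; outside, the hard kernel gives 0
    // while the soft kernel still contributes a small weight.
    for (double d : new double[] {0.0, 2.0, 4.0}) {
      System.out.printf("d=%.1f hard=%.3f soft=%.4f%n",
          d, hard.calculateDerivativeValue(d, t1),
          soft.calculateDerivativeValue(d, t1));
    }
  }
}
```

This makes concrete why the normal kernel "blurs" cluster boundaries: every canopy touches every other canopy with some nonzero weight, rather than only those inside the T1 window.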
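The three time-series features Vasil describes (trend of the fitted line, crossings of that line by the signal, and the biggest single-step change) can be sketched as plain Java. This is an illustrative sketch, not his code: the class name is hypothetical, and the closed-form least-squares slope is used in place of the gradient-descent fit he mentions:

```java
// Sketch of the three features described in the thread for clustering
// signal lines. Samples are taken at x = 0, 1, 2, ...
public class SignalFeatures {
  // Dimension 1: slope and intercept of the least-squares line through
  // the points (0, y[0]), (1, y[1]), ... The sign of the slope tells
  // whether the line is increasing, decreasing or flat.
  public static double[] fitLine(double[] y) {
    int n = y.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += i; sy += y[i]; sxx += (double) i * i; sxy += i * y[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] {slope, intercept};
  }

  // Dimension 2: how many times the signal crosses its fitted line
  // (a residual sign change). High counts suggest cyclic data.
  public static int crossings(double[] y) {
    double[] line = fitLine(y);
    int count = 0;
    double prev = y[0] - line[1]; // residual at x = 0
    for (int i = 1; i < y.length; i++) {
      double r = y[i] - (line[0] * i + line[1]);
      if (r * prev < 0) count++;
      if (r != 0) prev = r;
    }
    return count;
  }

  // Dimension 3: biggest single-step increase or decrease, which helps
  // find shifts in the signal.
  public static double maxStep(double[] y) {
    double max = 0;
    for (int i = 1; i < y.length; i++) {
      max = Math.max(max, Math.abs(y[i] - y[i - 1]));
    }
    return max;
  }
}
```

As Ted notes, features like these are application dependent; the point of the sketch is only that each "dimension" is a cheap scalar summary of one signal, so a whole signal collapses to a small vector before clustering.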
