No, please make a new JIRA.

Lance
On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[email protected]> wrote:
> It is added to https://issues.apache.org/jira/browse/MAHOUT-15
>
> I hope this was the correct place to add it!
>
> On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[email protected]> wrote:
>> Sure, please. Please include hints about the various cases where it
>> works (or does not).
>>
>> Thanks!
>>
>> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[email protected]> wrote:
>> > Hello,
>> >
>> > In fact my estimation technique works only for the mean-shift
>> > algorithm, because there the centers of the canopies move around.
>> > With pure canopy clustering I don't think it will work.
>> >
>> > I spent some time implementing the idea of different kernel profiles
>> > in the mean-shift clustering algorithm, and here are the results:
>> >
>> > 1. I changed the original algorithm slightly. Previously, when one
>> > cluster "touched" other clusters, its centroid was computed only from
>> > the centroids of the clusters it touched, not from its own centroid.
>> > Now, before calculating the shift, I added one line that makes the
>> > cluster observe its own centroid:
>> >
>> >   public boolean shiftToMean(MeanShiftCanopy canopy) {
>> >     canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
>> >     canopy.computeConvergence(measure, convergenceDelta);
>> >     canopy.computeParameters();
>> >     return canopy.isConverged();
>> >   }
>> >
>> > With this modification the convergence problem I described earlier
>> > also disappeared, although the number of iterations until convergence
>> > increased slightly. This change was necessary in order to introduce
>> > the other types of "soft" kernels.
>> >
>> > 2. I introduced an IKernelProfile interface with the methods:
>> >
>> >   public double calculateValue(double distance, double h);
>> >   public double calculateDerivativeValue(double distance, double h);
>> >
>> > 3. I created two implementations: TriangularKernelProfile, with
>> >
>> >   @Override
>> >   public double calculateDerivativeValue(double distance, double h) {
>> >     return (distance < h) ? 1.0 : 0.0;
>> >   }
>> >
>> > and NormalKernelProfile, with
>> >
>> >   @Override
>> >   public double calculateDerivativeValue(double distance, double h) {
>> >     return UncommonDistributions.dNorm(distance, 0.0, h);
>> >   }
>> >
>> > 4. I modified the core method for merging canopies:
>> >
>> >   public void mergeCanopy(MeanShiftCanopy aCanopy,
>> >       Collection<MeanShiftCanopy> canopies) {
>> >     MeanShiftCanopy closestCoveringCanopy = null;
>> >     double closestNorm = Double.MAX_VALUE;
>> >     for (MeanShiftCanopy canopy : canopies) {
>> >       double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
>> >       double weight = kernelProfile.calculateDerivativeValue(norm, t1);
>> >       if (weight > 0.0) {
>> >         aCanopy.touch(canopy, weight);
>> >       }
>> >       if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
>> >         closestNorm = norm;
>> >         closestCoveringCanopy = canopy;
>> >       }
>> >     }
>> >     if (closestCoveringCanopy == null) {
>> >       canopies.add(aCanopy);
>> >     } else {
>> >       closestCoveringCanopy.merge(aCanopy);
>> >     }
>> >   }
>> >
>> > 5. I modified MeanShiftCanopy:
>> >
>> >   void touch(MeanShiftCanopy canopy, double weight) {
>> >     canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
>> >     observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
>> >   }
>> >
>> > 6. I modified some other files, as necessary for the code to compile
>> > and for the tests to pass.
>> >
>> > With the algorithm changed in this way, I made the following
>> > observations:
>> >
>> > 1. Using NormalKernel "blurs" the cluster boundaries, i.e. the
>> > cluster contents are more intermixed.
>> > 2. As there is no convergence problem any more, I found the following
>> > procedure for estimating T2 and the convergence delta:
>> > - Bind convergence delta = T2 / 2.
>> > - When you decrease T2, the number of iterations increases and the
>> > number of resulting clusters after convergence decreases.
>> > - You reach a point where decreasing T2 further still increases the
>> > number of iterations, but the number of resulting clusters remains
>> > unchanged. This is the best value for T2.
>> > 3. With the normal kernel, what you pass as T1 is in fact the
>> > standard deviation of a normal distribution, so the radius of the
>> > window will be T1^2.
>> >
>> > If you are interested I can send the code.
>> >
>> > Regards, Vasil
>> >
>> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[email protected]> wrote:
>> >> Vasil-
>> >>
>> >> Would you consider adding your estimation algorithm to this patch?
>> >> https://issues.apache.org/jira/browse/MAHOUT-563
>> >>
>> >> The estimator in there now is stupid - a real one would make the
>> >> Canopy algorithms orders of magnitude more useful.
>> >>
>> >> Lance
>> >>
>> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]> wrote:
>> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[email protected]> wrote:
>> >> >>
>> >> >> dimension 1: Using linear regression with a gradient descent
>> >> >> algorithm, I find the trend of the line, i.e. whether it is
>> >> >> increasing, decreasing, or flat.
>> >> >> dimension 2: Knowing the approximating line (from the linear
>> >> >> regression), I count how many times this line is crossed by the
>> >> >> original signal. This helps in separating the cyclic data from
>> >> >> all the rest.
>> >> >> dimension 3: The biggest increase/decrease of a single signal
>> >> >> line. This helps find shifts.
>> >> >>
>> >> >> So to say, I put a semantics on the data that are to be clustered
>> >> >> (I don't know if it is correct to do that, but I couldn't think
>> >> >> of how an algorithm could cope with the task without such
>> >> >> additional semantics).
>> >> >
>> >> > It is very common for feature extraction like this to be the key
>> >> > for data-mining projects. Such features are absolutely critical
>> >> > for most time-series mining and are highly application dependent.
>> >> >
>> >> > One key aspect of your features is that they are shift invariant.
>> >> >
>> >> >> Also I developed a small Swing application which visualizes the
>> >> >> clustered signals and which helped me in playing with the
>> >> >> algorithms.
>> >> >
>> >> > Great idea.
>> >>
>> >> --
>> >> Lance Norskog
>> >> [email protected]
>>
>> --
>> Lance Norskog
>> [email protected]

--
Lance Norskog
[email protected]
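[Editor's note: Vasil's kernel-profile idea can be illustrated outside Mahout with a minimal, self-contained sketch. The interface and method name below follow his message, but the 1-D demo class, the point data, and the bandwidths are invented for the example; the Gaussian weight is computed directly rather than via Mahout's UncommonDistributions.dNorm. It shows the behaviour he describes: the triangular (hard-window) profile ignores points beyond the bandwidth, while the normal profile lets distant points contribute a little, "blurring" cluster boundaries.]

```java
// Sketch of the kernel-profile abstraction (assumed names, not Mahout's classes).
interface KernelProfile {
    // Weight of a neighbour at the given distance, with bandwidth h.
    double calculateDerivativeValue(double distance, double h);
}

class TriangularKernelProfile implements KernelProfile {
    // Hard window: full weight inside the bandwidth, zero outside.
    public double calculateDerivativeValue(double distance, double h) {
        return (distance < h) ? 1.0 : 0.0;
    }
}

class NormalKernelProfile implements KernelProfile {
    // Gaussian density with standard deviation h (what
    // UncommonDistributions.dNorm(distance, 0.0, h) computes in Mahout).
    public double calculateDerivativeValue(double distance, double h) {
        double z = distance / h;
        return Math.exp(-0.5 * z * z) / (h * Math.sqrt(2.0 * Math.PI));
    }
}

public class KernelProfileDemo {
    // One mean-shift step for a 1-D centre: the new centre is the
    // kernel-weighted mean of all points, analogous to how touch()
    // accumulates weighted observations before shiftToMean().
    static double shiftToMean(double center, double[] points,
                              KernelProfile kernel, double h) {
        double weighted = 0.0, total = 0.0;
        for (double p : points) {
            double w = kernel.calculateDerivativeValue(Math.abs(p - center), h);
            weighted += w * p;
            total += w;
        }
        return total > 0.0 ? weighted / total : center;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 1.4, 8.0};
        // Triangular, h = 2: only the three nearby points count,
        // so the centre moves to their mean, 1.2.
        double tri = shiftToMean(1.0, points, new TriangularKernelProfile(), 2.0);
        // Normal, h = 2: the far point still carries a small weight,
        // pulling the centre slightly past 1.2.
        double norm = shiftToMean(1.0, points, new NormalKernelProfile(), 2.0);
        System.out.println("triangular: " + tri + "  normal: " + norm);
    }
}
```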
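[Editor's note: the three time-series features Vasil describes (trend, crossings of the regression line, biggest single step) can likewise be sketched in a few lines. All method names here are hypothetical, and the slope is computed in closed form rather than by the gradient descent he mentions; for a simple least-squares line the fitted result is the same.]

```java
public class SignalFeatures {
    // dimension 1: trend of the signal, as the slope of the
    // least-squares line over indices 0..n-1.
    static double slope(double[] y) {
        int n = y.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += i; sy += y[i]; sxx += (double) i * i; sxy += i * y[i];
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    static double intercept(double[] y) {
        int n = y.length;
        double sx = 0, sy = 0;
        for (int i = 0; i < n; i++) { sx += i; sy += y[i]; }
        return (sy - slope(y) * sx) / n;
    }

    // dimension 2: how many times the signal crosses its regression
    // line; cyclic signals cross it many times, monotone ones rarely.
    static int crossings(double[] y) {
        double a = slope(y), b = intercept(y);
        int count = 0;
        double prev = 0.0;
        for (int i = 0; i < y.length; i++) {
            double r = y[i] - (a * i + b);          // residual vs. the line
            if (i > 0 && r * prev < 0) count++;     // sign change = crossing
            if (r != 0) prev = r;
        }
        return count;
    }

    // dimension 3: the biggest single-step increase or decrease,
    // which exposes level shifts.
    static double maxStep(double[] y) {
        double max = 0.0;
        for (int i = 1; i < y.length; i++) {
            max = Math.max(max, Math.abs(y[i] - y[i - 1]));
        }
        return max;
    }

    public static void main(String[] args) {
        double[] cyclic = {0, 1, 0, -1, 0, 1, 0, -1, 0};
        double[] shifted = {0, 0, 0, 0, 5, 5, 5, 5, 5};
        System.out.println(crossings(cyclic) + " crossings, max step "
                + maxStep(shifted));
    }
}
```

A clustering run would then feed the 3-vector (slope, crossings, maxStep) for each signal into the mean-shift algorithm instead of the raw samples.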
