Hi Mahout developers,

As suggested above, I wrote about my experience clustering the synthetic control data with the different algorithms. Could you give me your opinions and suggestions? Also, where is the best place to post it on the wiki?
https://docs.google.com/leaf?id=0B8qKJ44gVr7QMDEyMTlkY2YtNDQ1OC00MmQ3LWFmZTUtOGIxNGZiODFhOGYx&hl=en

Regards, Vasil

On Thu, Jan 27, 2011 at 1:06 PM, Vasil Vasilev <[email protected]> wrote:
> I created: https://issues.apache.org/jira/browse/MAHOUT-597
>
> Regards, Vasil
>
> On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <[email protected]> wrote:
> > No, please make a new JIRA.
> >
> > Lance
> >
> > On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[email protected]> wrote:
> > > It is added to https://issues.apache.org/jira/browse/MAHOUT-15
> > >
> > > I hope this was the correct place to add it!
> > >
> > > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[email protected]> wrote:
> > > > Sure, please. Please include hints about the various cases where it
> > > > works (or does not).
> > > >
> > > > Thanks!
> > > >
> > > > On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[email protected]> wrote:
> > > > > Hello,
> > > > >
> > > > > In fact my estimation technique works only for the mean-shift
> > > > > algorithm, because in it the centers of the canopies move around.
> > > > > With pure canopy clustering I don't think it will work.
> > > > >
> > > > > I spent some time implementing the idea of different kernel profiles
> > > > > in the mean-shift clustering algorithm, and here are the results:
> > > > >
> > > > > 1. I changed the original algorithm a little. Previously, when a
> > > > > cluster "touched" other clusters, its centroid was computed only from
> > > > > the centroids of the clusters it touched, not from its own centroid.
> > > > > Now, before calculating the shift, I added one more line which makes
> > > > > the cluster observe its own centroid:
> > > > >
> > > > > public boolean shiftToMean(MeanShiftCanopy canopy) {
> > > > >   canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> > > > >   canopy.computeConvergence(measure, convergenceDelta);
> > > > >   canopy.computeParameters();
> > > > >   return canopy.isConverged();
> > > > > }
> > > > >
> > > > > With this modification the convergence problem of the algorithm (that
> > > > > I described above) also disappeared, although the number of iterations
> > > > > until convergence increased slightly. This change was necessary in
> > > > > order to introduce the other types of "soft" kernels.
> > > > >
> > > > > 2. I introduced an IKernelProfile interface which has the methods:
> > > > >
> > > > > public double calculateValue(double distance, double h);
> > > > > public double calculateDerivativeValue(double distance, double h);
> > > > >
> > > > > 3. I created two implementations:
> > > > >
> > > > > TriangularKernelProfile, with:
> > > > >
> > > > > @Override
> > > > > public double calculateDerivativeValue(double distance, double h) {
> > > > >   return (distance < h) ? 1.0 : 0.0;
> > > > > }
> > > > >
> > > > > and NormalKernelProfile, with:
> > > > >
> > > > > @Override
> > > > > public double calculateDerivativeValue(double distance, double h) {
> > > > >   return UncommonDistributions.dNorm(distance, 0.0, h);
> > > > > }
> > > > >
> > > > > 4.
I modified the code for merging canopies:
> > > > >
> > > > > public void mergeCanopy(MeanShiftCanopy aCanopy,
> > > > >     Collection<MeanShiftCanopy> canopies) {
> > > > >   MeanShiftCanopy closestCoveringCanopy = null;
> > > > >   double closestNorm = Double.MAX_VALUE;
> > > > >   for (MeanShiftCanopy canopy : canopies) {
> > > > >     double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
> > > > >     double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> > > > >     if (weight > 0.0) {
> > > > >       aCanopy.touch(canopy, weight);
> > > > >     }
> > > > >     if (norm < t2 && (closestCoveringCanopy == null || norm < closestNorm)) {
> > > > >       closestNorm = norm;
> > > > >       closestCoveringCanopy = canopy;
> > > > >     }
> > > > >   }
> > > > >   if (closestCoveringCanopy == null) {
> > > > >     canopies.add(aCanopy);
> > > > >   } else {
> > > > >     closestCoveringCanopy.merge(aCanopy);
> > > > >   }
> > > > > }
> > > > >
> > > > > 5. I modified MeanShiftCanopy:
> > > > >
> > > > > void touch(MeanShiftCanopy canopy, double weight) {
> > > > >   canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
> > > > >   observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
> > > > > }
> > > > >
> > > > > 6. I modified some other files where this was necessary for the code
> > > > > to compile and for the tests to pass.
> > > > >
> > > > > With the algorithm changed in this way I made the following observations:
> > > > >
> > > > > 1. Using the normal kernel "blurs" the cluster boundaries, i.e. the
> > > > > cluster contents are more intermixed.
> > > > > 2. As there is no convergence problem any more, I found the following
> > > > > procedure for estimating T2 and the convergence delta:
> > > > > - Bind convergence delta = T2 / 2.
> > > > > - When you decrease T2, the number of iterations increases and the
> > > > > number of resulting clusters after convergence decreases.
> > > > > - You reach a point where decreasing T2 still increases the number of
> > > > > iterations, but the number of resulting clusters remains unchanged.
> > > > > This is the point with the best value for T2.
> > > > > 3. In the case of the normal kernel, what you give as T1 is in fact
> > > > > the standard deviation of a normal distribution, so the radius of the
> > > > > window will be T1^2.
> > > > >
> > > > > If you are interested I can send the code.
> > > > >
> > > > > Regards, Vasil
> > > > >
> > > > > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[email protected]> wrote:
> > > > > > Vasil-
> > > > > >
> > > > > > Would you consider adding your estimation algorithm to this patch?
> > > > > > https://issues.apache.org/jira/browse/MAHOUT-563
> > > > > >
> > > > > > The estimator in there now is stupid - a real one would make the
> > > > > > Canopy algorithms orders of magnitude more useful.
> > > > > >
> > > > > > Lance
> > > > > >
> > > > > > On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]> wrote:
> > > > > > > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[email protected]> wrote:
> > > > > > >
> > > > > > > > dimension 1: Using linear regression with a gradient-descent
> > > > > > > > algorithm I find the trend of the line, i.e. whether it is
> > > > > > > > increasing, decreasing, or a straight line.
> > > > > > > > dimension 2: Knowing the approximating line (from the linear
> > > > > > > > regression), I count how many times this line is crossed by the
> > > > > > > > original signal. This helps in separating the cyclic data from
> > > > > > > > all the rest.
> > > > > > > > dimension 3: The biggest increase/decrease of a single signal line.
> > > > > > > > This helps find shifts.
> > > > > > > >
> > > > > > > > So to say, I put semantics on the data that are to be clustered
> > > > > > > > (I don't know if it is correct to do that, but I couldn't think
> > > > > > > > of how an algorithm could cope with the task without such
> > > > > > > > additional semantics).
> > > > > > >
> > > > > > > It is very common for feature extraction like this to be the key
> > > > > > > for data-mining projects. Such features are absolutely critical
> > > > > > > for most time-series mining and are highly application dependent.
> > > > > > >
> > > > > > > One key aspect of your features is that they are shift invariant.
> > > > > > >
> > > > > > > > Also I developed a small Swing application which visualizes the
> > > > > > > > clustered signals and which helped me in playing with the
> > > > > > > > algorithms.
> > > > > > >
> > > > > > > Great idea.
> > > > > >
> > > > > > --
> > > > > > Lance Norskog
> > > > > > [email protected]
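The T2 estimation procedure described in the thread (bind convergence delta = T2 / 2, then shrink T2 until the cluster count stops changing) can be sketched as follows. This is a minimal sketch: `estimateT2` and the toy cluster-count function are illustrative helpers, not part of Mahout's API; a real run would invoke Mahout's mean-shift driver and count the converged canopies at each T2.

```java
import java.util.function.DoubleToIntFunction;

// Sketch of the T2 estimation procedure from the thread. All names here
// are illustrative; nothing below is Mahout API.
public class T2Sweep {

  // Decrease T2 step by step until the number of resulting clusters stops
  // changing; the thread takes that point as the best T2 (with the
  // convergence delta bound to T2 / 2 on every run).
  public static double estimateT2(double t2Start, double step, int maxSteps,
                                  DoubleToIntFunction clustersForT2) {
    double t2 = t2Start;
    int previousCount = clustersForT2.applyAsInt(t2);
    for (int i = 0; i < maxSteps; i++) {
      double next = t2 - step;
      if (next <= 0.0) {
        break;
      }
      int count = clustersForT2.applyAsInt(next);
      if (count == previousCount) {
        return next; // cluster count stopped changing: take this T2
      }
      previousCount = count;
      t2 = next;
    }
    return t2;
  }

  public static void main(String[] args) {
    // Toy stand-in for "run mean-shift with this T2 and count clusters":
    // the count shrinks as T2 shrinks, then flattens at 6 clusters.
    DoubleToIntFunction toy = t2 -> Math.max(6, (int) Math.round(t2 * 10));
    double bestT2 = estimateT2(2.0, 0.1, 50, toy);
    System.out.println("estimated T2 = " + bestT2
        + ", convergence delta = " + (bestT2 / 2));
  }
}
```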
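The three features Vasil describes for the synthetic control data (trend of the fitted line, crossings of that line, biggest single jump) can be sketched like this. A closed-form least-squares slope stands in for the gradient-descent fit he mentions, and all class and method names are illustrative, not from the actual code.

```java
// Sketch of the three time-series features quoted in the thread.
public class SignalFeatures {

  // dimension 1: slope of the least-squares line (trend of the signal).
  public static double trendSlope(double[] y) {
    int n = y.length;
    double meanX = (n - 1) / 2.0;
    double meanY = 0.0;
    for (double v : y) {
      meanY += v;
    }
    meanY /= n;
    double num = 0.0;
    double den = 0.0;
    for (int i = 0; i < n; i++) {
      num += (i - meanX) * (y[i] - meanY);
      den += (i - meanX) * (i - meanX);
    }
    return num / den;
  }

  // dimension 2: how often the signal crosses its regression line
  // (separates cyclic signals from the rest).
  public static int lineCrossings(double[] y) {
    double slope = trendSlope(y);
    double meanX = (y.length - 1) / 2.0;
    double meanY = 0.0;
    for (double v : y) {
      meanY += v;
    }
    meanY /= y.length;
    int crossings = 0;
    double previousResidual = 0.0;
    for (int i = 0; i < y.length; i++) {
      double residual = y[i] - (meanY + slope * (i - meanX));
      if (residual * previousResidual < 0.0) {
        crossings++; // sign change: the signal crossed the line
      }
      if (residual != 0.0) {
        previousResidual = residual;
      }
    }
    return crossings;
  }

  // dimension 3: biggest single-step increase/decrease (helps find shifts).
  public static double maxStep(double[] y) {
    double max = 0.0;
    for (int i = 1; i < y.length; i++) {
      max = Math.max(max, Math.abs(y[i] - y[i - 1]));
    }
    return max;
  }

  public static void main(String[] args) {
    double[] cyclic = {0, 1, 0, 1, 0, 1};
    System.out.println("slope=" + trendSlope(cyclic)
        + " crossings=" + lineCrossings(cyclic)
        + " maxStep=" + maxStep(cyclic));
  }
}
```

A cyclic signal crosses its regression line many times while a trending one rarely does, which is what makes dimension 2 useful for separating the cyclic class.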
