I created: https://issues.apache.org/jira/browse/MAHOUT-597

Regards, Vasil

On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <[email protected]> wrote:

> No, please make a new JIRA.
>
> Lance
>
> On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[email protected]>
> wrote:
> > It is added to https://issues.apache.org/jira/browse/MAHOUT-15
> >
> > I hope this was the correct place to add it!
> >
> > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[email protected]>
> wrote:
> >
> >> Sure, please. Please include hints about the various cases where it
> >> works (or does not).
> >>
> >> Thanks!
> >>
> >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[email protected]>
> >> wrote:
> >> > Hello,
> >> >
> >> > In fact my estimation technique works only for the mean shift
> >> > algorithm, because there the canopy centers move around. With pure
> >> > Canopy clustering I don't think it will work.
> >> >
> >> > I spent some time implementing the idea of different kernel profiles
> >> > in the mean shift clustering algorithm, and here are the results:
> >> > 1. I changed the original algorithm a little. Previously, when one
> >> > cluster "touched" other clusters, its centroid was computed only from
> >> > the centroids of the clusters it touched, not from its own centroid.
> >> > Now, before calculating the shift, I added one more line which makes
> >> > the cluster observe its own centroid:
> >> >
> >> > public boolean shiftToMean(MeanShiftCanopy canopy) {
> >> >   canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> >> >   canopy.computeConvergence(measure, convergenceDelta);
> >> >   canopy.computeParameters();
> >> >   return canopy.isConverged();
> >> > }
> >> >
> >> > With this modification the convergence problem I described earlier
> >> > also disappeared, although the number of iterations until convergence
> >> > increased slightly. This change was necessary in order to introduce
> >> > the other types of "soft" kernels.
> >> >
> >> > 2. I introduced an IKernelProfile interface with the methods:
> >> >
> >> > public interface IKernelProfile {
> >> >   double calculateValue(double distance, double h);
> >> >   double calculateDerivativeValue(double distance, double h);
> >> > }
> >> >
> >> > 3. I created two implementations.
> >> > TriangularKernelProfile, whose derivative value is:
> >> >
> >> > @Override
> >> > public double calculateDerivativeValue(double distance, double h) {
> >> >   return (distance < h) ? 1.0 : 0.0;
> >> > }
> >> >
> >> > and NormalKernelProfile, whose derivative value is:
> >> >
> >> > @Override
> >> > public double calculateDerivativeValue(double distance, double h) {
> >> >   return UncommonDistributions.dNorm(distance, 0.0, h);
> >> > }
> >> >
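[For reference, a minimal standalone sketch of the normal density that UncommonDistributions.dNorm presumably evaluates here, with mean 0.0 and standard deviation h; the helper class below is hypothetical, not Mahout code:]

```java
// Hypothetical standalone equivalent of UncommonDistributions.dNorm(x, mean, sd):
// the density of a normal distribution with the given mean and standard deviation.
public class NormalDensity {

  public static double dNorm(double x, double mean, double sd) {
    double z = (x - mean) / sd;
    // Gaussian density: exp(-z^2 / 2) / (sd * sqrt(2 * pi))
    return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
  }

  public static void main(String[] args) {
    // Weight decays smoothly with distance, so far canopies still get a
    // small (nonzero) influence, unlike the hard 1.0/0.0 triangular cutoff.
    System.out.println(dNorm(0.0, 0.0, 1.0));
    System.out.println(dNorm(3.0, 0.0, 1.0));
  }
}
```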
> >> > 4. I modified the core code for merging canopies:
> >> >
> >> > public void mergeCanopy(MeanShiftCanopy aCanopy,
> >> >     Collection<MeanShiftCanopy> canopies) {
> >> >   MeanShiftCanopy closestCoveringCanopy = null;
> >> >   double closestNorm = Double.MAX_VALUE;
> >> >   for (MeanShiftCanopy canopy : canopies) {
> >> >     double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
> >> >     double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> >> >     if (weight > 0.0) {
> >> >       aCanopy.touch(canopy, weight);
> >> >     }
> >> >     if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
> >> >       closestNorm = norm;
> >> >       closestCoveringCanopy = canopy;
> >> >     }
> >> >   }
> >> >   if (closestCoveringCanopy == null) {
> >> >     canopies.add(aCanopy);
> >> >   } else {
> >> >     closestCoveringCanopy.merge(aCanopy);
> >> >   }
> >> > }
> >> >
> >> > 5. I modified MeanShiftCanopy:
> >> >
> >> > void touch(MeanShiftCanopy canopy, double weight) {
> >> >   canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
> >> >   observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
> >> > }
> >> >
> >> > 6. I modified some other files as needed for the code to compile and
> >> > the tests to pass.
> >> >
> >> > With the algorithm changed in this way I made the following
> >> > observations:
> >> >
> >> > 1. Using the normal kernel "blurs" the cluster boundaries, i.e. the
> >> > cluster content is more intermixed.
> >> > 2. As there is no convergence problem any more, I found the following
> >> > procedure for estimating T2 and the convergence delta:
> >> >  - bind convergence delta = T2 / 2
> >> >  - as you decrease T2, the number of iterations increases and the
> >> > number of resulting clusters after convergence decreases
> >> >  - at some point, decreasing T2 further still increases the number of
> >> > iterations, but the number of resulting clusters remains unchanged;
> >> > this is the best value for T2
> >> > 3. In the case of the normal kernel, what you give as T1 is in fact
> >> > the standard deviation of a normal distribution, so the radius of the
> >> > window will be T1^2.
> >> >
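[As a toy illustration of the T2 sweep described above; this is not Mahout code. The 1-D flat-kernel mean shift, the sample data, and the fixed iteration cap are simplifying assumptions made for the sketch, and the convergence-delta binding is elided:]

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the T2 estimation procedure: keep halving T2 until
// the number of clusters after convergence stops changing.
public class T2Sweep {

  // Minimal 1-D mean shift with a flat kernel of radius t1; modes closer
  // than t2 after convergence are merged into one cluster.
  static int clusterCount(double[] points, double t1, double t2) {
    double[] modes = points.clone();
    for (int iter = 0; iter < 200; iter++) {  // fixed cap stands in for a delta test
      for (int i = 0; i < modes.length; i++) {
        double sum = 0;
        int n = 0;
        for (double p : points) {
          if (Math.abs(p - modes[i]) <= t1) { sum += p; n++; }
        }
        if (n > 0) modes[i] = sum / n;  // shift mode to the window mean
      }
    }
    List<Double> centers = new ArrayList<>();
    for (double m : modes) {
      boolean merged = false;
      for (double c : centers) {
        if (Math.abs(c - m) < t2) { merged = true; break; }
      }
      if (!merged) centers.add(m);
    }
    return centers.size();
  }

  public static void main(String[] args) {
    double[] data = {0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1};
    double t1 = 1.0;
    int prev = -1;
    // Sweep: shrink t2 until the resulting cluster count no longer changes.
    for (double t2 = 2.0; t2 > 0.01; t2 /= 2.0) {
      int k = clusterCount(data, t1, t2);
      System.out.println("t2=" + t2 + " clusters=" + k);
      if (k == prev) break;  // cluster count stabilized: a good value for t2
      prev = k;
    }
  }
}
```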
> >> > If you are interested I can send the code.
> >> >
> >> > Regards, Vasil
> >> >
> >> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[email protected]>
> >> wrote:
> >> >
> >> >> Vasil-
> >> >>
> >> >> Would you consider adding your estimation algorithm to this patch?
> >> >> https://issues.apache.org/jira/browse/MAHOUT-563
> >> >>
> >> >> The estimator in there now is stupid - a real one would make the
> >> >> Canopy algorithms orders of magnitude more useful.
> >> >>
> >> >> Lance
> >> >>
> >> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]>
> >> >> wrote:
> >> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <
> [email protected]>
> >> >> wrote:
> >> >> >
> >> >> >>
> >> >> >> dimension 1: Using linear regression with a gradient descent
> >> >> >> algorithm, I find the trend of the line, i.e. whether it is
> >> >> >> increasing, decreasing or flat
> >> >> >> dimension 2: Knowing the approximating line (from the linear
> >> >> >> regression), I count how many times the original signal crosses
> >> >> >> this line. This helps separate the cyclic data from all the rest
> >> >> >> dimension 3: The biggest increase/decrease within a single signal
> >> >> >> line. This helps find shifts
> >> >> >>
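[A rough sketch of the three features described above, for a signal sampled at x = 0, 1, 2, ...; this is not the original code, and ordinary least squares is used here instead of gradient descent for brevity:]

```java
// Hypothetical extractor for the three clustering features: trend slope,
// crossings of the fitted line, and biggest single-step change.
public class SignalFeatures {

  // dimension 1: slope of the least-squares line (its sign gives the trend)
  static double slope(double[] y) {
    int n = y.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int x = 0; x < n; x++) {
      sx += x; sy += y[x]; sxx += (double) x * x; sxy += x * y[x];
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
  }

  // dimension 2: how many times the signal crosses the fitted line
  static int crossings(double[] y) {
    double a = slope(y);
    double b = mean(y) - a * (y.length - 1) / 2.0;  // intercept
    int count = 0;
    double prev = y[0] - b;  // residual at x = 0
    for (int x = 1; x < y.length; x++) {
      double r = y[x] - (a * x + b);
      if (r * prev < 0) count++;  // sign change => crossing
      if (r != 0) prev = r;
    }
    return count;
  }

  // dimension 3: biggest single-step increase or decrease
  static double maxStep(double[] y) {
    double best = 0;
    for (int x = 1; x < y.length; x++) {
      best = Math.max(best, Math.abs(y[x] - y[x - 1]));
    }
    return best;
  }

  static double mean(double[] y) {
    double s = 0;
    for (double v : y) s += v;
    return s / y.length;
  }

  public static void main(String[] args) {
    double[] ramp = {0, 1, 2, 3, 4};
    System.out.println(slope(ramp));       // positive: increasing trend
    double[] shifted = {0, 0, 0, 5, 5, 5};
    System.out.println(maxStep(shifted));  // the shift magnitude
  }
}
```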
> >> >> >> In other words, I impose a semantics on the data to be clustered
> >> >> >> (I don't know if it is correct to do that, but I couldn't think
> >> >> >> of how an algorithm could cope with the task without such
> >> >> >> additional semantics)
> >> >> >>
> >> >> >
> >> >> > It is very common for feature extraction like this to be the key
> >> >> > for data-mining projects. Such features are absolutely critical
> >> >> > for most time series mining and are highly application dependent.
> >> >> >
> >> >> > One key aspect of your features is that they are shift invariant.
> >> >> >
> >> >> >
> >> >> >> Also, I developed a small Swing application which visualizes the
> >> >> >> clustered signals and which helped me play with the algorithms.
> >> >> >>
> >> >> >
> >> >> > Great idea.
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Lance Norskog
> >> >> [email protected]
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
> >>
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>
