No, please make a new JIRA.

Lance

On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[email protected]> wrote:
> It is added to https://issues.apache.org/jira/browse/MAHOUT-15
>
> I hope this was the correct place to add it!
>
> On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[email protected]> wrote:
>
>> Sure, please. Please include hints about the various cases where it
>> works (or does not).
>>
>> Thanks!
>>
>> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[email protected]>
>> wrote:
>> > Hello,
>> >
>> > In fact, my estimation technique works only for the mean-shift
>> > algorithm, because there the canopy centers move around. With pure
>> > Canopy clustering I don't think it will work.
>> >
>> > I spent some time trying to realize the idea of different kernel profiles
>> > with the meanshift clustering algorithm and here are the results:
>> > 1. I changed the original algorithm a little. Previously, when one
>> > cluster "touched" other clusters, its centroid was computed based only on
>> > the centroids of the clusters it touched, not on its own centroid.
>> > Now, before calculating the shift, I added one more line which makes the
>> > cluster observe its own centroid:
>> >
>> >  public boolean shiftToMean(MeanShiftCanopy canopy) {
>> >    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
>> >    canopy.computeConvergence(measure, convergenceDelta);
>> >    canopy.computeParameters();
>> >    return canopy.isConverged();
>> >  }
>> >
>> > With this modification, the convergence problem of the algorithm (which
>> > I described earlier) also disappeared, although the number of iterations
>> > until convergence increased slightly.
>> > This change was necessary in order to introduce the other types of "soft"
>> > kernels.
>> >
>> > 2. I introduced an IKernelProfile interface which has the methods:
>> >
>> >    public double calculateValue(double distance, double h);
>> >    public double calculateDerivativeValue(double distance, double h);
>> >
>> > 3. I created 2 implementations:
>> > TriangularKernelProfile, whose derivative value is:
>> >
>> >    @Override
>> >    public double calculateDerivativeValue(double distance, double h) {
>> >        return (distance < h) ? 1.0 : 0.0;
>> >    }
>> >
>> > and NormalKernelProfile, whose derivative value is:
>> >
>> >    @Override
>> >    public double calculateDerivativeValue(double distance, double h) {
>> >        return UncommonDistributions.dNorm(distance, 0.0, h);
>> >    }
>> >
>> > 4. I modified the code for merging canopies:
>> >
>> >  public void mergeCanopy(MeanShiftCanopy aCanopy,
>> >                          Collection<MeanShiftCanopy> canopies) {
>> >    MeanShiftCanopy closestCoveringCanopy = null;
>> >    double closestNorm = Double.MAX_VALUE;
>> >    for (MeanShiftCanopy canopy : canopies) {
>> >      double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
>> >      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
>> >      if (weight > 0.0) {
>> >        aCanopy.touch(canopy, weight);
>> >      }
>> >      if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
>> >        closestNorm = norm;
>> >        closestCoveringCanopy = canopy;
>> >      }
>> >    }
>> >    if (closestCoveringCanopy == null) {
>> >      canopies.add(aCanopy);
>> >    } else {
>> >      closestCoveringCanopy.merge(aCanopy);
>> >    }
>> >  }
>> >
>> > 5. I modified MeanShiftCanopy:
>> >
>> >  void touch(MeanShiftCanopy canopy, double weight) {
>> >    canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
>> >    observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
>> >  }
>> >
>> > 6. I modified some other files as necessary for the code to compile and
>> > for the tests to pass.
>> >
>> > With the modified algorithm I made the following observations:
>> >
>> > 1. Using the normal kernel "blurs" the cluster boundaries, i.e. the
>> > cluster contents are more intermixed.
>> > 2. As there is no convergence problem any more, I found the following
>> > procedure for estimating T2 and the convergence delta:
>> >  - bind convergence delta = T2 / 2
>> >  - when you decrease T2, the number of iterations increases and the
>> > number of resulting clusters after convergence decreases
>> >  - you reach a point where decreasing T2 further still increases the
>> > number of iterations but leaves the number of resulting clusters
>> > unchanged; this is the best value for T2
>> > 3. In the case of the normal kernel, what you give as T1 is in fact the
>> > standard deviation of a normal distribution, so the radius of the window
>> > will be T1^2
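The T2 search described above could be sketched roughly like this. Everything here is hypothetical scaffolding, not the patch itself: clusterCount stands in for running the clustering to convergence with the given T2 (and convergence delta = T2 / 2), and the halving schedule is just one way to "decrease T2":

```java
// Rough sketch of the T2 estimation procedure described above:
// keep halving T2 until the number of resulting clusters stops changing,
// then report the T2 at which the count first stabilized.
import java.util.function.DoubleToIntFunction;

public class T2Estimator {

  public static double estimateT2(double initialT2,
                                  DoubleToIntFunction clusterCount) {
    double t2 = initialT2;
    int clusters = clusterCount.applyAsInt(t2);
    while (true) {
      double smallerT2 = t2 / 2.0;          // convergence delta would be smallerT2 / 2
      int newClusters = clusterCount.applyAsInt(smallerT2);
      if (newClusters == clusters) {
        return t2;                          // cluster count stabilized: best T2
      }
      clusters = newClusters;
      t2 = smallerT2;
    }
  }

  public static void main(String[] args) {
    // Toy stand-in for a clustering run: the cluster count shrinks as T2
    // shrinks, then stabilizes at 3 once T2 drops below 1.0.
    double best = estimateT2(8.0, t2 -> t2 >= 1.0 ? (int) (2 + t2) : 3);
    System.out.println(best); // prints 1.0
  }
}
```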
>> >
>> > If you are interested I can send the code.
>> >
>> > Regards, Vasil
>> >
>> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[email protected]>
>> wrote:
>> >
>> >> Vasil-
>> >>
>> >> Would you consider adding your estimation algorithm to this patch?
>> >> https://issues.apache.org/jira/browse/MAHOUT-563
>> >>
>> >> The estimator in there now is stupid; a real one would make the Canopy
>> >> algorithms orders of magnitude more useful.
>> >>
>> >> Lance
>> >>
>> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]>
>> >> wrote:
>> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[email protected]>
>> >> wrote:
>> >> >
>> >> >>
>> >> >> dimension 1: Using linear regression with a gradient descent
>> >> >> algorithm, I find the trend of the line, i.e. whether it is
>> >> >> increasing, decreasing, or a straight line
>> >> >> dimension 2: Knowing the approximating line (from the linear
>> >> >> regression), I count how many times this line gets crossed by the
>> >> >> original signal. This helps in separating the cyclic data from all
>> >> >> the rest
>> >> >> dimension 3: The biggest increase/decrease of a single signal line.
>> >> >> This helps find shifts
>> >> >>
>> >> >> So to say, I attach semantics to the data that are to be clustered
>> >> >> (I don't know if it is correct to do that, but I couldn't think of
>> >> >> how an algorithm could cope with the task without such additional
>> >> >> semantics)
>> >> >>
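The three features could be sketched like this. This is a minimal illustration of the idea, not Vasil's actual code; the trend line here is fitted with closed-form least squares rather than gradient descent (the fitted line is the same):

```java
// Sketch of the three time-series features described above:
// 1) trend of the signal, 2) crossings of its trend line, 3) biggest jump.
public class SignalFeatures {

  // dimension 1: slope of the least-squares line through the signal.
  public static double trendSlope(double[] y) {
    int n = y.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (int i = 0; i < n; i++) {
      sumX += i; sumY += y[i]; sumXY += i * y[i]; sumXX += (double) i * i;
    }
    return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
  }

  // dimension 2: how many times the signal crosses its trend line
  // (helps separate cyclic signals from the rest).
  public static int lineCrossings(double[] y) {
    double slope = trendSlope(y);
    double meanX = (y.length - 1) / 2.0;
    double meanY = 0;
    for (double v : y) meanY += v;
    meanY /= y.length;
    int crossings = 0;
    double prev = 0;
    for (int i = 0; i < y.length; i++) {
      double residual = y[i] - (meanY + slope * (i - meanX));
      if (i > 0 && residual * prev < 0) crossings++;
      if (residual != 0) prev = residual;
    }
    return crossings;
  }

  // dimension 3: biggest single-step increase/decrease (helps find shifts).
  public static double biggestJump(double[] y) {
    double max = 0;
    for (int i = 1; i < y.length; i++) {
      max = Math.max(max, Math.abs(y[i] - y[i - 1]));
    }
    return max;
  }

  public static void main(String[] args) {
    double[] shifted = {0, 0, 0, 5, 5, 5};    // a level shift
    System.out.println(trendSlope(shifted) > 0); // prints true
    System.out.println(biggestJump(shifted));    // prints 5.0
  }
}
```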
>> >> >
>> >> > It is very common for feature extraction like this to be the key for
>> >> > data-mining projects. Such features are absolutely critical for most
>> >> > time series mining and are highly application dependent.
>> >> >
>> >> > One key aspect of your features is that they are shift invariant.
>> >> >
>> >> >
>> >> >> Also I developed a small swing application which visualizes the
>> >> clustered
>> >> >> signals and which helped me in playing with the algorithms.
>> >> >>
>> >> >
>> >> > Great idea.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> [email protected]
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>



-- 
Lance Norskog
[email protected]
