Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-03-25 Thread Gilles Sadowski
Hello.

> > [...]
> 
>  I have started 2 PRs to solve the problem you mentioned.
> 
>  About the "CentroidInitializer" I have a new idea:
>  Move CentroidInitializers as inner classes of "KMeansPlusPlusClusterer",
>  and add a constructor parameter and a property "useKMeansPlusPlus" to
>  "KMeansPlusPlusClusterer":
>  ```java
>  // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
>  public class KMeansPlusPlusClusterer<T extends Clusterable> extends
>  Clusterer<T> {
>  public KMeansPlusPlusClusterer(final int k, final int maxIterations,
> final DistanceMeasure measure,
> final UniformRandomProvider random,
> final EmptyClusterStrategy emptyStrategy,
>  + final boolean useKMeansPlusPlus) {
>  // ...
>  -  // Use K-means++ to choose the initial centers.
>  -  this.centroidInitializer = new
>  KMeansPlusPlusCentroidInitializer(measure, random);
>  +  this.useKMeansPlusPlus = useKMeansPlusPlus;
>  }
> >>>
> >>>What if one comes up with a third way to initialize the centroids?
> >>>If you can ensure that there is no other initialization procedure,
> >>>a boolean is fine, if not, we could still make the existing procedures
> >>>package-private (e.g. moving them in as classes defined within
> >>>"KMeansPlusPlusClusterer").
> >>
> >> As far as I know, k-means has two center initialization methods so far,
> >> random and k-means++;
> >> using a boolean to choose between them is good enough for current use,
> >
> >Indeed but the question is: Will it be future-proof?
> >[We don't want to break compatibility (and have to release a
> >new major version of the library) for having overlooked the
> >question which I'm asking.]
> >
> >> but there are two situations where users need to implement the center
> >> initialization method themselves:
> >> 1. Commons Math's implementation is not good enough;
> >
> >As this is FLOSS, the understanding (IMO) is in that case users
> >would contribute back to fix what needs be.
> >
> >> 2. There are new center initialization methods.
> >
> >So, that would be arguing against using a boolean (?).
> >Cf. above (about breaking compatibility).
> The name "KMeansPlusPlusClusterer" decide it won't support new initialize 
> methods.
> Whether "KMeansPlusPlusClusterer" compatible random

Sorry, but I don't understand what you mean.

> >>>
> >>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
> >>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
> >>>Perhaps we don't want to depart from the original (?) algorithm; if so,
> >>>the new constructor could be made protected (thus simplifying the API).
> >>
> >> k-means++ is the recommended center initialization method nowadays;
> >> whether we should let users fall back to randomly chosen centers is a
> >> question that needs a tradeoff.
> >
> >Is there any advantage to "random" init vs "kmeans" init?
> >E.g. is "random" faster, yet would give similar clustering results?
>
> k-means++ is an improvement of k-means(random init).

Do I understand correctly that to call the algorithm KMeans++,
the only change (from "Kmeans") is the centroid initialization
(using "KMeans" vs using "random")?
If so, it would be a contradiction (name vs functionality) to allow
"random" in a class called "KMeansPlusPlusClusterer".

So it seems the alternatives are:
 (1) Drop "random" init (and keep "KMeansPlusPlusClusterer") if
you are positive that this functionality won't be missed.
 (2) Allow flexibility for init (and rename "KMeansPlusPlusClusterer"
to "KMeansClusterer"), adding a boolean argument to the constructor:
"doKMeansPlusPlus" to select "random" (false) or "kmeans" (true).
 (3) Allow flexibility by passing a function to perform the init.

According to what you wrote, (1) is the better alternative right
now. Option (3) might (perhaps!) become a nice enhancement,
but only after we settle on a better design for the fundamental
classes ("Cluster", "ClusterSet", "ClusterResult", ...).
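For concreteness, option (3) could look roughly like the following sketch. All names here are hypothetical illustrations, not the existing Commons Math API: the initialization step becomes a function the caller passes in, so new procedures can be added without touching the constructor signature.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of alternative (3): initialization passed as a function.
public class InitAsFunctionSketch {
    interface Clusterable {
        double[] getPoint();
    }

    @FunctionalInterface
    interface CentroidInitializer<T extends Clusterable> {
        // Choose k initial centers from the given points.
        List<double[]> chooseInitialCenters(Collection<T> points, int k);
    }

    static class DoublePoint implements Clusterable {
        private final double[] p;
        DoublePoint(double[] p) { this.p = p; }
        public double[] getPoint() { return p; }
    }

    public static void main(String[] args) {
        // A deliberately trivial "initializer" for the sake of a
        // deterministic example: take the first k points as centers.
        CentroidInitializer<DoublePoint> firstK = (points, k) -> {
            List<double[]> centers = new ArrayList<>();
            for (DoublePoint pt : points) {
                if (centers.size() == k) {
                    break;
                }
                centers.add(pt.getPoint());
            }
            return centers;
        };
        List<DoublePoint> data = List.of(
            new DoublePoint(new double[] {0, 0}),
            new DoublePoint(new double[] {1, 1}),
            new DoublePoint(new double[] {9, 9}));
        List<double[]> centers = firstK.chooseInitialCenters(data, 2);
        System.out.println(centers.size());
    }
}
```

A k-means++ or random procedure would then just be another implementation of the same functional interface.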

> The performance cost of using "k-means++" is tiny relative to the entire
> clustering process.
> I think keeping the old state is OK (make chooseInitialCenters package-private),
> but right now "CentroidInitializer" is an intermediate state.

+1 if you mean that you'll remove, for now, the package "initialization".
Please submit a PR.

> >
> >> Should we make the API simple or rich?
> >
> >I'd keep it simple until we fix the (IMHO) more important
> >issues of thread-safety and sparse data.
>
> Some personal opinion about "Sparse data":
> Math ML does not use a matrix for clustering, but a list of vectors (double
> arrays);
>  this is flexible and simple (this is the main attraction of Math ML to me).
>
> Operations on sparse matrices can be faster than on dense matrices,
> but not that much faster on sparse vectors, and the performance improvement
> will bring complexity.
>
> Anyhow if we want to start this 

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-03-25 Thread chentao...@qq.com
Hi,

>2020-03-25 4:54 UTC+01:00, chentao...@qq.com :
>> Hi,
>>
>>>Hello.
>>>
>>>Le mar. 24 mars 2020 à 06:39, chentao...@qq.com  a écrit
>>> :

 Hi,

 I have started 2 PRs to solve the problem you mentioned.

 About the "CentroidInitializer" I have a new idea:
 Move CentroidInitializers as inner classes of "KMeansPlusPlusClusterer",
 and add a constructor parameter and a property "useKMeansPlusPlus" to
 "KMeansPlusPlusClusterer":
 ```java
 // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
 public class KMeansPlusPlusClusterer<T extends Clusterable> extends
 Clusterer<T> {
 public KMeansPlusPlusClusterer(final int k, final int maxIterations,
    final DistanceMeasure measure,
    final UniformRandomProvider random,
    final EmptyClusterStrategy emptyStrategy,
 + final boolean useKMeansPlusPlus) {
 // ...
 -  // Use K-means++ to choose the initial centers.
 -  this.centroidInitializer = new
 KMeansPlusPlusCentroidInitializer(measure, random);
 +  this.useKMeansPlusPlus = useKMeansPlusPlus;
 }
>>>
>>>What if one comes up with a third way to initialize the centroids?
>>>If you can ensure that there is no other initialization procedure,
>>>a boolean is fine, if not, we could still make the existing procedures
>>>package-private (e.g. moving them in as classes defined within
>>>"KMeansPlusPlusClusterer").
>>
>> As far as I know, k-means has two center initialization methods so far,
>> random and k-means++;
>> using a boolean to choose between them is good enough for current use,
>
>Indeed but the question is: Will it be future-proof?
>[We don't want to break compatibility (and have to release a
>new major version of the library) for having overlooked the
>question which I'm asking.]
>
>> but there are two situations where users need to implement the center
>> initialization method themselves:
>> 1. Commons Math's implementation is not good enough;
>
>As this is FLOSS, the understanding (IMO) is in that case users
>would contribute back to fix what needs be.
>
>> 2. There are new center initialization methods.
>
>So, that would be arguing against using a boolean (?).
>Cf. above (about breaking compatibility). 
The name "KMeansPlusPlusClusterer" decide it won't support new initialize 
methods.
Whether "KMeansPlusPlusClusterer" compatible random 

>
>>>
>>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
>>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
>>>Perhaps we don't want to depart from the original (?) algorithm; if so,
>>>the new constructor could be made protected (thus simplifying the API).
>>
>> k-means++ is the recommended center initialization method nowadays;
>> whether we should let users fall back to randomly chosen centers is a
>> question that needs a tradeoff.
>
>Is there any advantage to "random" init vs "kmeans" init?
>E.g. is "random" faster, yet would give similar clustering results? 

k-means++ is an improvement of k-means (random init).
The performance cost of using "k-means++" is tiny relative to the entire
clustering process.
I think keeping the old state is OK (make chooseInitialCenters package-private),
but right now "CentroidInitializer" is an intermediate state.

>
>> Should we make the API simple or rich?
>
>I'd keep it simple until we fix the (IMHO) more important
>issues of thread-safety and sparse data. 

Some personal opinion about "Sparse data":
Math ML does not use a matrix for clustering, but a list of vectors (double arrays);
 this is flexible and simple (this is the main attraction of Math ML to me).

Operations on sparse matrices can be faster than on dense matrices,
but not that much faster on sparse vectors, and the performance improvement
will bring complexity.

Anyhow, if we want to start this optimization, I think there are 2 ways:
1. Define an abstract Matrix, and make the Clusterer API work on Matrix; this
is a big change.
2. Define an abstract Vector, replace double[] with Vector, and implement a
series of operations on Vector, including a sparse Vector.
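Way 2 could be sketched roughly as follows. The names (Vector, SparseVector) are hypothetical stand-ins, not existing Commons Math types: an abstraction that double[]-based code could migrate to, plus a sparse implementation that stores only non-zero entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an abstract Vector with a sparse implementation.
public class SparseVectorSketch {
    interface Vector {
        int dimension();
        double get(int i);

        // A dot product that works for any mix of implementations.
        default double dot(Vector other) {
            double sum = 0;
            for (int i = 0; i < dimension(); i++) {
                sum += get(i) * other.get(i);
            }
            return sum;
        }
    }

    static class SparseVector implements Vector {
        private final int dimension;
        private final Map<Integer, Double> entries = new HashMap<>();

        SparseVector(int dimension) { this.dimension = dimension; }
        void set(int i, double v) { entries.put(i, v); }
        public int dimension() { return dimension; }
        public double get(int i) { return entries.getOrDefault(i, 0.0); }
    }

    public static void main(String[] args) {
        SparseVector a = new SparseVector(1000);
        a.set(3, 2.0);
        a.set(500, 4.0);
        SparseVector b = new SparseVector(1000);
        b.set(3, 5.0);
        // Only index 3 overlaps, so the dot product is 2.0 * 5.0.
        System.out.println(a.dot(b));
    }
}
```

A real implementation would of course iterate over the non-zero entries instead of the full dimension; the point is only that the Clusterer code would see one Vector type either way.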

>
>>>
 public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}
>>>
>>>Why should this method be defined?
>>
>> To let users get their clusterer parameters, the same as "getEmptyStrategy()"
>>
>
>Well, obviously.  But then, perhaps obviously too, the "user" is the
>one who called the constructor in the first place, and knows the
>value of all the arguments.
>And if we consider the general use-case, when client code is passed
>an instance as type "Clusterer", then the specific accessor methods
>are not available anymore (short of using "instanceof" and a cast).
>
>
>Regards,
>Gilles
>

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-03-25 Thread Gilles Sadowski
2020-03-25 4:54 UTC+01:00, chentao...@qq.com :
> Hi,
>
>>Hello.
>>
>>Le mar. 24 mars 2020 à 06:39, chentao...@qq.com  a écrit
>> :
>>>
>>> Hi,
>>>
>>> I have started 2 PRs to solve the problem you mentioned.
>>>
>>> About the "CentroidInitializer" I have a new idea:
>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusClusterer",
>>> and add a constructor parameter and a property "useKMeansPlusPlus" to
>>> "KMeansPlusPlusClusterer":
>>> ```java
>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends
>>> Clusterer<T> {
>>> public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>>final DistanceMeasure measure,
>>>final UniformRandomProvider random,
>>>final EmptyClusterStrategy emptyStrategy,
>>> + final boolean useKMeansPlusPlus) {
>>> // ...
>>> -  // Use K-means++ to choose the initial centers.
>>> -  this.centroidInitializer = new
>>> KMeansPlusPlusCentroidInitializer(measure, random);
>>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
>>> }
>>
>>What if one comes up with a third way to initialize the centroids?
>>If you can ensure that there is no other initialization procedure,
>>a boolean is fine, if not, we could still make the existing procedures
>>package-private (e.g. moving them in as classes defined within
>>"KMeansPlusPlusClusterer").
>
> As far as I know, k-means has two center initialization methods so far,
> random and k-means++;
> using a boolean to choose between them is good enough for current use,

Indeed but the question is: Will it be future-proof?
[We don't want to break compatibility (and have to release a
new major version of the library) for having overlooked the
question which I'm asking.]

> but there are two situations where users need to implement the center
> initialization method themselves:
> 1. Commons Math's implementation is not good enough;

As this is FLOSS, the understanding (IMO) is in that case users
would contribute back to fix what needs be.

> 2. There are new center initialization methods.

So, that would be arguing against using a boolean (?).
Cf. above (about breaking compatibility).

>>
>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
>>Perhaps we don't want to depart from the original (?) algorithm; if so,
>>the new constructor could be made protected (thus simplifying the API).
>
> k-means++ is the recommended center initialization method nowadays;
> whether we should let users fall back to randomly chosen centers is a
> question that needs a tradeoff.

Is there any advantage to "random" init vs "kmeans" init?
E.g. is "random" faster, yet would give similar clustering results?

> Should we make the API simple or rich?

I'd keep it simple until we fix the (IMHO) more important
issues of thread-safety and sparse data.

>>
>>> public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}
>>
>>Why should this method be defined?
>
> To let users get their clusterer parameters, the same as "getEmptyStrategy()"
>

Well, obviously.  But then, perhaps obviously too, the "user" is the
one who called the constructor in the first place, and knows the
value of all the arguments.
And if we consider the general use-case, when client code is passed
an instance as type "Clusterer", then the specific accessor methods
are not available anymore (short of using "instanceof" and a cast).
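The point about the base type can be shown with a toy illustration; the classes below are simplified stand-ins, not the real Commons Math hierarchy. Once client code holds only a "Clusterer" reference, a subclass-specific accessor is out of reach without instanceof and a cast.

```java
// Toy illustration: a subclass accessor is invisible through the base type.
public class AccessorVisibilitySketch {
    static abstract class Clusterer {
        abstract String name();
    }

    static class KMeansPlusPlusClusterer extends Clusterer {
        String name() { return "k-means++"; }
        boolean isUseKMeansPlusPlus() { return true; }
    }

    public static void main(String[] args) {
        Clusterer c = new KMeansPlusPlusClusterer();
        // c.isUseKMeansPlusPlus();  // does not compile: not declared on Clusterer.
        if (c instanceof KMeansPlusPlusClusterer) {
            // The accessor is only reachable via instanceof and a cast.
            System.out.println(((KMeansPlusPlusClusterer) c).isUseKMeansPlusPlus());
        }
    }
}
```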


Regards,
Gilles

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-03-24 Thread chentao...@qq.com
Hi,

>Hello.
>
>Le mar. 24 mars 2020 à 06:39, chentao...@qq.com  a écrit :
>>
>> Hi,
>>
>> I have started 2 PRs to solve the problem you mentioned.
>>
>> About the "CentroidInitializer" I have a new idea:
>> Move CentroidInitializers as inner classes of "KMeansPlusPlusClusterer",
>> and add a constructor parameter and a property "useKMeansPlusPlus" to 
>> "KMeansPlusPlusClusterer":
>> ```java
>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends 
>> Clusterer<T> {
>> public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>    final DistanceMeasure measure,
>>    final UniformRandomProvider random,
>>    final EmptyClusterStrategy emptyStrategy,
>> + final boolean useKMeansPlusPlus) {
>> // ...
>> -  // Use K-means++ to choose the initial centers.
>> -  this.centroidInitializer = new KMeansPlusPlusCentroidInitializer(measure, 
>> random);
>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
>> }
>
>What if one comes up with a third way to initialize the centroids?
>If you can ensure that there is no other initialization procedure,
>a boolean is fine, if not, we could still make the existing procedures
>package-private (e.g. moving them in as classes defined within
>"KMeansPlusPlusClusterer").

As far as I know, k-means has two center initialization methods so far, random
and k-means++;
using a boolean to choose between them is good enough for current use,
but there are two situations where users need to implement the center
initialization method themselves:
1. Commons Math's implementation is not good enough;
2. There are new center initialization methods.

>
>Also, in the current implementation of "KMeansPlusPlusClusterer", the
>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
>Perhaps we don't want to depart from the original (?) algorithm; if so,
>the new constructor could be made protected (thus simplifying the API). 

k-means++ is the recommended center initialization method nowadays;
whether we should let users fall back to randomly chosen centers is a question
that needs a tradeoff.
Should we make the API simple or rich?

>
>> public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}
>
>Why should this method be defined? 

To let users get their clusterer parameters, the same as "getEmptyStrategy()"

>
>Regards,
>Gilles
>
>> [...]
>

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-03-04 Thread chentao...@qq.com
Hi,

>Hi.
>
>Le ven. 28 févr. 2020 à 05:04, chentao...@qq.com  a écrit :
>>
>>
>>
>>
>> --
>> chentao...@qq.com
>> >Hi.
>> >
>> >Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com  a 
>> >écrit :
>> >>
>> >> Hi,
>> >>
>> >> > [...]
>> >> >> >>
>> >> >> >> Do you mean I should file a JIRA issue about 
>> >> >> >> reusing "centroidOf" and "chooseInitialCenters",
>> >> >> >> then start a PR and a discussion about "ClusterUtils"?
>> >> >> >> And then start the PR of "MiniBatchKMeansClusterer" after all is 
>> >> >> >> done?
>> >> >> >
>> >> >> >I cannot guarantee that the whole process will be streamlined.
>> >> >> >In effect, you can work on multiple branches (one for each
>> >> >> >prospective PR).
>> >> >> >I'd say that you should start by describing (here on the ML) the
>> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >> >> >base class).
>> >> >> >[Only when the design has been agreed on,  a JIRA issue to
>> >> >> >implement it should be created in order to track the actual
>> >> >> >coding work).]
>> >> >>
>> >> >> OK, I think we should start from here:
>> >> >>
>> >> >> The method "centroidOf"  and "chooseInitialCenters" in 
>> >> >> KMeansPlusPlusClusterer
>> >> >>  could be reused by other KMeans Clusterer like 
>> >> >>MiniBatchKMeansClusterer which I want to implement.
>> >> >>
>> >> >> There are two solutions for reusing "centroidOf" and 
>> >> >> "chooseInitialCenters":
>> >> >> 1. Extract an abstract class for KMeans clusterers named 
>> >> >> "AbstractKMeansClusterer",
>> >> >>  and move "centroidOf" and "chooseInitialCenters" into it as protected 
>> >> >> methods;
>> >> >>  the EmptyClusterStrategy and related logic can also move to the 
>> >> >> "AbstractKMeansClusterer".
>> >> >> 2. Create a static utility class, move "centroidOf" and 
>> >> >> "chooseInitialCenters" into it,
>> >> >>  and some useful clustering methods like predict (predict which cluster 
>> >> >> is best for a specified point) can be put in it.
>> >> >>
>> >> >
>> >> >At first sight, I prefer option 1.
>> >> >Indeed, among other things, "chooseInitialCenters" is a method that is 
>> >> >of no interest to
>> >> >users of the functionality (and so should not be part of the "public" 
>> >> >API).
>> >>
>> >> A persuasive explanation, and I agree with you that extracting an abstract 
>> >> class for KMeans is better.
>> >> And how can we reach a conclusion?
>> >> -
>> >>
>> >> Regarding the "public API", I suppose there should be a series of 
>> >> "CentroidInitializer" implementations
>> >>  that "chooseInitialCenters" with various algorithms.
>> >> The k-means++ cluster algorithm is a special implementation of k-means
>> >>  which initialize cluster centers with k-means++ algorithm.
>> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be 
>> >> "KMeansClusterer"
>> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it 
>> >> is a common k-means.
>> >>
>> >> --
>> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>> >> >the "Cluster"
>> >> >interface?  What is the difference with method "getCenter" (define by 
>> >> >class
>> >> >"CentroidCluster")?
>> >>
>> >> My understanding is:
>> >>  * "Cluster" is a data class that carries the result of a clustering;
>> >> "getCenter" is just a getter of CentroidCluster for the value of 
>> >> a center point.
>> >>  * "Cluster[er]" is an (interface of an) algorithm that classifies points 
>> >> to sets of Cluster.
>> >>  * "CentroidCluster" is the result of a group of special Clusterer 
>> >> algorithms like k-means;
>> >>  "centroidOf" is a specific logic to calculate the center point for a 
>> >> collection of points.
>> >> [Instead, the DBScan cluster algorithm does not care about the "Centroid"]
>> >>
>> >> So, "centroidOf" may be a method of "CentroidCluster[er]" (does not exist yet),
>> >>  but different from "CentroidCluster.getCenter".
>> >
>> >I may be missing something about the existing design,
>> >but it seems strange that "CentroidCluster" is initialized
>> >with a given "center", yet it is possible to add points after
>> >initialization (which IIUC would invalidate the "center").
>>
>> The "centroidOf" could be part of "CentroidCluster",
>> but I think the existing design focused on decoupling 
>> "DistanceMeasure" ("centroidOf" depends on it) from "CentroidCluster".
>
>I don't see why we need both "Cluster" and "CentroidCluster".
>Indeed, as suggested before, the "center" can be computed
>from a "Cluster", but does not need to be stored in it. 

A typical use case for a List is: when we need to classify a new point,
calculating the distance from the new point to each CentroidCluster.center is
the simplest approach,
and the center should be cached.
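That classification step could be sketched like this. The types are simplified, hypothetical stand-ins (the real CentroidCluster stores a Clusterable center, and a DistanceMeasure would normally replace the hard-coded squared Euclidean distance):

```java
import java.util.List;

// Sketch of the "Predict" use case: nearest cached center wins.
public class NearestCenterSketch {
    static class CentroidCluster {
        final double[] center;
        CentroidCluster(double[] center) { this.center = center; }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Return the index of the cluster whose center is nearest to the point.
    static int predict(List<CentroidCluster> clusters, double[] point) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < clusters.size(); i++) {
            double d = squaredDistance(clusters.get(i).center, point);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<CentroidCluster> clusters = List.of(
            new CentroidCluster(new double[] {0, 0}),
            new CentroidCluster(new double[] {10, 10}));
        // (9, 8) is nearer to (10, 10), so cluster index 1 is predicted.
        System.out.println(predict(clusters, new double[] {9, 8}));
    }
}
```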

>
>>
Center recalculation happens in each iteration of k-means clustering,
always along with points being reassigned to clusters.
We often 

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-02-28 Thread Gilles Sadowski
Hi.

Le ven. 28 févr. 2020 à 05:04, chentao...@qq.com  a écrit :
>
>
>
>
> --
> chentao...@qq.com
> >Hi.
> >
> >Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com  a écrit 
> >:
> >>
> >> Hi,
> >>
> >> > [...]
> >> >> >>
> >> >> >> Do you mean I should file a JIRA issue about reusing "centroidOf" 
> >> >> >> and "chooseInitialCenters",
> >> >> >> then start a PR and a discussion about "ClusterUtils"?
> >> >> >> And then start the PR of "MiniBatchKMeansClusterer" after all is 
> >> >> >> done?
> >> >> >
> >> >> >I cannot guarantee that the whole process will be streamlined.
> >> >> >In effect, you can work on multiple branches (one for each
> >> >> >prospective PR).
> >> >> >I'd say that you should start by describing (here on the ML) the
> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
> >> >> >base class).
> >> >> >[Only when the design has been agreed on,  a JIRA issue to
> >> >> >implement it should be created in order to track the actual
> >> >> >coding work).]
> >> >>
> >> >> OK, I think we should start from here:
> >> >>
> >> >> The method "centroidOf"  and "chooseInitialCenters" in 
> >> >> KMeansPlusPlusClusterer
> >> >>  could be reused by other KMeans Clusterer like 
> >> >> MiniBatchKMeansClusterer which I want to implement.
> >> >>
> >> >> There are two solutions for reusing "centroidOf" and 
> >> >> "chooseInitialCenters":
> >> >> 1. Extract an abstract class for KMeans clusterers named 
> >> >> "AbstractKMeansClusterer",
> >> >>  and move "centroidOf" and "chooseInitialCenters" into it as protected 
> >> >> methods;
> >> >>  the EmptyClusterStrategy and related logic can also move to the 
> >> >> "AbstractKMeansClusterer".
> >> >> 2. Create a static utility class, move "centroidOf" and 
> >> >> "chooseInitialCenters" into it,
> >> >>  and some useful clustering methods like predict (predict which cluster 
> >> >> is best for a specified point) can be put in it.
> >> >>
> >> >
> >> >At first sight, I prefer option 1.
> >> >Indeed, among other things, "chooseInitialCenters" is a method that is 
> >> >of no interest to
> >> >users of the functionality (and so should not be part of the "public" 
> >> >API).
> >>
> >> A persuasive explanation, and I agree with you that extracting an abstract 
> >> class for KMeans is better.
> >> And how can we reach a conclusion?
> >> -
> >>
> >> Regarding the "public API", I suppose there should be a series of 
> >> "CentroidInitializer" implementations
> >>  that "chooseInitialCenters" with various algorithms.
> >> The k-means++ cluster algorithm is a special implementation of k-means
> >>  which initialize cluster centers with k-means++ algorithm.
> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be 
> >> "KMeansClusterer"
> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it 
> >> is a common k-means.
> >>
> >> --
> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
> >> >the "Cluster"
> >> >interface?  What is the difference with method "getCenter" (define by 
> >> >class
> >> >"CentroidCluster")?
> >>
> >> My understanding is:
> >>  * "Cluster" is a data class that carries the result of a clustering;
> >> "getCenter" is just a getter of CentroidCluster for the value of a 
> >> center point.
> >>  * "Cluster[er]" is an (interface of an) algorithm that classifies points to 
> >> sets of Cluster.
> >>  * "CentroidCluster" is the result of a group of special Clusterer 
> >> algorithms like k-means;
> >>  "centroidOf" is a specific logic to calculate the center point for a 
> >> collection of points.
> >> [Instead, the DBScan cluster algorithm does not care about the "Centroid"]
> >>
> >> So, "centroidOf" may be a method of "CentroidCluster[er]" (does not exist yet),
> >>  but different from "CentroidCluster.getCenter".
> >
> >I may be missing something about the existing design,
> >but it seems strange that "CentroidCluster" is initialized
> >with a given "center", yet it is possible to add points after
> >initialization (which IIUC would invalidate the "center").
>
> The "centroidOf" could be part of "CentroidCluster",
> but I think the existing design focused on decoupling 
> "DistanceMeasure" ("centroidOf" depends on it) from "CentroidCluster".

I don't see why we need both "Cluster" and "CentroidCluster".
Indeed, as suggested before, the "center" can be computed
from a "Cluster", but does not need to be stored in it.
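Computing the center on demand could look like the sketch below. It is deliberately simplified: raw double[] points and a plain arithmetic mean, whereas the real Commons Math code works through Clusterable and a DistanceMeasure.

```java
import java.util.List;

// Sketch: derive a cluster's center from its points instead of storing it.
public class CentroidOfSketch {
    // Component-wise mean of the given points (assumed non-empty, same dimension).
    static double[] centroidOf(List<double[]> points) {
        int dim = points.get(0).length;
        double[] center = new double[dim];
        for (double[] p : points) {
            for (int i = 0; i < dim; i++) {
                center[i] += p[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            center[i] /= points.size();
        }
        return center;
    }

    public static void main(String[] args) {
        List<double[]> cluster = List.of(
            new double[] {0, 0},
            new double[] {2, 4});
        double[] c = centroidOf(cluster);
        System.out.println(c[0] + "," + c[1]);
    }
}
```

Caching the computed center (as discussed for the "Predict" phase) is then an optimization layered on top, not part of the data class itself.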

>
> Center recalculation happens in each iteration of k-means clustering,
> always along with points being reassigned to clusters.
> We often use k-means in two phases:
> Phase 1: Training, classify thousands of points into a set of clusters.
> Phase 2: Predict, predict which cluster is best for a new point,
> or add a new point to the best cluster in the ClusterSet,

Method "cluster" returns a "List"; there is no need for a
new "ClusterSet" class.

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

2020-02-27 Thread chentao...@qq.com



--
chentao...@qq.com
>Hi.
>
>Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com  a écrit :
>>
>> Hi,
>>
>> > [...]
>> >> >>
>> >> >> Do you mean I should file a JIRA issue about reusing "centroidOf" 
>> >> >> and "chooseInitialCenters",
>> >> >> then start a PR and a discussion about "ClusterUtils"?
>> >> >> And then start the PR of "MiniBatchKMeansClusterer" after all is 
>> >> >> done?
>> >> >
>> >> >I cannot guarantee that the whole process will be streamlined.
>> >> >In effect, you can work on multiple branches (one for each
>> >> >prospective PR).
>> >> >I'd say that you should start by describing (here on the ML) the
>> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >> >base class).
>> >> >[Only when the design has been agreed on,  a JIRA issue to
>> >> >implement it should be created in order to track the actual
>> >> >coding work).]
>> >>
>> >> OK, I think we should start from here:
>> >>
>> >> The method "centroidOf"  and "chooseInitialCenters" in 
>> >> KMeansPlusPlusClusterer
>> >>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer 
>> >>which I want to implement.
>> >>
>> >> There are two solutions for reusing "centroidOf" and "chooseInitialCenters":
>> >> 1. Extract an abstract class for KMeans clusterers named 
>> >> "AbstractKMeansClusterer",
>> >>  and move "centroidOf" and "chooseInitialCenters" into it as protected 
>> >> methods;
>> >>  the EmptyClusterStrategy and related logic can also move to the 
>> >> "AbstractKMeansClusterer".
>> >> 2. Create a static utility class, move "centroidOf" and 
>> >> "chooseInitialCenters" into it,
>> >>  and some useful clustering methods like predict (predict which cluster is 
>> >> best for a specified point) can be put in it.
>> >>
>> >
>> >At first sight, I prefer option 1.
>> >Indeed, among other things, "chooseInitialCenters" is a method that is of 
>> >no interest to
>> >users of the functionality (and so should not be part of the "public" API).
>>
>> A persuasive explanation, and I agree with you that extracting an abstract 
>> class for KMeans is better.
>> And how can we reach a conclusion?
>> -
>>
>> Regarding the "public API", I suppose there should be a series of 
>> "CentroidInitializer" implementations
>>  that "chooseInitialCenters" with various algorithms.
>> The k-means++ cluster algorithm is a special implementation of k-means
>>  which initialize cluster centers with k-means++ algorithm.
>> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be 
>> "KMeansClusterer"
>>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it is 
>> a common k-means.
>>
>> --
>> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>> >the "Cluster"
>> >interface?  What is the difference with method "getCenter" (define by class
>> >"CentroidCluster")?
>>
>> My understanding is:
>>  * "Cluster" is a data class that carries the result of a clustering;
>> "getCenter" is just a getter of CentroidCluster for the value of a 
>> center point.
>>  * "Cluster[er]" is an (interface of an) algorithm that classifies points to 
>> sets of Cluster.
>>  * "CentroidCluster" is the result of a group of special Clusterer algorithms 
>> like k-means;
>>  "centroidOf" is a specific logic to calculate the center point for a 
>> collection of points.
>> [Instead, the DBScan cluster algorithm does not care about the "Centroid"]
>>
>> So, "centroidOf" may be a method of "CentroidCluster[er]" (does not exist yet),
>>  but different from "CentroidCluster.getCenter".
>
>I may be missing something about the existing design,
>but it seems strange that "CentroidCluster" is initialized
>with a given "center", yet it is possible to add points after
>initialization (which IIUC would invalidate the "center"). 

The "centroidOf" could be part of "CentroidCluster",
but I think the existing design focused on decoupling 
"DistanceMeasure" ("centroidOf" depends on it) from "CentroidCluster".

Center recalculation happens in each iteration of k-means clustering, 
always along with points being reassigned to clusters.
We often use k-means in two phases:
Phase 1: Training, classify thousands of points into a set of clusters.
Phase 2: Predict, predict which cluster is best for a new point,
or add a new point to the best cluster in the ClusterSet,
but we never update the cluster center until the next retraining.

The KMeansPlusPlusClusterer and the other clustering algorithms in 
"commons-math" are designed only for the "Training" phase;
it is clearer if we consider "CentroidCluster" as a pure data class just 
for the k-means clustering result.

If we want the cluster result to be useful enough for the "Predict" phase,
 the result of "KMeansPlusPlusClusterer.cluster" should return a "ClusterSet":
```java
public interface ClusterSet<T extends Clusterable> extends Collection<Cluster<T>> {
  // Return the cluster which the point should belong to.
  Cluster<T> predict(T