I saw Thomas' patch in https://issues.apache.org/jira/browse/MATH-959, which 
aims to add support for HAC (hierarchical agglomerative clustering) to 
commons-math. I am currently faced with a use case and wonder if and how it 
could be handled, either with existing methods or with the HAC algorithm 
proposed there.

Let's assume we have 1000 items to cluster. Each item represents a sequence, 
e.g. AB, AC, AD, …, BA, BB, BC, …, ZA, …, ZZ, and I can assign data points to 
each item which can be used to calculate their similarity/distance. My goal is 
to create 50 clusters containing all sequences; this is pretty 
straightforward using KMeans++.
However, let's assume we want a hierarchical clustering, with 10 clusters at 
level 1 and 50 at level 2. At level 1, I have the restriction that the first 
element of each sequence must be assigned to exactly one cluster, e.g., the 
structure should look something like this:
Cluster1: A, B, C
Cluster1.1: AA, AC, AD, AE, …, BD, BE, BF, … CA
Cluster1.2: AB,BA,BC,BD,CB,CC,CD, …
…
Cluster1.7: AY,AZ,BZ,CZ
// cluster 1 has 7 subclusters.
Cluster2: D, E, F
…
Cluster3: G
Cluster3.1: GA,GB …, GU
Cluster3.2: GV, GW,… GZ
// note that cluster 3 has only 2 sub clusters
Cluster4: H, I
…
Cluster10: W, X, Y, Z
// all sub clusters from cluster1 to cluster10 should add up to 50

Hence, all sequences in a level 2 cluster need to have their sequence prefix 
in the parent cluster. Furthermore, even though I want 10 clusters on level 1 
and 50 on level 2, this does not mean that each level 1 cluster should 
necessarily have 5 child clusters.
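
To make the restriction concrete, here is a minimal sketch in plain Java of a 
check that a finished two-level result would have to pass. It does not use any 
commons-math API; the class name PrefixConstraint, the method isValid and the 
map-based representation are my own assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of commons-math.
final class PrefixConstraint {

    /**
     * Checks the two-level restriction.
     *
     * prefixToParent maps a first element (e.g. 'A') to the index of the one
     * level 1 cluster that owns it. subClustersByParent maps a level 1 index
     * to its level 2 clusters, each given as a list of non-empty sequences.
     *
     * Returns true only if every sequence sits below the level 1 cluster that
     * owns its first element and the total number of level 2 clusters equals
     * expectedSubClusters (50 in the example above).
     */
    static boolean isValid(Map<Character, Integer> prefixToParent,
                           Map<Integer, List<List<String>>> subClustersByParent,
                           int expectedSubClusters) {
        int count = 0;
        for (Map.Entry<Integer, List<List<String>>> entry : subClustersByParent.entrySet()) {
            int parent = entry.getKey();
            for (List<String> subCluster : entry.getValue()) {
                count++;
                for (String sequence : subCluster) {
                    Integer owner = prefixToParent.get(sequence.charAt(0));
                    // Unknown prefix, or prefix owned by a different parent.
                    if (owner == null || owner != parent) {
                        return false;
                    }
                }
            }
        }
        return count == expectedSubClusters;
    }
}
```

Because prefixToParent is a map, each first element can only belong to one 
level 1 cluster, which is exactly the uniqueness restriction described above.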

I hope it's clear enough to convey the general restriction I want to enforce, 
and I wonder how this could be implemented using the clustering algorithms in 
commons-math.
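
One direction I could imagine, purely as a sketch, is a two-stage approach: 
first fix the 10 level 1 clusters, then split the budget of 50 level 2 
clusters across them and cluster each parent's sequences separately (e.g. with 
KMeans++). The snippet below only covers the budget split. It assumes the 
split should be proportional to the number of sequences per parent, with at 
least one sub-cluster each; that rule, the class name SubClusterBudget and the 
method allocate are my own assumptions, not anything taken from commons-math.

```java
import java.util.Arrays;

// Hypothetical helper, not part of commons-math.
final class SubClusterBudget {

    /**
     * Splits `total` level 2 clusters across the level 1 clusters in
     * proportion to their sizes (largest-remainder rounding), giving every
     * level 1 cluster at least one sub-cluster.
     */
    static int[] allocate(int[] parentSizes, int total) {
        int n = parentSizes.length;
        if (n == 0 || total < n) {
            throw new IllegalArgumentException("need at least one sub-cluster per parent");
        }
        long sum = 0;
        for (int size : parentSizes) {
            sum += size;
        }
        if (sum <= 0) {
            throw new IllegalArgumentException("parent sizes must sum to a positive number");
        }

        int[] counts = new int[n];
        Arrays.fill(counts, 1);          // one guaranteed sub-cluster per parent
        int spare = total - n;           // what is left to hand out proportionally

        long[] remainders = new long[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            long scaled = (long) spare * parentSizes[i];
            int whole = (int) (scaled / sum);
            counts[i] += whole;
            remainders[i] = scaled % sum;
            assigned += whole;
        }

        // Hand the leftover units to the parents with the largest remainders.
        for (int left = spare - assigned; left > 0; left--) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (remainders[i] > remainders[best]) {
                    best = i;
                }
            }
            counts[best]++;
            remainders[best] = -1;       // do not pick the same parent twice
        }
        return counts;
    }
}
```

Note that this sketch does not check that a parent actually has at least as 
many sequences as the sub-clusters it is given, which a real implementation 
would need before running the per-parent clustering.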

Cheers,

Thorsten
