I saw Thomas’ patch in https://issues.apache.org/jira/browse/MATH-959, which aims to add support for HAC (hierarchical agglomerative clustering) to commons-math. I am currently faced with a use case and wonder whether, and how, it could be handled either with the existing methods or with the HAC algorithm proposed there.
Let's assume we have 1000 items to cluster. Each item represents a sequence, e.g. AB, AC, AD, …, BA, BB, BC, …, ZA, …, ZZ, and I can assign data points to each item which can be used to calculate their similarity/distance. My goal is to create 50 clusters containing all sequences. This is pretty straightforward with KMeans++.

However, let's assume we want a hierarchical clustering, with 10 clusters at level 1 and 50 at level 2. At level 1, I have the restriction that the first element of a sequence must be assigned to exactly one cluster, i.e. the structure should look something like this:

Cluster1: A, B, C
    Cluster1.1: AA, AC, AD, AE, …, BD, BE, BF, … CA
    Cluster1.2: AB, BA, BC, BD, CB, CC, CD, …
    …
    Cluster1.7: AY, AZ, BZ, CZ    // cluster 1 has 7 subclusters
Cluster2: D, E, F
    …
Cluster3: G
    Cluster3.1: GA, GB, …, GU
    Cluster3.2: GV, GW, …, GZ    // note that cluster 3 has only 2 subclusters
Cluster4: H, I
    …
Cluster10: W, X, Y, Z
// all subclusters from cluster 1 to cluster 10 should add up to 50

Hence, every sequence in a level 2 cluster needs to have its sequence prefix in the parent cluster. Furthermore, even though I want 10 clusters on level 1 and 50 on level 2, this does not mean that each level 1 cluster should necessarily have 5 child clusters.

I hope this is clear enough to convey the general restriction I want to enforce, and I wonder how it could be implemented using the clustering algorithms in commons-math.

Cheers,
Thorsten
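P.S. To make the uneven split concrete, here is a minimal sketch of one way the 50 level 2 clusters might be distributed over the 10 level 1 clusters before each group is clustered separately. It is plain Java and not part of commons-math. The class SubclusterAllocation, the method allocate, and the rule of sizing each parent's share by its item count are all assumptions made for illustration, and the item counts in main are made up.

```java
import java.util.Arrays;

// Hypothetical helper, not a commons-math class.
final class SubclusterAllocation {

    /**
     * Splits totalChildren sub-clusters across parent clusters.
     * Assumption: every parent gets at least one child, and the rest are
     * shared out in proportion to the number of items in each parent
     * (largest-remainder rounding).
     */
    public static int[] allocate(int[] parentSizes, int totalChildren) {
        int n = parentSizes.length;
        if (totalChildren < n) {
            throw new IllegalArgumentException("need at least one child per parent");
        }
        long totalItems = 0;
        for (int size : parentSizes) {
            if (size <= 0) {
                throw new IllegalArgumentException("parent sizes must be positive");
            }
            totalItems += size;
        }

        int[] children = new int[n];
        Arrays.fill(children, 1);           // guaranteed minimum per parent
        int spare = totalChildren - n;      // children still to hand out

        double[] remainder = new double[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            double share = (double) spare * parentSizes[i] / totalItems;
            int whole = (int) Math.floor(share);
            children[i] += whole;
            remainder[i] = share - whole;
            assigned += whole;
        }

        // Give the leftover children to the parents with the largest remainders.
        for (int k = assigned; k < spare; k++) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (remainder[i] > remainder[best]) {
                    best = i;
                }
            }
            children[best]++;
            remainder[best] = -1.0;         // do not pick this parent again
        }
        return children;
    }

    public static void main(String[] args) {
        // Made-up item counts for the 10 level 1 clusters (1000 items in total).
        int[] sizes = {120, 115, 40, 75, 110, 90, 130, 60, 105, 155};
        System.out.println(Arrays.toString(allocate(sizes, 50)));
    }
}
```

Each level 1 group could then be clustered on its own with the allocated number of clusters, for example with KMeans++, so that every level 2 cluster only contains sequences whose prefix belongs to its parent. Whether item count is the right basis for the split (rather than, say, the spread of the data within each group) is an open question.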
