So no, apparently the problem's still there. With the most recent code, I get:

Average distance in cluster 0 [1]: 0.000000
Average distance in cluster 1 [18775]: 63.839819
Average distance in cluster 2 [11]: 448.706077
Average distance in cluster 3 [1]: 0.000000
Average distance in cluster 4 [8]: 213.629578
Average distance in cluster 5 [1]: 0.000000
Average distance in cluster 6 [10]: 369.592682
Average distance in cluster 7 [1]: 0.000000
Average distance in cluster 8 [2]: 31.061103
Average distance in cluster 9 [1]: 0.000000
Average distance in cluster 10 [2]: 309.934857
Average distance in cluster 11 [1]: 0.000000
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [1]: 0.000000
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [4]: 229.180504
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [3]: 336.835246
Average distance in cluster 18 [2]: 76.485594
Average distance in cluster 19 [1]: 0.000000
Num clusters: 20; maxDistance: 724.060033

I'll have to recheck. :/

On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <[email protected]> wrote:
> Hot damn!
>
> Well spotted.
>
> On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
> <[email protected]> wrote:
>
>> Ted, remember we talked about this last week?
>>
>> The problem was (I think it's fixed now) that when I was asking for 20
>> clusters, every mapper would give me 20 clusters (rather than k log n
>> ~ 200) and the points clumped together resulting in one cluster with
>> the vast majority of the points ~17K out the ~19K.
>>
>> Now that I fixed that added more tests that seem to be confirming all
>> StreamingKMeans implementations get about the same results (whether
>> they're local or MapReduce) and the multiple restarts of BallKMeans,
>> I'm expecting it to be a lot better.
>>
>> Actual data tests coming soon (please check that new cluster thread). :)
>>
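
P.S. For anyone following along, here is a quick sanity check on the numbers in this thread. It is only a rough sketch: the sizes are the bracketed counts from the run reported in this message, and it assumes the "k log n" sketch size uses the natural log (that is what lands near the ~200 quoted; the code may well use a different base or constant).

```python
import math

# Cluster sizes: the bracketed counts from the run reported in this message.
sizes = [1, 18775, 11, 1, 8, 1, 10, 1, 2, 1,
         2, 1, 1, 1, 1, 4, 1, 3, 2, 1]

n = sum(sizes)   # total number of points that were clustered
k = len(sizes)   # number of final clusters requested

# Share of all points that ended up in the single largest cluster.
largest_share = max(sizes) / n

# Rough intermediate sketch size per the "k log n" rule of thumb.
# Assumption: natural log; not confirmed against the actual implementation.
sketch_size = k * math.log(n)

print(f"n = {n}, k = {k}")
print(f"largest cluster holds {largest_share:.1%} of the points")
print(f"k * ln(n) is roughly {sketch_size:.0f}")
```

So the clumping is, if anything, worse than the ~17K out of ~19K from before: nearly everything sits in cluster 1.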
