Check your convergence criteria. The iterations end when either: a) maxIterations has been reached, or b) all of the clusters have converged. If the clusters did not converge before maxIterations in either run, then both runs perform the full number of iterations, so chaining them together won't change the times.
-----Original Message-----
From: David Saile [mailto:[email protected]]
Sent: Thursday, May 12, 2011 10:09 AM
To: [email protected]
Subject: Re: AW: Incremental clustering

I had that same thought, so I actually tried running k-Means twice on the Reuters dataset (as described in the Quickstart). The second run received the resulting clusters of the first run as input. However, the execution times of the two runs did not differ much (the second run was actually a bit slower). I also tried doubling the input or the number of iterations, but saw no improvement.

Could this be caused by running Hadoop on a single machine? Or is the number of iterations, at 20 (or 40), simply not high enough?

David

On 12.05.2011, at 18:46, Jeff Eastman wrote:
> Also, if cluster training begins with the posterior from a previous training
> session over the corpus but with new data added since that training began,
> the prior clusters should be very close to an optimal solution with the new
> data, and the number of iterations required to converge on a new posterior
> should be reduced. I haven't tried this in practice, but it seems logical.
> Convergence is calculated by how much each cluster has changed during an
> iteration.
>
> -----Original Message-----
> From: Benson Margulies [mailto:[email protected]]
> Sent: Thursday, May 12, 2011 9:14 AM
> To: [email protected]
> Subject: Re: AW: Incremental clustering
>
> Is the idea here that you are going to be presented with many
> different corpora that have some sort of overall resemblance, so that
> priors derived from the first N speed up clustering N+1?
>
> --benson
>
