I'm sure someone fixed something like this a while ago but yes I still see this in the code. Search JIRA and file a bug
On Sun, Jan 13, 2013 at 2:20 AM, sam wu <[email protected]> wrote: > Sorry for my previous email, apparently not yet finished but get sent out. > > Here is the complete one > > Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform > random. > > org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem > -------- > line 96-109 > > if (currentSize < k) { > > //select > > } else if (random.nextInt(currentSize + 1) != 0) { // with chance > 1/(currentSize+1) pick new element > > int indexToRemove = random.nextInt(currentSize); // evict one > chosen randomly > > // replace with new > > } > > ------- > > again this is not uniform random. > > later sample will get much higher probability to be selected than beginning > sample. > > because currentSize stay to be k after initial k samples. and new sample > will be picked with 1/(k+1) probability. > > So, all ending samples will be selected with much higher prob. > > In case of 1000 samples, k=3 , most likely selected 3 samples will be > 980 > > > Sam > > On Sat, Jan 12, 2013 at 6:07 PM, sam wu <[email protected]> wrote: > >> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform >> random. >> >> org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem >> >> line 96-109 >> >> if (currentSize < k) { >> >> //select >> >> } else if (random.nextInt(currentSize + 1) != 0) { // with >> chance 1/(currentSize+1) pick new element >> >> int indexToRemove = random.nextInt(currentSize); // evict one >> chosen randomly >> >> // replace with new >> >> } >>
