Mahout may need to extract the random sampling as an independent class since it's a very often used algorithm.
On Sat, Jan 12, 2013 at 8:30 PM, Sean Owen <[email protected]> wrote: > I'm sure someone fixed something like this a while ago but yes I still > see this in the code. Search JIRA and file a bug > > On Sun, Jan 13, 2013 at 2:20 AM, sam wu <[email protected]> wrote: >> Sorry for my previous email, apparently not yet finished but get sent out. >> >> Here is the complete one >> >> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform >> random. >> >> org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem >> -------- >> line 96-109 >> >> if (currentSize < k) { >> >> //select >> >> } else if (random.nextInt(currentSize + 1) != 0) { // with chance >> 1/(currentSize+1) pick new element >> >> int indexToRemove = random.nextInt(currentSize); // evict one >> chosen randomly >> >> // replace with new >> >> } >> >> ------- >> >> again this is not uniform random. >> >> later sample will get much higher probability to be selected than beginning >> sample. >> >> because currentSize stay to be k after initial k samples. and new sample >> will be picked with 1/(k+1) probability. >> >> So, all ending samples will be selected with much higher prob. >> >> In case of 1000 samples, k=3 , most likely selected 3 samples will be > 980 >> >> >> Sam >> >> On Sat, Jan 12, 2013 at 6:07 PM, sam wu <[email protected]> wrote: >> >>> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform >>> random. >>> >>> org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem >>> >>> line 96-109 >>> >>> if (currentSize < k) { >>> >>> //select >>> >>> } else if (random.nextInt(currentSize + 1) != 0) { // with >>> chance 1/(currentSize+1) pick new element >>> >>> int indexToRemove = random.nextInt(currentSize); // evict one >>> chosen randomly >>> >>> // replace with new >>> >>> } >>> -- Shengchao Ding
