Sorry for my previous email; it was sent before I had finished writing it.
Here is the complete version.
As with MIA ch09 RandomPointUtil.java, the random selection here is not
uniformly random.
Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator.java
--------
lines 96-109
if (currentSize < k) {
  // select
} else if (random.nextInt(currentSize + 1) != 0) { // with chance 1/(currentSize+1) pick new element
  int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
  // replace with new
}
-------
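Again, this is not uniformly random.
Later samples have a much higher probability of being selected than earlier
samples, because currentSize stays at k after the first k samples, so the
acceptance test never depends on how many samples have been seen.
The comment says a new sample is picked with probability 1/(k+1); as written,
the "!= 0" test actually picks it with probability k/(k+1). Either way the
probability is fixed, so early picks keep getting evicted and the final
samples are rarely displaced.
With 1000 samples and k=3, the 3 selected samples will most likely all have
an index > 980: under the "!= 0" test a chosen sample survives each later step
with probability 3/4, so it survives the last 20 steps with probability
(3/4)^20, about 0.3%.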
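For comparison, here is a minimal sketch of standard reservoir sampling (Algorithm R), which does give every sample the same chance k/n. It is not the code in RandomSeedGenerator, only an illustration of the idea; the class name ReservoirSketch, the method sample and the counter seen are hypothetical names I made up for this example. The key difference is that the random draw is bounded by the number of samples seen so far, not by the size of the chosen list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class, not part of Mahout.
final class ReservoirSketch {

  /** Picks up to k of the items in one pass, each with equal probability. */
  static <T> List<T> sample(Iterable<T> items, int k, Random random) {
    List<T> chosen = new ArrayList<>();
    int seen = 0; // number of items read so far, NOT the size of 'chosen'
    for (T item : items) {
      seen++;
      if (chosen.size() < k) {
        chosen.add(item); // the first k items fill the reservoir
      } else {
        // Draw j uniformly from [0, seen). The new item is kept only when
        // j < k, i.e. with probability k/seen, and then overwrites slot j.
        int j = random.nextInt(seen);
        if (j < k) {
          chosen.set(j, item);
        }
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      indices.add(i);
    }
    // Fixed seed only so that repeated runs print the same selection.
    System.out.println(sample(indices, 3, new Random(42)));
  }
}
```

Because the acceptance probability k/seen shrinks as more samples arrive, late samples no longer crowd out early ones.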
Sam
On Sat, Jan 12, 2013 at 6:07 PM, sam wu <[email protected]> wrote:
> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform
> random.
>
> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.java problem
>
> line 96-109
>
> if (currentSize < k) {
>   //select
> } else if (random.nextInt(currentSize + 1) != 0) { // with chance 1/(currentSize+1) pick new element
>   int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
>   // replace with new
> }
>