Sorry for my previous email, apparently not yet finished but get sent out.

Here is the complete one

Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform
random.

org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem
--------
line 96-109

if (currentSize < k) {

            //select

          } else if (random.nextInt(currentSize + 1) != 0) { // with chance
1/(currentSize+1) pick new element

            int indexToRemove = random.nextInt(currentSize); // evict one
chosen randomly

            // replace with new

          }

-------

again this is not uniform random.

later sample will get much higher probability to be selected than beginning
sample.

because currentSize stay to be k after initial k samples. and new sample
will be picked with 1/(k+1) probability.

So, all ending samples will be selected with much higher prob.

In case of 1000 samples, k=3 , most likely selected 3 samples will be > 980


Sam

On Sat, Jan 12, 2013 at 6:07 PM, sam wu <[email protected]> wrote:

> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform
> random.
>
> org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem
>
> line 96-109
>
> if (currentSize < k) {
>
>             //select
>
>           } else if (random.nextInt(currentSize + 1) != 0) { // with
> chance 1/(currentSize+1) pick new element
>
>             int indexToRemove = random.nextInt(currentSize); // evict one
> chosen randomly
>
>             // replace with new
>
>           }
>

Reply via email to