I'm sure someone fixed something like this a while ago, but yes, I still
see this in the code. Search JIRA and file a bug.

On Sun, Jan 13, 2013 at 2:20 AM, sam wu <[email protected]> wrote:
> Sorry for my previous email; it was sent before I had finished it.
>
> Here is the complete one:
>
> As in MIA ch09 RandomPointUtil.java, the random selection is not
> uniformly random.
>
> Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator.java
> --------
> lines 96-109
>
>           if (currentSize < k) {
>             //select
>           } else if (random.nextInt(currentSize + 1) != 0) { // with chance 1/(currentSize+1) pick new element
>             int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
>             // replace with new
>           }
>
> -------
>
> Again, this is not uniformly random.
>
> Later samples get a much higher probability of being selected than
> earlier ones, because currentSize stays at k after the initial k samples,
> so each new sample is picked with the same fixed probability no matter how
> many samples have already been seen. (The comment says 1/(k+1); the
> "!= 0" test as written gives k/(k+1).)
>
> So the samples near the end are selected with much higher probability.
>
> With 1000 samples and k=3, the 3 selected samples will most likely have
> indices > 980.
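
For comparison, here is a minimal sketch that puts the quoted branch next to
textbook reservoir sampling (Algorithm R), where the item at 0-based position
i (for i >= k) is kept with probability k/(i+1). This is not the Mahout code
and not a proposed patch. The class and method names (ReservoirSketch,
biasedSample, uniformSample, meanSelectedIndex) are hypothetical, and it
assumes the items are simply the indices 0..n-1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration only; not taken from Mahout. */
class ReservoirSketch {

  /** Mirrors the quoted branch: the acceptance test ignores how many items were seen. */
  static List<Integer> biasedSample(int n, int k, Random random) {
    List<Integer> chosen = new ArrayList<>(k);
    for (int item = 0; item < n; item++) {
      int currentSize = chosen.size();
      if (currentSize < k) {
        chosen.add(item); // fill the first k slots
      } else if (random.nextInt(currentSize + 1) != 0) {
        // true with probability currentSize / (currentSize + 1), for every later item
        chosen.set(random.nextInt(currentSize), item); // evict one chosen randomly
      }
    }
    return chosen;
  }

  /** Algorithm R: keep the item at 0-based position seen with probability k / (seen + 1). */
  static List<Integer> uniformSample(int n, int k, Random random) {
    List<Integer> chosen = new ArrayList<>(k);
    for (int seen = 0; seen < n; seen++) {
      if (chosen.size() < k) {
        chosen.add(seen); // fill the first k slots
      } else {
        int j = random.nextInt(seen + 1); // uniform in [0, seen]
        if (j < k) {
          chosen.set(j, seen); // replace slot j
        }
      }
    }
    return chosen;
  }

  /** Average selected index over many trials; close to (n - 1) / 2 for a uniform sampler. */
  static double meanSelectedIndex(boolean uniform, int n, int k, int trials, long seed) {
    Random random = new Random(seed);
    long sum = 0;
    for (int t = 0; t < trials; t++) {
      List<Integer> sample = uniform ? uniformSample(n, k, random) : biasedSample(n, k, random);
      for (int index : sample) {
        sum += index;
      }
    }
    return (double) sum / ((long) trials * k);
  }

  public static void main(String[] args) {
    int n = 1000, k = 3, trials = 2000;
    System.out.printf("biased  mean index: %.1f%n", meanSelectedIndex(false, n, k, trials, 42L));
    System.out.printf("uniform mean index: %.1f%n", meanSelectedIndex(true, n, k, trials, 42L));
  }
}
```

Running main with n=1000 and k=3 should show the biased variant's mean
selected index sitting near the end of the range, while the Algorithm R
variant stays near the middle, which is what the argument above predicts.
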
>
>
> Sam
>
