Re: [MLlib] kmeans random initialization, same seed every time

2017-03-15 Thread Yuhao Yang
Hi Julian,

Thanks for reporting this. This is a valid issue and I created
https://issues.apache.org/jira/browse/SPARK-19957 to track it.

Right now the seed defaults to this.getClass.getName.hashCode.toLong,
which indeed stays the same across multiple fits. Feel free to
leave your comments or send a PR for the fix.
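
To illustrate why that default never changes, here is a minimal sketch in
plain Scala. DemoEstimator is a hypothetical stand-in, not Spark's actual
class; only the seed expression is taken from the description above.

```scala
// Hypothetical stand-in for an estimator; this is NOT Spark's KMeans class.
class DemoEstimator {
  // Same expression as the default seed described above.
  val defaultSeed: Long = this.getClass.getName.hashCode.toLong
}

object SeedDemo extends App {
  val a = new DemoEstimator().defaultSeed
  val b = new DemoEstimator().defaultSeed
  // String.hashCode is a pure function of the characters in the class name,
  // so every instance (and every JVM run) derives the same seed.
  println(a == b) // true
}
```

Since the value depends only on the class name, creating a new estimator or
restarting the application does not change it.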

In the meantime, you can call .setSeed(new Random().nextLong()) before
fit() as a workaround.
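
A rough sketch of that workaround with the DataFrame-based API might look
like the following. The toy data, the column name "features", k = 2, the
local master and the object name are all assumptions made for illustration;
only setSeed(new Random().nextLong()) comes from the suggestion above.

```scala
import java.util.Random

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansRandomSeed {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .appName("kmeans-random-seed")
      .master("local[*]")
      .getOrCreate()

    // Toy data set (assumption): four 2-D points in a "features" column.
    val points = Seq(
      Vectors.dense(0.0, 0.0),
      Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.1),
      Vectors.dense(9.2, 8.9)
    )
    val dataset = spark.createDataFrame(points.map(Tuple1.apply)).toDF("features")

    // Draw a fresh seed for each fit instead of relying on the fixed default.
    val seed = new Random().nextLong()
    val kmeans = new KMeans()
      .setK(2)                // hypothetical number of clusters
      .setInitMode("random")  // the initialization mode discussed in this thread
      .setSeed(seed)

    val model = kmeans.fit(dataset)

    // Log the seed so an interesting run can be reproduced later.
    println(s"seed = $seed")
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```

Printing the seed keeps a run reproducible: passing the logged value to
setSeed should give the same initialization again.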

Thanks,
Yuhao

2017-03-14 5:46 GMT-07:00 Julian Keppel:

> I'm sorry, I left out some important information. I use Spark version 2.0.2
> with Scala 2.11.8.
>
> 2017-03-14 13:44 GMT+01:00 Julian Keppel:
>
>> Hi everybody,
>>
>> I am running some experiments with the Spark k-means implementation of the new
>> DataFrame API, comparing clustering results of different runs with
>> different parameters. I noticed that for the random initialization mode, the
>> seed value is the same every time. How is it calculated? In my
>> understanding, the seed should be random if it is not provided by the user.
>>
>> Thank you for your help.
>>
>> Julian
>>
>
>


Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
I'm sorry, I left out some important information. I use Spark version 2.0.2
with Scala 2.11.8.

2017-03-14 13:44 GMT+01:00 Julian Keppel:

> Hi everybody,
>
> I am running some experiments with the Spark k-means implementation of the new
> DataFrame API, comparing clustering results of different runs with
> different parameters. I noticed that for the random initialization mode, the
> seed value is the same every time. How is it calculated? In my
> understanding, the seed should be random if it is not provided by the user.
>
> Thank you for your help.
>
> Julian
>


[MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
Hi everybody,

I am running some experiments with the Spark k-means implementation of the new
DataFrame API, comparing clustering results of different runs with
different parameters. I noticed that for the random initialization mode, the
seed value is the same every time. How is it calculated? In my
understanding, the seed should be random if it is not provided by the user.

Thank you for your help.

Julian