Re: A note about MLlib's StandardScaler

2017-01-09 Thread Sean Owen
This could be true if you knew you were just going to scale the input to StandardScaler and nothing else. It's probably more typical that you'd scale some other data. The current behavior is therefore the sensible default, because the input is a sample of some unknown larger population. I think it does

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Liang-Chi Hsieh
Actually I think it is possible that a user/developer needs the standardized features with population mean and std in some cases. It would be better if StandardScaler could offer the option to do that. Holden Karau wrote > Hi Gilad, > > Spark uses the sample standard variance inside of the Stan

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Holden Karau
Hi Gilad, Spark uses the sample standard variance inside of the StandardScaler (see https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler ) which I think would explain the results you are seeing. I believe the scalers are intended to
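The distinction Holden points to can be illustrated outside Spark. Python's standard-library statistics module exposes both estimators: stdev divides by (n - 1), as Spark's StandardScaler does, while pstdev divides by n. A minimal sketch (plain Python, not Spark code):

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # mean = 3.0, sum of squared deviations = 10.0

# Sample standard deviation: divides by (n - 1). This is what Spark's
# StandardScaler uses internally.
sample_std = statistics.stdev(data)       # sqrt(10/4) ~= 1.5811

# Population standard deviation: divides by n.
population_std = statistics.pstdev(data)  # sqrt(10/5) ~= 1.4142

print(sample_std, population_std)
```

For small n the gap between the two is noticeable, which is why results computed by hand with the population formula won't match Spark's output.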

A note about MLlib's StandardScaler

2017-01-08 Thread Gilad Barkan
Hi, It seems that the output of MLlib's *StandardScaler*(*withMean*=True, *withStd*=True) is not as expected. The above configuration is expected to do the following transformation: X -> Y = (X - Mean)/Std (Eq. 1). This transformation (a.k.a. standardization) should result in a "standardized" vecto
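The discrepancy Gilad describes can be reproduced in plain Python: applying Eq. 1 with the population std yields a result whose population std is exactly 1 by construction, while dividing by the sample std (Spark's choice) leaves a population std of sqrt((n-1)/n) < 1. A hypothetical sketch, not Spark code:

```python
import statistics

x = [2.0, 4.0, 6.0, 8.0]      # n = 4
mean = statistics.mean(x)     # 5.0

# Eq. 1 using the population std: the output has population std exactly 1.
y_pop = [(v - mean) / statistics.pstdev(x) for v in x]

# The same transform using the sample std, as Spark's StandardScaler does:
y_sample = [(v - mean) / statistics.stdev(x) for v in x]

print(statistics.pstdev(y_pop))     # exactly 1.0
print(statistics.pstdev(y_sample))  # sqrt(3/4) ~= 0.866, not 1.0
```

So the transformed features are not "standardized" in the strict population sense, which matches the observation that started this thread.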