[ 
https://issues.apache.org/jira/browse/SPARK-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1328:
-----------------------------------

    Assignee: Xusen Yin

> Current implementation of Standard Deviation in MLUtils may cause 
> catastrophic cancellation and loss of precision.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-1328
>                 URL: https://issues.apache.org/jira/browse/SPARK-1328
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 0.9.0
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>              Labels: MLLib, statistics, vector
>             Fix For: 1.0.0
>
>
> Standard deviation (SD) is used for dataset normalization, which is useful 
> when training models such as Lasso. The current implementation of SD uses the 
> second-order moment formula E(x^2) - E(x)^2 for the variance. This formula is 
> not numerically stable in floating-point arithmetic: it subtracts two large, 
> nearly equal values, which causes catastrophic cancellation. 
> The first-order equation performs better. 
> Moreover, MLUtils is not the right place to hold standard statistics methods; 
> it would be more suitable to put them in VectorRDDFunctions. Other affected 
> machine learning algorithms should also be refined.
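
To illustrate the cancellation problem described above, here is a minimal sketch in plain Python (not Spark code). It compares the second-order formula with a single-pass running update commonly known as Welford's method. Treating Welford's method as the "first-order" approach is an assumption; the issue does not name a specific algorithm. The function names and the sample data are hypothetical and chosen only for illustration.

```python
def naive_variance(xs):
    """Population variance via E(x^2) - E(x)^2 (prone to cancellation)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    # Both terms are about 1e18 here, so their small difference is lost.
    return mean_sq - mean * mean


def welford_variance(xs):
    """Population variance via a single-pass running update (Welford)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n if n else float("nan")


# Hypothetical data: small spread around a large offset.
# The exact population variance is 22.5.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

if __name__ == "__main__":
    print("naive  :", naive_variance(data))
    print("welford:", welford_variance(data))
```

With this data the running update returns 22.5, while the second-order formula returns a value far from it, because each squared term exceeds the range in which doubles can represent every integer exactly.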



--
This message was sent by Atlassian JIRA
(v6.2#6252)
