Re: computeStats() in MLUtils will cause Nan (not a number) error

尹绪森 Tue, 28 Jan 2014 18:34:33 -0800

Yep, thanks Xiangrui. That was my fault: I wrote a naive function to
transform my sparse input into a dense one so I could use the MLlib
interface, and I forgot to remove the all-zero columns. It really is a pitfall.
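
For anyone hitting the same pitfall, here is a minimal sketch of the missing preprocessing step. It assumes the densified data is a list of equal-length Python lists; `drop_zero_columns` is a hypothetical helper for illustration, not part of MLlib.

```python
def drop_zero_columns(rows):
    """Drop columns that are zero in every row.

    Returns the filtered rows and the indices of the columns that were kept,
    so the same selection can be applied to new data later.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    # Keep a column only if at least one row has a non-zero value in it.
    kept = [j for j in range(n_cols) if any(row[j] != 0.0 for row in rows)]
    filtered = [[row[j] for j in kept] for row in rows]
    return filtered, kept


rows = [[1.0, 0.0, 3.0],
        [2.0, 0.0, 0.0]]
filtered, kept = drop_zero_columns(rows)
```

Note that this only removes all-zero columns. Any other constant column also has a standard deviation of zero, so the normalization step would still need its own guard.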



2014-01-29 Xiangrui Meng <[email protected]>

> It happens when there are empty columns. Adding a very small smoothing
> factor should help. Btw, I noticed that the variance computation there
> is not numerically stable; it should use the stable method implemented
> in RDD[Double]. -Xiangrui
>
> On Tue, Jan 28, 2014 at 5:22 AM, yinxusen <[email protected]> wrote:
> > Hi all,
> >
> > I have been testing Lasso and ridge regression in MLlib, and I ran into
> > a Double.NaN error, while the other classification and regression
> > methods work well.
> >
> > It turns out that Lasso and RidgeRegression call the computeStats()
> > function to compute the mean and SD (standard deviation) for normalizing
> > the input data. However, some of the returned SDs are zero, so the
> > normalization hits 0.0 / 0.0 and produces NaN.
> >
> > How about setting the result directly to zero if both the divisor and the
> > dividend are zero, and adding a smoothing factor (e.g. 1.0e-10) if the
> > dividend alone is zero? Or does anyone have a better idea?
> >
> > Thanks!
>
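
The two suggestions in the quoted thread (guarding against a zero SD, and computing the variance in a numerically stable way) can be sketched together. This is an illustration in plain Python, not the MLlib code: `column_stats` and `standardize` are hypothetical names, the variance uses Welford's one-pass update as one well-known stable method, and treating any SD at or below `eps` as zero is an assumption of this sketch.

```python
import math


def column_stats(rows):
    """Per-column mean and population SD using Welford's one-pass update."""
    n_cols = len(rows[0])
    count = 0
    mean = [0.0] * n_cols
    m2 = [0.0] * n_cols  # running sum of squared deviations from the mean
    for row in rows:
        count += 1
        for j, x in enumerate(row):
            delta = x - mean[j]
            mean[j] += delta / count
            m2[j] += delta * (x - mean[j])
    std = [math.sqrt(v / count) for v in m2]
    return mean, std


def standardize(rows, eps=1.0e-10):
    """Scale each column to zero mean and unit SD.

    A column whose SD is at most eps is constant, so every (x - mean) in it
    is zero as well; it is mapped to 0.0 instead of evaluating 0.0 / 0.0.
    """
    mean, std = column_stats(rows)
    return [
        [(x - m) / s if s > eps else 0.0 for x, m, s in zip(row, mean, std)]
        for row in rows
    ]


data = [[1.0, 0.0],
        [3.0, 0.0]]
scaled = standardize(data)
```

The second column of `data` is empty (all zeros), which is the case that triggered the NaN in the thread; here it comes out as zeros.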



-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
