Yep, thanks Xiangrui. That's my fault: I wrote a naive function to transform my sparse input into a dense one so that I could use the MLlib interface, and I forgot to remove the all-zero columns. It's really a pitfall.
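
For anyone who hits the same thing, here is a minimal sketch of the pitfall and one possible guard. It is plain Python, not the MLlib code: `column_stats`, `standardize`, and the `eps` threshold are hypothetical names chosen for illustration, and the use of the population standard deviation is an assumption. An all-zero column has a standard deviation of zero, so dividing by it gives 0.0 / 0.0, which is NaN on the JVM (plain Python would raise ZeroDivisionError instead). The sketch maps such columns to 0.0 and computes the variance with Welford's one-pass update, which is one numerically stable method and not necessarily the one used in RDD[Double].

```python
import math


def column_stats(rows):
    """Per-column mean and population standard deviation.

    Uses Welford's one-pass update, which avoids the cancellation
    problems of the naive sum-of-squares formula.
    Assumes `rows` is a non-empty list of equal-length lists of floats.
    """
    n_cols = len(rows[0])
    means = [0.0] * n_cols
    m2 = [0.0] * n_cols  # running sum of squared deviations per column
    for count, row in enumerate(rows, start=1):
        for j, x in enumerate(row):
            delta = x - means[j]
            means[j] += delta / count
            m2[j] += delta * (x - means[j])
    stds = [math.sqrt(s / len(rows)) for s in m2]
    return means, stds


def standardize(rows, eps=1.0e-10):
    """Center and scale each column; constant columns become 0.0.

    `eps` is a hypothetical threshold: any column whose standard
    deviation is below it is treated as constant, so the 0.0 / 0.0
    case is never evaluated.
    """
    means, stds = column_stats(rows)
    return [
        [0.0 if stds[j] < eps else (x - means[j]) / stds[j]
         for j, x in enumerate(row)]
        for row in rows
    ]


if __name__ == "__main__":
    # The second column is all zeros, as in my dense-converted input.
    data = [[1.0, 0.0], [3.0, 0.0]]
    print(standardize(data))
```

Dropping the all-zero columns before training would work as well; the guard just keeps the normalization step from producing invalid values when such a column slips through.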

2014-01-29 Xiangrui Meng <[email protected]>

> It happens when there are empty columns. Adding a very small smoothing
> factor should help. Btw, I notice that the computation of variance
> there is not stable, which should use the stable method implemented in
> RDD[Double]. -Xiangrui
>
> On Tue, Jan 28, 2014 at 5:22 AM, yinxusen <[email protected]> wrote:
> > Hi all,
> >
> > These days I test Lasso and ridge regression in MLlib, and I find an error
> > of Double.Nan. While other classification and regression methods do very
> > well.
> >
> > Finally I find that Lasso and RidgeRegression call computeStats() function
> > to compute mean and SD (standard deviation) for normalizing input data.
> > However, some returned SDs are zeroes. So when encountering 0.0 / 0.0, there
> > will be a Nan error.
> >
> > How about setting directly to zero if both the divisor and dividend are
> > zeroes, and adding a smoothing factor (e.g. 1.0e-10) if the dividend alone
> > is zero? Or anyone have better ideas ?
> >
> > Thanks !
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/computeStats-in-MLUtils-will-cause-Nan-not-a-number-error-tp980.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Best Regards
-----------------------------------
Xusen Yin 尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
