Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-28 Thread DB Tsai
Hi David, I got most of the stuff working, and the loss is monotonically decreasing when I get the history from the iterator of state. However, in the costFun, I need to know what the current iteration is for miniBatch, which means that for one iteration, if the optimizer calls costFun several times for line s

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-28 Thread David Hall
That's right. FWIW, caching should be automatic now, but it might be that the version of Breeze you're using doesn't do that yet. Also, in breeze.util._ there's an implicit that adds a tee method to iterator, and also a last method. Both are useful for things like this. -- David On Sun, Apr 27, 2014
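
The idea in this thread (record the loss of every optimizer state as it streams past, then keep only the final state) can be illustrated without Breeze. What follows is a minimal sketch in plain Python, not the Breeze API: tee_each, last and gradient_descent_states are hypothetical names made up for this example, and the toy objective f(x) = (x - 3)^2 is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def tee_each(items: Iterable[T], side_effect: Callable[[T], None]) -> Iterator[T]:
    """Pass each item to side_effect, then yield it unchanged (hypothetical helper)."""
    for item in items:
        side_effect(item)
        yield item


def last(items: Iterable[T]) -> T:
    """Consume the iterable and return its final element (hypothetical helper)."""
    sentinel = object()
    result = sentinel
    for item in items:
        result = item
    if result is sentinel:
        raise ValueError("cannot take the last element of an empty iterator")
    return result


def gradient_descent_states(
    x0: float, step: float, n_iter: int
) -> Iterator[Tuple[int, float, float]]:
    """Toy optimizer (an assumption): yields (iteration, x, loss) for f(x) = (x - 3)^2."""
    x = x0
    for i in range(n_iter):
        loss = (x - 3.0) ** 2
        yield i, x, loss
        x -= step * 2.0 * (x - 3.0)  # gradient of (x - 3)^2 is 2 * (x - 3)


# Record the loss of every state while the iterator is being drained.
loss_history: List[float] = []
states = gradient_descent_states(x0=0.0, step=0.1, n_iter=20)
final_state = last(tee_each(states, lambda s: loss_history.append(s[2])))
```

Because the history is taken from the stream of states rather than from inside the cost function, it gets one entry per iteration regardless of how many times a line search evaluates the cost function within that iteration.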

Re: thoughts on spark_ec2.py?

2014-04-28 Thread Art Peel
Thanks for the info and good luck with 1.0. Regards, Art On Fri, Apr 25, 2014 at 9:48 AM, Andrew Or wrote: > Hi Art, > > First of all thanks a lot for your PRs. We are currently in the middle of > all the Spark 1.0 release so most of us are swamped with the more core > features. To answer you

Re: Parsing wikipedia xml data in Spark

2014-04-28 Thread Geoffroy Fouquier
We did it using Scala XML with Spark. We start by creating an RDD in which each page is stored as a single line: - split the XML dump with xml_split - process each split with a shell script which removes the "xml_split" tag and the siteinfo section, and puts each page on a single line. - copy resu
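
Once every page sits on its own line, each line can be parsed independently, which is what makes the per-record map step straightforward. The message describes doing this with Scala XML on Spark; the sketch below swaps in Python's standard library and a plain list in place of an RDD, purely to illustrate the per-line step. The function name parse_page and the sample lines are hypothetical, and the sketch assumes each line is a complete <page> element with no XML namespace prefix.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def parse_page(line: str) -> Optional[Tuple[str, str]]:
    """Parse one single-line <page> element into (title, text).

    Returns None when the line is not well-formed XML or has no title,
    so that bad records can be filtered out instead of failing the job.
    """
    try:
        page = ET.fromstring(line)
    except ET.ParseError:
        return None
    title = page.findtext("title")
    if title is None:
        return None
    # In the Wikipedia dump layout the article body is under revision/text.
    text = page.findtext("revision/text") or ""
    return title, text


# Hypothetical sample input: one well-formed page and one truncated line.
lines: List[str] = [
    "<page><title>Apache Spark</title>"
    "<revision><text>Cluster computing.</text></revision></page>",
    "<page><title>Broken",
]

# With Spark, the same function would be applied with a map over the lines.
pages = [p for p in map(parse_page, lines) if p is not None]
```

Returning None for malformed lines and filtering afterwards keeps a single bad record from aborting the whole pass over the dump.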