Hi Ted,

1 hour is acceptable, but I guess you misunderstood the data scale I meant here. The 900M records didn't mean 900M bytes, but 900M lines of training data (900M training examples). If every training example has 1000 dimensions, that means 900 million x 1000 x 16 B = 14 TB. If we reduce the logs collected to 14 days, it would still be 2-3 TB of data.
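As a quick check on that arithmetic, here is a throwaway Python sketch (assuming dense storage at the 16 bytes per value used above; the record and dimension counts are the ones from this thread):

records = 900_000_000        # 900M training examples over 90 days
dims = 1000                  # feature dimensions per example
bytes_per_value = 16         # the per-value size assumed above

total_bytes = records * dims * bytes_per_value
print(total_bytes / 1e12)            # ~14.4, i.e. roughly 14 TB for 90 days
print(total_bytes * 14 / 90 / 1e12)  # ~2.2, i.e. the 2-3 TB for 14 days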
Per our simple test, with 1000 dimensions and 10M lines of records, it takes
about 1-2 hours to do the training, so 900M lines of data would cost at least
90 hours. Is that correct?

And in the slides you provided, http://www.slideshare.net/tdunning/sdforum-11042010,
you said it would take less than an hour for 20M data records with mixed
numeric/categorical dimensions. I am wondering, how many dimensions per
record?

Thanks,
Stanley Xu

On Tue, Apr 26, 2011 at 2:05 PM, Ted Dunning <[email protected]> wrote:

> How much time do you have available for training?
>
> If you can do feature encoding in parallel, then you can probably do this
> pretty fast with SGD.
>
> My guess is that you can push 2-20 MB/s of data through SGD with your kind
> of data with a good 8 core processor. If you pre-process your data into 8
> B / dimension, this is 0.25 - 2.5 million data points per second. This
> could mean that your training takes less than an hour. If your training
> converges with less data, you may do even better.
>
> Is that not acceptable?
>
> On Mon, Apr 25, 2011 at 10:11 PM, Stanley Xu <[email protected]> wrote:
>
> > Thanks Ted. I read the paper and the code and got a rough idea of how
> > the iteration goes. Thanks so much.
> >
> > With the current data scale we have, we were considering whether we
> > could train more data with the Logistic Regression. For example, if we
> > wanted to train a model for CTR prediction on the last 90 days of data,
> > it would be 900M records after down-sampling, and assume there are 1000
> > feature dimensions. It would still be very slow on a single machine
> > with the current SGD algorithm.
> >
> > I am wondering whether there is a parallel algorithm with map-reduce I
> > could use for Logistic Regression? The original Newton-Raphson takes
> > N*N*M/P according to the "Map-Reduce for Machine Learning on Multicore"
> > paper, which is much slower than SGD on a single machine in a
> > high-dimension space.
> >
> > Could an algorithm like IRLS be parallelized, or is there any
> > approximate algorithm that could be parallelized?
> >
> > Thanks,
> > Stanley Xu
> >
> > On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > Paul K described in-memory algorithms in his dissertation. Mahout
> > > uses on-line algorithms which are not limited by memory size.
> > >
> > > The method used in Mahout is closer to what Bob Carpenter describes
> > > here: http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
> > >
> > > The most important additions in Mahout are:
> > >
> > > a) confidence-weighted learning rates per term
> > >
> > > b) evolutionary tuning of hyper-parameters
> > >
> > > c) mixed ranking and regression
> > >
> > > d) grouped AUC
> > >
> > > On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <[email protected]>
> > > wrote:
> > >
> > > > Dear All,
> > > >
> > > > I am trying to go through the Mahout SGD algorithm and to read
> > > > "Logistic Regression for Data Mining and High-Dimensional
> > > > Classification" a little bit. I am wondering which algorithm is
> > > > exactly used in the SGD code? There are quite a few algorithms
> > > > mentioned in the paper, and it is a little hard for me to find out
> > > > which one matches the code.
> > > >
> > > > Thanks in advance.
> > > >
> > > > Best wishes,
> > > > Stanley Xu
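For reference, the per-example update that dominates this kind of training looks roughly like the Python sketch below. This is plain SGD on the log-loss, for illustration only; it is not Mahout's OnlineLogisticRegression, which adds the per-term (confidence-weighted) learning rates and hyper-parameter evolution Ted lists above. Because each record is visited once per pass, cost grows linearly with the number of lines, which is why a 10M-lines-in-1-2-hours measurement extrapolates to roughly 90-180 hours for 900M lines on a single machine.

import math

def sgd_logistic_pass(examples, dims, lr=0.01):
    """One sequential pass of plain SGD for logistic regression over
    sparse examples. A minimal sketch of the technique being timed in
    this thread, not Mahout's actual implementation."""
    w = [0.0] * dims
    for y, x in examples:                     # y in {0, 1}; x = {index: value}
        score = sum(w[i] * v for i, v in x.items())
        score = max(-30.0, min(30.0, score))  # clamp to avoid overflow in exp
        p = 1.0 / (1.0 + math.exp(-score))    # predicted click probability
        for i, v in x.items():                # update only non-zero features
            w[i] += lr * (y - p) * v
    return w

# Toy usage: two sparse examples with 1000 dimensions, as in the thread.
examples = [(1, {3: 1.0, 17: 0.5}), (0, {3: 0.2, 999: 1.0})]
weights = sgd_logistic_pass(examples, dims=1000)

Parallelizing the feature encoding, as Ted suggests, helps precisely because this inner update is cheap but inherently sequential per example.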
