Hi Ted,

For the data, we currently dig through the logs for a specific cookie. For example, we check how many times that user has seen the advertiser's banner in the last 7 days. We don't have 1000 non-zero values yet; I think we only have 100-200 right now, but I expect 1000 at most. We won't have many text-like features, because we are predicting for display ads rather than search ads. We might introduce a tagging system for ad management in the future, to see whether adding those features gives better results.
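To make that kind of feature concrete, here is a rough sketch of the counting we do per cookie. The class and method names (ImpressionLog, countRecentImpressions) are made up for illustration, not our real pipeline:

import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustration only: our real log schema and field names differ.
public class ImpressionCounter {

  // One parsed log line: which cookie saw which advertiser's banner, and when.
  public static class ImpressionLog {
    final String cookieId;
    final String advertiserId;
    final long timestampMillis;

    public ImpressionLog(String cookieId, String advertiserId, long timestampMillis) {
      this.cookieId = cookieId;
      this.advertiserId = advertiserId;
      this.timestampMillis = timestampMillis;
    }
  }

  // Count how many times this cookie saw this advertiser's banner in the last 7 days.
  public static int countRecentImpressions(List<ImpressionLog> logs, String cookieId,
                                           String advertiserId, long nowMillis) {
    long windowStart = nowMillis - TimeUnit.DAYS.toMillis(7);
    int count = 0;
    for (ImpressionLog record : logs) {
      if (record.timestampMillis >= windowStart
          && record.cookieId.equals(cookieId)
          && record.advertiserId.equals(advertiserId)) {
        count++;
      }
    }
    return count;
  }
}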
I understand that in the algorithm the training time depends only on the non-zero entries, but per our test there is still some overhead we cannot ignore, though the cost is sub-linear or logarithmic in the length of the hashed vector. And for CTR prediction, I am not so sure it will converge very quickly, because in display ads we will very likely see records with almost the same features but different outcomes. We will see the results in the future. (We are still working on a framework to dig all the features we need out of the logs; I would like to share our experience with Mahout SGD once our CTR prediction model is released.)

As for parallelizing SGD, what do you mean by it only being likely to help with sparse inputs that exhibit a long-tail frequency distribution? Would you like to share some of your ideas, Ted? What I can think of at the moment is to split the data randomly across multiple mappers, let every mapper learn from its local data, and then average over the whole model, or let multiple models vote on every feature's weight, a bit like the idea behind AdaBoost or random forests. (A rough sketch of what I mean is appended below the quoted mail.) But I am not a scientist or mathematician, so I have no idea whether that is correct.

Thanks so much.
Stanley Xu

On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]> wrote:
> On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]> wrote:
>
> > 1 hour is acceptable, but I guess you misunderstand the data scale I mean
> > here. The 900M records didn't mean 900M bytes, but 900M lines of training
> > data (900M training examples). If every training example has 1000 dimensions,
> > that means 900 million x 1000 x 16 B = 14 TB. If we reduce the logs collected
> > to 14 days, it would still be 2-3 TB of data.
> >
> Oops. Forgot that last multiplier.
>
> > Per our simple test, for 1000 dimensions and 10M lines of records, it takes
> > about 1-2 hours to do the training, so 900M lines of data will cost at least
> > 90 hours, is that correct?
> >
> 10M x 1000 x 8 = 80 GB.
>
> 1-2 hours = (approx) 5000 seconds. So this is
>
> 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
>
> Yes. This is a reasonable speed. I think you can get a small factor faster
> than this with SGD. I have seen 100 million records with more non-zero
> values than you describe with a training time of 3 hours. I would not
> expect even as much as a factor of 10 speedup here.
>
> > And from the PPT you provided,
> > http://www.slideshare.net/tdunning/sdforum-11042010
> > you said it would take less than an hour for 20M data records with
> > numeric/category mixed dimensions. I am wondering, how many dimensions per
> > record?
> >
> These are sparse records with about a thousand non-zero elements per
> record.
>
> But let's step back to your data for a moment. Where do these thousand
> dimensions come from? Do you really have a thousand hand-built features?
> Do you not have any sparse, text-like features?
>
> If you really only have a thousand dimensional problem, then I think your
> model might exhibit early convergence.
>
> If not, it is quite possible to parallelize SGD, but this is only likely to
> help with sparse inputs that exhibit a long-tail frequency distribution.
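Here is the rough sketch I mentioned above of the split/train/average idea. It is a toy logistic regression over hashed features in plain Java, not the Mahout classes, and all the class and method names are made up; it is only meant to show the shape of what I have in mind, not a real implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy illustration of "split the data, train SGD per mapper, average the models".
public class AveragedSgdSketch {

  static final int DIMENSIONS = 1000;      // length of the hashed feature vector
  static final double LEARNING_RATE = 0.01;

  // A training example: sparse non-zero feature indices (all valued 1.0) and a 0/1 label.
  static class Example {
    final int[] nonZeroIndices;
    final int label;
    Example(int[] nonZeroIndices, int label) {
      this.nonZeroIndices = nonZeroIndices;
      this.label = label;
    }
  }

  // One "mapper": plain logistic-regression SGD over its local partition of the data.
  static double[] trainLocal(List<Example> partition) {
    double[] weights = new double[DIMENSIONS];
    for (Example ex : partition) {
      double score = 0.0;
      for (int i : ex.nonZeroIndices) {
        score += weights[i];
      }
      double predicted = 1.0 / (1.0 + Math.exp(-score));
      double gradient = ex.label - predicted;
      for (int i : ex.nonZeroIndices) {
        weights[i] += LEARNING_RATE * gradient;   // only non-zero entries are touched
      }
    }
    return weights;
  }

  // The "reducer": average the per-mapper models into one model.
  static double[] averageModels(List<double[]> models) {
    double[] averaged = new double[DIMENSIONS];
    for (double[] model : models) {
      for (int i = 0; i < DIMENSIONS; i++) {
        averaged[i] += model[i] / models.size();
      }
    }
    return averaged;
  }

  public static void main(String[] args) {
    // Fake data, just so the sketch runs: random sparse examples split into 4 partitions.
    Random random = new Random(42);
    List<List<Example>> partitions = new ArrayList<>();
    for (int p = 0; p < 4; p++) {
      List<Example> partition = new ArrayList<>();
      for (int n = 0; n < 10000; n++) {
        int[] indices = {random.nextInt(DIMENSIONS), random.nextInt(DIMENSIONS)};
        int label = (indices[0] % 2 == 0) ? 1 : 0;
        partition.add(new Example(indices, label));
      }
      partitions.add(partition);
    }

    List<double[]> localModels = new ArrayList<>();
    for (List<Example> partition : partitions) {
      localModels.add(trainLocal(partition));   // in reality each mapper would do this
    }
    double[] finalModel = averageModels(localModels);
    System.out.println("First averaged weight: " + finalModel[0]);
  }
}

The open question for me is whether simply averaging (or voting over) the per-mapper weights loses too much compared with running SGD sequentially over all of the data.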
