On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]> wrote:
> ... I understood as the algorithm, the time in training only relies on the
> non-zero records, but per our test, there would be some overhead we could
> not ignore for those non-zero records, though the cost is sub-linear or
> logarithmic in the length of the hashed vector.

This is pretty close if we say "non-zero values". A record usually refers to
an entire training example. The extra work refers mostly to deferred
regularization that eventually has to be applied. My guess is that it is even
less than log in the feature vector size.

> And in CTR prediction, I am not pretty sure it will converge very quickly.

I was saying this purely based on the number of features.

> Because we will very possibly see some records that have almost the same
> features but different results in display ads.

The algorithm can still converge to an estimate of the probability here.

> But we will see the result in the future.

You have to be *very* careful about this to avoid prejudicing the model
against recent impressions. If you have fast feedback to the ad targeting
system, you can get severe instability.

The key thing that you have to do to avoid these biases is to define a
maximum delay before click for the purposes of modeling. You need to ignore
all impressions younger than this delay (because they may still get a click)
and you need to ignore all clicks that arrive after this delay (to avoid bias
in favor of old impressions). For on-line ads you can probably use a maximum
delay of a few minutes because most clicks will happen by then.

To find a good value for the maximum delay, plot the CTR for a bunch of ads
versus delay. The CTR will increase rapidly shortly after zero delay, but
then level off. The ordering of ads by CTR is what you care about, so you can
follow the curves back and find the shortest delay at which the ordering is
clearly preserved. Use that as your maximum delay. Typically this is roughly
where the CTR is at about 80-90% of its final value.
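A minimal sketch of that labeling rule is below. The class, the method, and
the five-minute window are illustrative assumptions, not Mahout API; the
resulting 0/1 target would simply be handed to whatever SGD learner you use
(for example, something like OnlineLogisticRegression.train in Mahout).

import java.util.Optional;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the labeling rule above: fix a maximum delay, skip impressions
 * younger than that delay, and ignore clicks that arrive after it.
 * All names and the window length are illustrative, not Mahout API.
 */
public class ClickLabeler {

  // Assumed window; in practice read it off the CTR-versus-delay curve.
  private static final long MAX_DELAY_MS = TimeUnit.MINUTES.toMillis(5);

  /**
   * @param impressionTime when the ad was shown (epoch millis)
   * @param clickTime      when it was clicked, or empty if no click was seen
   * @param now            the current time (epoch millis)
   * @return empty if the impression is still too young to judge, otherwise
   *         a 0/1 target for the SGD learner
   */
  public Optional<Integer> label(long impressionTime, Optional<Long> clickTime, long now) {
    if (now - impressionTime < MAX_DELAY_MS) {
      // Too recent: a click may still arrive, so do not train on it yet.
      return Optional.empty();
    }
    boolean clickedInWindow = clickTime.isPresent()
        && clickTime.get() - impressionTime <= MAX_DELAY_MS;
    // Clicks that arrive after the window are deliberately ignored to avoid
    // biasing the model in favor of old impressions.
    return Optional.of(clickedInWindow ? 1 : 0);
  }
}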
> (We were still working on creating a framework to dig out all the features
> we need from the log. I would like to share our experience with Mahout SGD
> once we get our CTR prediction model released.)
>
> And for parallelizing SGD, what do you mean by "help with sparse inputs
> that exhibit a long-tail frequency distribution"? Would you like to share
> some of your ideas, Ted?
>
> Currently, what I can think of is to split the data randomly across
> multiple mappers, let every mapper learn from its local data, and then
> average over the whole model, or let multiple models vote for every
> feature's weight. A little like the idea of AdaBoost or RandomForest. But I
> am not a scientist or mathematician, so I have no idea if it is correct or
> not.
>
> Thanks so much.
> Stanley Xu
>
> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]> wrote:
>
> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]> wrote:
> >
> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
> > > mean here. The 900M records didn't mean 900M bytes, but 900M lines of
> > > training set (900M training examples). If every training example has
> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we
> > > reduce the logs collected to 14 days, it would still be 2-3 TB of data.
> >
> > Oops. Forgot that last multiplier.
> >
> > > Per our simple test, for 1000 dimensions and 10M lines of records, it
> > > will take about 1-2 hours to do the training, so 90M lines of data will
> > > cost at least 90 hours, is that correct?
> >
> > 10M x 1000 x 8 = 80 GB.
> >
> > 1-2 hours = (approx) 5000 seconds. So this is
> >
> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
> >
> > Yes. This is reasonable speed. I think you can get a small factor faster
> > than this with SGD. I have seen 100 million records with more non-zero
> > values than you describe with a training time of 3 hours. I would not
> > expect even as much as a factor of 10 speedup here.
> >
> > > And from the PPT you provided
> > > http://www.slideshare.net/tdunning/sdforum-11042010
> > > You said it would take less than an hour for 20M data records with
> > > mixed numeric/categorical dimensions. I am wondering, how many
> > > dimensions per record?
> >
> > These are sparse records with about a thousand non-zero elements per
> > record.
> >
> > But let's step back to your data for a moment. Where do these thousand
> > dimensions come from? Do you really have a thousand hand-built features?
> > Do you not have any sparse, text-like features?
> >
> > If you really only have a thousand dimensional problem, then I think your
> > model might exhibit early convergence.
> >
> > If not, it is quite possible to parallelize SGD, but this is only likely
> > to help with sparse inputs that exhibit a long-tail frequency
> > distribution.
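For the "split the data across mappers, train locally, then average" idea in
the quoted question, a bare-bones sketch of the combining step could look
like the following. It uses plain arrays rather than any particular Mahout
class, and all names are made up.

/**
 * Sketch of the "each mapper learns on its own shard, then average the
 * models" idea from the quoted question. Plain arrays are used instead of
 * any particular Mahout class; all names are illustrative.
 */
public class ModelAverager {

  /** Averages several weight vectors of equal length into a single model. */
  public static double[] average(double[][] perMapperWeights) {
    if (perMapperWeights.length == 0) {
      throw new IllegalArgumentException("need at least one model");
    }
    int dim = perMapperWeights[0].length;
    double[] combined = new double[dim];
    for (double[] weights : perMapperWeights) {
      if (weights.length != dim) {
        throw new IllegalArgumentException("all models must have the same dimension");
      }
      for (int i = 0; i < dim; i++) {
        combined[i] += weights[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      combined[i] /= perMapperWeights.length;
    }
    return combined;
  }
}

Each mapper would run its own SGD pass over its shard and emit its weight
vector; a single reducer would then combine them with something like
average(). Whether the averaged model is as good as one trained sequentially
is exactly the question raised in the quoted text.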
