CTR == Clickthrough Rate

On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <[email protected]> wrote:

> On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]> wrote:
>
>> ... As I understood the algorithm, the time in training only relies on the
>> non-zero records, but per our test, there would be some overhead we could
>> not ignore for those non-zero records, though the cost is sub-linear or
>> logarithmic in the length of the hashed vector.
>>
>
> This is pretty close if we say "non-zero values". A record usually refers
> to an entire training example.
>
> The extra work refers mostly to deferred regularization that eventually
> has to be applied. My guess is that it is even less than log in the
> feature vector size.
>
>
>> And in CTR prediction, I am not quite sure it will converge very quickly.
>>
>
> I was saying this purely based on the number of features.
>
>
>> Because we will very possibly see some records that have almost the same
>> features but different results in display ads.
>
>
> The algorithm can still converge to an estimate of the probability here.
>
>
>> But we will see the result in the future.
>
>
> You have to be *very* careful about this to avoid prejudicing the model
> against recent impressions. If you have fast feedback to the ad targeting
> system, you can have severe instability.
>
> The key thing that you have to do to avoid these biases is to define a
> maximum delay before click for the purposes of modeling. You need to
> ignore all impressions younger than this delay (because they may still get
> a click) and you need to ignore all clicks after this delay (to avoid bias
> in favor of old impressions). For on-line ads you can probably use a
> maximum delay of a few minutes because most clicks will happen by then.
>
> To find a good value for maximum delay, you should plot the CTR for a
> bunch of ads versus delay. This will increase rapidly shortly after zero
> delay, but then will level off. The ordering of ads by CTR is what you
> care about, so you can follow the curves back and find the shortest delay
> where the ordering is clearly preserved. Use that as your maximum delay.
> Typically this is roughly where your CTR is at about 80-90% of the final
> value.
>
>
>> (We were still working on creating a framework to dig all the features we
>> need from the logs. I would like to share our experience with using
>> Mahout SGD once we get our CTR prediction model released.)
>>
>> And for parallelizing SGD, what do you mean by "help with sparse inputs
>> that exhibit a long-tail frequency distribution"? Would you like to share
>> some of your ideas, Ted?
>>
>> Currently, what I can think of is to split the data randomly across
>> multiple mappers, let every mapper learn from its local data, and average
>> over the whole model, or let multiple models vote for every feature's
>> weight. A little like the idea of AdaBoost or RandomForest. But I am not
>> a scientist or mathematician, so no idea if it is correct or not.
>>
>>
>> Thanks so much.
>> Stanley Xu
>>
>>
>>
>> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]>
>> > wrote:
>> >
>> > > 1 hour is acceptable, but I guess you misunderstood the data scale I
>> > > mean here. The 900M records didn't mean 900M bytes, but 900M lines of
>> > > training data (900M training examples). If every training example has
>> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we
>> > > reduce the logs collected to 14 days, it would still be 2-3 TB of
>> > > data.
>> > >
>> >
>> > Oops. Forgot that last multiplier.
>> >
>> >
>> > > Per our simple test, for 1000 dimensions and 10M lines of records, it
>> > > will take about 1-2 hours to do the training, so 900M lines of data
>> > > will cost at least 90 hours. Is that correct?
>> > >
>> >
>> > 10M x 1000 x 8 = 80 GB.
>> >
>> > 1-2 hours = (approx) 5000 seconds. So this is
>> >
>> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
>> >
>> > Yes. This is reasonable speed. I think you can get a small factor
>> > faster than this with SGD. I have seen 100 million records with more
>> > non-zero values than you describe with a training time of 3 hours. I
>> > would not expect even as much as a factor of 10 speedup here.
>> >
>> >
>> > > And from the PPT you provided
>> > > http://www.slideshare.net/tdunning/sdforum-11042010
>> > > you said it would take less than an hour for 20M data records with
>> > > mixed numeric/categorical dimensions. I am wondering, how many
>> > > dimensions per record?
>> > >
>> >
>> > These are sparse records with about a thousand non-zero elements per
>> > record.
>> >
>> >
>> > But let's step back to your data for a moment. Where do these thousand
>> > dimensions come from? Do you really have a thousand hand-built
>> > features? Do you not have any sparse, text-like features?
>> >
>> > If you really only have a thousand dimensional problem, then I think
>> > your model might exhibit early convergence.
>> >
>> > If not, it is quite possible to parallelize SGD, but this is only
>> > likely to help with sparse inputs that exhibit a long-tail frequency
>> > distribution.
>> >
>>
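
A minimal sketch (plain Java, not from the thread) of the maximum-delay
labeling Ted describes above; the 5-minute window, field layout, and names
are assumptions for illustration only:

// Label impressions using a fixed maximum delay before click.
public class DelayLabeler {
  // Chosen from the CTR-vs-delay curves; a few minutes per Ted's suggestion.
  static final long MAX_DELAY_MS = 5 * 60 * 1000L;

  // Returns 1 (click) or 0 (no click), or -1 if the impression must be
  // skipped because it is still younger than the maximum delay.
  static int label(long impressionMillis, long clickMillis, long nowMillis) {
    if (nowMillis - impressionMillis < MAX_DELAY_MS) {
      return -1;                 // too young: it may still get a click
    }
    if (clickMillis >= 0 && clickMillis - impressionMillis <= MAX_DELAY_MS) {
      return 1;                  // click arrived within the window
    }
    return 0;                    // no click, or the click came too late
  }
}

And a rough sketch of the model-averaging idea Stanley mentions above: each
mapper trains its own SGD model on a random shard and emits a dense weight
vector, which a single reducer averages. The double[] representation is an
assumption, and whether plain averaging works well here is exactly the open
question in the thread:

import java.util.List;

public class ModelAverager {
  // Element-wise average of the per-mapper weight vectors.
  public static double[] average(List<double[]> perMapperWeights) {
    int dim = perMapperWeights.get(0).length;
    double[] avg = new double[dim];
    for (double[] weights : perMapperWeights) {
      for (int i = 0; i < dim; i++) {
        avg[i] += weights[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      avg[i] /= perMapperWeights.size();
    }
    return avg;
  }
}

The per-feature "voting" variant Stanley mentions would replace the average
with, say, a median over the per-mapper weights.
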
--
Lance Norskog
[email protected]
