After a quick skimming of the paper, it looks as though, if you reduce this to learning logistic regression, you get something roughly the same as feature sharding (which is still a good idea). With matrices, of course, you have two ways to shard, not just one.
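To make the "two ways to shard" point concrete, here is a very rough single-machine sketch of the row-and-column blocking the paper describes. Toy code with my own names and a plain SGD update, not the paper's implementation:

import numpy as np

# Very rough DSGD-style sketch (illustrative, not the paper's code): the
# rating matrix is cut into d x d blocks; within one "stratum" the d chosen
# blocks share no rows and no columns, so their SGD updates are independent
# and could run on d workers at the same time.

def dsgd_factorize(R, rank=3, d=3, epochs=300, lr=0.02, reg=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    row_bounds = np.linspace(0, n_rows, d + 1, dtype=int)
    col_bounds = np.linspace(0, n_cols, d + 1, dtype=int)

    for _ in range(epochs):
        for s in range(d):                       # one stratum per sub-epoch
            for i in range(d):                   # these d blocks are disjoint
                j = (i + s) % d
                r0, r1 = row_bounds[i], row_bounds[i + 1]
                c0, c1 = col_bounds[j], col_bounds[j + 1]
                for r in range(r0, r1):
                    for c in range(c0, c1):
                        err = R[r, c] - U[r] @ V[c]
                        u_old = U[r].copy()
                        U[r] += lr * (err * V[c] - reg * U[r])
                        V[c] += lr * (err * u_old - reg * V[c])
    return U, V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    R = rng.standard_normal((12, 3)) @ rng.standard_normal((3, 9))  # rank-3 toy matrix
    U, V = dsgd_factorize(R)
    print("reconstruction RMSE:", np.sqrt(np.mean((R - U @ V.T) ** 2)))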
On Tue, May 31, 2011 at 7:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Interesting.
> I'd probably be interested to try it out.
>
> On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <[email protected]> wrote:
> > Thanks Ted and Lance. And sorry for the jargon.
> >
> > For the delay Ted mentioned, we have already considered that; still, thanks
> > a lot for all the detailed ideas, they were pretty helpful.
> > For parallelized SGD, I just found a new paper about using DSGD in matrix
> > factorization. It's different from logistic regression, but might be helpful
> > as well. The title is "Large-Scale Matrix Factorization with Distributed
> > Stochastic Gradient Descent", if anyone is interested.
> >
> > Best wishes,
> > Stanley Xu
> >
> > On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <[email protected]> wrote:
> >
> >> Yes.
> >>
> >> Apologies for jargon and TLAs
> >> <http://en.wikipedia.org/wiki/Three-letter_acronym>
> >>
> >> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <[email protected]> wrote:
> >>
> >> > CTR == Clickthrough Rate
> >> >
> >> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <[email protected]> wrote:
> >> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]> wrote:
> >> > >
> >> > >> ... As I understood the algorithm, the training time only depends on the
> >> > >> non-zero records, but per our test there is some overhead we could not
> >> > >> ignore for those non-zero records, though the cost is sub-linear or
> >> > >> logarithmic in the length of the hashed vector.
> >> > >
> >> > > This is pretty close if we say "non-zero values". A record usually refers
> >> > > to an entire training example.
> >> > >
> >> > > The extra work refers mostly to deferred regularization that eventually
> >> > > has to be applied. My guess is that it is even less than log in the
> >> > > feature vector size.
> >> > >
> >> > >> And in CTR prediction, I am not so sure it will converge very quickly.
> >> > >
> >> > > I was saying this purely based on the number of features.
> >> > >
> >> > >> Because we will very possibly see some records that have almost the same
> >> > >> features but different results in display ads.
> >> > >
> >> > > The algorithm can still converge to an estimate of the probability here.
> >> > >
> >> > >> But we will see the result in the future.
> >> > >
> >> > > You have to be *very* careful about this to avoid prejudicing the model
> >> > > against recent impressions. If you have fast feedback to the ad targeting
> >> > > system, you can have severe instability.
> >> > >
> >> > > The key thing you have to do to avoid these biases is to define a maximum
> >> > > delay before click for the purposes of modeling. You need to ignore all
> >> > > impressions younger than this delay (because they may still get a click)
> >> > > and you need to ignore all clicks after this delay (to avoid bias in favor
> >> > > of old impressions). For on-line ads you can probably use a maximum delay
> >> > > of a few minutes because most clicks will happen by then.
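In code, that maximum-delay rule amounts to something like the sketch below. The field names and the in-memory join are made up for illustration; the real thing would run over the impression and click logs:

from datetime import datetime, timedelta

# Sketch of the "maximum delay" rule described above (illustrative only):
# - drop impressions younger than MAX_DELAY (they may still receive a click)
# - only count a click if it arrives within MAX_DELAY of its impression

MAX_DELAY = timedelta(minutes=5)

def label_impressions(impressions, clicks, now):
    """impressions: {impression_id: timestamp}; clicks: {impression_id: timestamp}.
    Returns (impression_id, clicked) pairs that are old enough to label."""
    labeled = []
    for imp_id, imp_ts in impressions.items():
        if now - imp_ts < MAX_DELAY:
            continue                       # too young; a click may still arrive
        click_ts = clicks.get(imp_id)
        clicked = click_ts is not None and (click_ts - imp_ts) <= MAX_DELAY
        labeled.append((imp_id, int(clicked)))
    return labeled

if __name__ == "__main__":
    now = datetime(2011, 4, 28, 12, 0, 0)
    impressions = {
        "a": now - timedelta(minutes=30),   # old, clicked in time  -> positive
        "b": now - timedelta(minutes=30),   # old, clicked too late -> negative
        "c": now - timedelta(minutes=2),    # too recent            -> withheld
    }
    clicks = {
        "a": impressions["a"] + timedelta(minutes=1),
        "b": impressions["b"] + timedelta(minutes=20),
    }
    print(label_impressions(impressions, clicks, now))
    # [('a', 1), ('b', 0)]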
> >> > > To find a good value for maximum delay, you should plot the CTR for a
> >> > > bunch of ads versus delay. This will increase rapidly shortly after zero
> >> > > delay, but then will level off. The ordering of ads by CTR is what you
> >> > > care about, so you can follow the curves back and find the shortest delay
> >> > > where the ordering is clearly preserved. Use that as your maximum delay.
> >> > > Typically this is roughly where your CTR is at about 80-90% of the final
> >> > > value.
> >> > >
> >> > >> (We are still working on creating a framework to dig out all the
> >> > >> features we need from the logs; I would like to share our experience
> >> > >> with Mahout SGD once we have our CTR prediction model released.)
> >> > >>
> >> > >> And for parallelized SGD, what do you mean by "help with sparse inputs
> >> > >> that exhibit a long-tail frequency distribution"? Would you like to
> >> > >> share some of your ideas, Ted?
> >> > >>
> >> > >> Currently, what I can think of is to split the data across multiple
> >> > >> mappers randomly, let every mapper learn from its local data, and then
> >> > >> average into the whole model, or let the multiple models vote for every
> >> > >> feature's weight. A little like the idea of AdaBoost or RandomForest.
> >> > >> But I am not a scientist or mathematician, so I have no idea whether it
> >> > >> is correct or not.
> >> > >>
> >> > >> Thanks so much.
> >> > >> Stanley Xu
> >> > >>
> >> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]> wrote:
> >> > >>
> >> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]> wrote:
> >> > >> >
> >> > >> > > 1 hour is acceptable, but I guess you misunderstood the data scale I
> >> > >> > > meant here. The 900M records didn't mean 900M bytes, but 900M lines
> >> > >> > > of training set (900M training examples). If every training example
> >> > >> > > has 1000 dimensions, that means 900 million x 1000 x 16 B = 14 TB.
> >> > >> > > If we reduce the logs collected to 14 days, it would still be 2-3 TB
> >> > >> > > of data.
> >> > >> >
> >> > >> > Oops. Forgot that last multiplier.
> >> > >> >
> >> > >> > > Per our simple test, for 1000 dimensions and 10M lines of records, it
> >> > >> > > will take about 1-2 hours to do the training, so 900M lines of data
> >> > >> > > will cost at least 90 hours, is that correct?
> >> > >> >
> >> > >> > 10M x 1000 x 8 = 80 GB.
> >> > >> >
> >> > >> > 1-2 hours = (approx) 5000 seconds. So this is
> >> > >> >
> >> > >> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
> >> > >> >
> >> > >> > Yes. This is reasonable speed. I think you can get a small factor
> >> > >> > faster than this with SGD. I have seen 100 million records with more
> >> > >> > non-zero values than you describe with a training time of 3 hours. I
> >> > >> > would not expect even as much as a factor of 10 speedup here.
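Spelling that back-of-the-envelope calculation out, and scaling the same rate up to the 900M-line data set (same numbers as quoted above, nothing newly measured):

# Back-of-the-envelope check of the throughput figures quoted above
# (reproducing the numbers from the thread; the projection at the end is
# just the same arithmetic scaled up, not a benchmark).

GB = 1_000_000_000
MB = 1_000_000

records = 10_000_000          # 10M training examples
dims = 1000                   # non-zero values per example
bytes_per_value = 8

data_bytes = records * dims * bytes_per_value          # = 80 GB
train_seconds = 5000                                   # "1-2 hours" ~ 5000 s
throughput = data_bytes / train_seconds                # ~ 16 MB/s
print(f"data: {data_bytes / GB:.0f} GB, throughput: {throughput / MB:.0f} MB/s")

# Scaling the same rate to 900M examples of the same width:
full_bytes = 900_000_000 * dims * bytes_per_value
print(f"900M examples at the same rate: {full_bytes / throughput / 3600:.0f} hours")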
> >> > >> > > And from the PPT you provided,
> >> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
> >> > >> > > you said it would take less than an hour for 20M data records with
> >> > >> > > mixed numeric/categorical dimensions. I am wondering, how many
> >> > >> > > dimensions per record?
> >> > >> >
> >> > >> > These are sparse records with about a thousand non-zero elements per
> >> > >> > record.
> >> > >> >
> >> > >> > But let's step back to your data for a moment. Where do these thousand
> >> > >> > dimensions come from? Do you really have a thousand hand-built
> >> > >> > features? Do you not have any sparse, text-like features?
> >> > >> >
> >> > >> > If you really only have a thousand-dimensional problem, then I think
> >> > >> > your model might exhibit early convergence.
> >> > >> >
> >> > >> > If not, it is quite possible to parallelize SGD, but this is only
> >> > >> > likely to help with sparse inputs that exhibit a long-tail frequency
> >> > >> > distribution.
> >> >
> >> > --
> >> > Lance Norskog
> >> > [email protected]
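On Stanley's split-and-average question in there: the simplest version is plain parameter averaging, where each mapper runs SGD on its own shard and the per-shard weight vectors are averaged at the end. A rough single-machine sketch in plain NumPy (illustrative only, not Mahout code, and no claim about how well averaging works on long-tailed features):

import numpy as np

# Toy sketch of the split-and-average idea quoted above (illustrative only):
# each "mapper" runs SGD for logistic regression on its own shard of the
# data, and the per-shard weight vectors are averaged into a single model.

def sgd_logistic(X, y, epochs=5, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += lr * (y[i] - p) * X[i]
    return w

def averaged_model(X, y, n_shards=4):
    shards = np.array_split(np.arange(len(y)), n_shards)
    weights = [sgd_logistic(X[idx], y[idx], seed=s) for s, idx in enumerate(shards)]
    return np.mean(weights, axis=0)          # simple parameter averaging

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.standard_normal((2000, 10))
    true_w = rng.standard_normal(10)
    y = (X @ true_w + 0.5 * rng.standard_normal(2000) > 0).astype(float)
    w = averaged_model(X, y)
    acc = np.mean((X @ w > 0) == (y == 1))
    print(f"averaged-model training accuracy: {acc:.3f}")

Per-feature voting, as Stanley also suggests, is the other obvious variant; averaging is just the easiest to write down.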
