Oh, they do use regularizers, but with a manually tuned regularization rate, it seems.
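
For anyone reading the archive later, here is a minimal sketch of the kind of per-rating SGD step being discussed, with a fixed, hand-tuned regularization rate. This is not the paper's code; the class name, rank, learning rate and lambda below are made up purely for illustration.

    // Illustrative only: one SGD step per observed rating, with plain L2
    // regularization at a fixed, manually chosen rate (no schedule, no
    // per-term adaptation). Not the paper's code.
    public class FactorSgdStep {

      /** One SGD step for an observed cell r ~ dot(ui, vj); updates ui and vj in place. */
      static void update(double[] ui, double[] vj, double rating,
                         double learningRate, double regRate) {
        double pred = 0.0;
        for (int k = 0; k < ui.length; k++) {
          pred += ui[k] * vj[k];
        }
        double err = rating - pred;
        for (int k = 0; k < ui.length; k++) {
          double u = ui[k];
          double v = vj[k];
          // squared error plus L2 penalty; constant factors folded into the rates
          ui[k] = u + learningRate * (err * v - regRate * u);
          vj[k] = v + learningRate * (err * u - regRate * v);
        }
      }

      public static void main(String[] args) {
        double[] u = {0.1, 0.1};
        double[] v = {0.1, 0.1};
        for (int step = 0; step < 100; step++) {
          update(u, v, 4.0, 0.05, 0.02);   // fixed, hand-tuned rates
        }
        // converges near 4.0, pulled down slightly by the regularizer
        System.out.println("prediction: " + (u[0] * v[0] + u[1] * v[1]));
      }
    }
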
On Tue, May 31, 2011 at 10:00 PM, Dmitriy Lyubimov <[email protected]> wrote:
> i must say i don't understand most of the math.
>
> as for sharding, if i understood it correctly, i remember having exactly the same idea for 'strata' selection as they show there, a year ago. But i think the problem is that you have to run as many MR jobs as the number of strata selected. I.e. if you parallelize it 5 ways (5 maps) then you have to run it at least 5 times. Or maybe one can recombine subepochs in reducers and have another run with reducers (so it's 3 times, not 5). Which seems to put fundamental limitations on the hadoopified scalability of this (they partly show increased time after some rather low # of mappers, which seems to confirm my old concern about this).
>
> it probably makes sense with a lot of data. It probably makes even more sense without the MR sort phase.
>
> Another thing i did not quite get: how do they cope with regularization? it looks like they don't want to use it. How is overfitting handled then?
>
> but it's compelling enough for my work that i could try it. Again, i probably did not get some aspects of the algorithm though.
>
> Factorization is essentially a quantitative (continuous) target regression, not a classification, so our abstract classifier interfaces probably would not fit here.
>
> On Tue, May 31, 2011 at 9:45 PM, Ted Dunning <[email protected]> wrote:
>> After a quick skimming of the paper, it looks vaguely like, if you reduced this to learning logistic regression, you would have something roughly the same as feature sharding.
>>
>> (which is still a good idea)
>>
>> With matrices, of course, you have two ways to shard, not just one.
>>
>> On Tue, May 31, 2011 at 7:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Interesting.
>>> i'd probably be interested in trying it out.
>>>
>>> On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <[email protected]> wrote:
>>> > Thanks Ted and Lance. And sorry for the jargon.
>>> >
>>> > For the delay Ted mentioned, we have already considered that; still, thanks a lot for all the detailed ideas, they were pretty helpful.
>>> > For parallelized SGD, I just found a new paper about using DSGD in matrix factorization. It's different from logistic regression, but might be helpful as well. Putting the title here in case anyone is interested: "Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent".
>>> >
>>> > Best wishes,
>>> > Stanley Xu
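
A minimal sketch of the stratum schedule that paper describes (not the authors' code; d and the worker numbering are illustrative). It also makes Dmitriy's point above concrete: the d blocks inside a sub-epoch touch disjoint rows and columns and can run in parallel, but the d sub-epochs themselves have to run one after another, e.g. as separate MR passes.

    // Illustrative stratum schedule: rows and columns are each split into d
    // blocks, and in sub-epoch s worker b owns block (b, (b + s) mod d). The d
    // blocks of one sub-epoch touch disjoint rows and columns, so they can be
    // processed in parallel; the d sub-epochs themselves are sequential.
    public class DsgdSchedule {

      public static void main(String[] args) {
        int d = 5; // number of workers, and of row/column blocks (illustrative)
        for (int subEpoch = 0; subEpoch < d; subEpoch++) {
          System.out.println("sub-epoch " + subEpoch + " (one parallel pass):");
          for (int worker = 0; worker < d; worker++) {
            int rowBlock = worker;
            int colBlock = (worker + subEpoch) % d;
            // each worker would run plain SGD over the ratings in block
            // (rowBlock, colBlock), touching only U[rowBlock] and V[colBlock]
            System.out.println("  worker " + worker
                + " -> row block " + rowBlock + ", column block " + colBlock);
          }
        }
      }
    }
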
>>> > On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <[email protected]> wrote:
>>> >> Yes.
>>> >>
>>> >> Apologies for jargon and TLA's <http://en.wikipedia.org/wiki/Three-letter_acronym>.
>>> >>
>>> >> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <[email protected]> wrote:
>>> >> > CTR == Clickthrough Rate
>>> >> >
>>> >> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <[email protected]> wrote:
>>> >> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]> wrote:
>>> >> > >
>>> >> > >> ... As I understood the algorithm, the training time only depends on the non-zero records, but per our test there is some overhead we could not ignore for those non-zero records, though the cost is sub-linear or logarithmic in the length of the hashed vector.
>>> >> > >>
>>> >> > >
>>> >> > > This is pretty close if we say "non-zero values". A record usually refers to an entire training example.
>>> >> > >
>>> >> > > The extra work refers mostly to deferred regularization that eventually has to be applied. My guess is that it is even less than log in the feature vector size.
>>> >> > >
>>> >> > >> And in CTR prediction, I am not so sure it will converge very quickly.
>>> >> > >
>>> >> > > I was saying this purely based on the number of features.
>>> >> > >
>>> >> > >> Because we will very possibly see some records that have almost the same features but different results in display ads.
>>> >> > >
>>> >> > > The algorithm can still converge to an estimate of the probability here.
>>> >> > >
>>> >> > >> But we will see the result in the future.
>>> >> > >
>>> >> > > You have to be *very* careful about this to avoid prejudicing the model against recent impressions. If you have fast feedback to the ad targeting system, you can have severe instability.
>>> >> > >
>>> >> > > The key thing that you have to do to avoid these biases is to define a maximum delay before click for the purposes of modeling. You need to ignore all impressions younger than this delay (because they may still get a click) and you need to ignore all clicks after this delay (to avoid bias in favor of old impressions). For on-line ads you can probably use a maximum delay of a few minutes because most clicks will happen by then.
>>> >> > >
>>> >> > > To find a good value for maximum delay, you should plot the CTR for a bunch of ads versus delay. This will increase rapidly shortly after zero delay, but then will level off. The ordering of ads by CTR is what you care about, so you can follow the curves back and find the shortest delay where the ordering is clearly preserved. Use that as your maximum delay. Typically this is roughly where your CTR is at about 80-90% of the final value.
>>> >> > >
>>> >> > >> (We were still working on creating a framework to dig all the features we need out of the log. I would like to share our experience with using Mahout SGD once we get our CTR prediction model released.)
>>> >> > >>
>>> >> > >> And for parallelizing SGD, what do you mean by "help with sparse inputs that exhibit a long-tail frequency distribution"? Would you like to share some of your ideas, Ted?
>>> >> > >>
>>> >> > >> Currently, what I can think of is to split the data across multiple mappers randomly, let every mapper learn from its local data, and take an average over the whole model, or let multiple models vote on every feature's weight. A little like the idea of AdaBoost or RandomForest. But I am not a scientist or mathematician, so no idea if it is correct or not.
>>> >> > >>
>>> >> > >> Thanks so much.
>>> >> > >> Stanley Xu
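
A minimal sketch of the "train locally, then average" idea Stanley describes just above, often called parameter averaging. There is no Hadoop plumbing here; each "mapper" is simulated by training a local logistic-regression weight vector on its own shard, and the "reducer" step simply averages the vectors. The data generator, shard counts and learning rate are toy values.

    // Illustrative only: per-shard logistic SGD followed by weight averaging,
    // simulating the map (local training) and reduce (averaging) phases.
    import java.util.Random;

    public class AveragedSgd {

      /** One pass of logistic SGD over a shard; returns the learned weights. */
      static double[] trainShard(double[][] x, int[] y, int dim, double lr) {
        double[] w = new double[dim];
        for (int i = 0; i < x.length; i++) {
          double margin = 0.0;
          for (int k = 0; k < dim; k++) {
            margin += w[k] * x[i][k];
          }
          double p = 1.0 / (1.0 + Math.exp(-margin));
          double err = y[i] - p;               // label minus predicted probability
          for (int k = 0; k < dim; k++) {
            w[k] += lr * err * x[i][k];        // stochastic log-likelihood gradient step
          }
        }
        return w;
      }

      public static void main(String[] args) {
        Random rnd = new Random(42);
        int shards = 4;
        int perShard = 1000;
        int dim = 3;
        double[] trueW = {1.5, -2.0, 0.5};
        double[] avg = new double[dim];

        for (int s = 0; s < shards; s++) {     // "map" phase: independent shards
          double[][] x = new double[perShard][dim];
          int[] y = new int[perShard];
          for (int i = 0; i < perShard; i++) {
            double margin = 0.0;
            for (int k = 0; k < dim; k++) {
              x[i][k] = rnd.nextGaussian();
              margin += trueW[k] * x[i][k];
            }
            y[i] = rnd.nextDouble() < 1.0 / (1.0 + Math.exp(-margin)) ? 1 : 0;
          }
          double[] w = trainShard(x, y, dim, 0.1);
          for (int k = 0; k < dim; k++) {      // "reduce" phase: average the weights
            avg[k] += w[k] / shards;
          }
        }
        System.out.println("averaged weights: " + avg[0] + ", " + avg[1] + ", "
            + avg[2] + "  (generating weights were 1.5, -2.0, 0.5)");
      }
    }
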
>>> >> > >>
>>> >> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]> wrote:
>>> >> > >>
>>> >> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]> wrote:
>>> >> > >> >
>>> >> > >> > > 1 hour is acceptable, but I guess you misunderstood the data scale I mean here. The 900M records didn't mean 900M bytes, but 900M lines of training set (900M training examples). If every training example has 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we reduce the logs collected to 14 days, it would still be 2-3 TB of data.
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > Oops. Forgot that last multiplier.
>>> >> > >> >
>>> >> > >> > > Per our simple test, for 1000 dimensions and 10M lines of records, it will take about 1-2 hours to do the training, so 900M lines of data will cost at least 90 hours, is that correct?
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > 10M x 1000 x 8 = 80 GB.
>>> >> > >> >
>>> >> > >> > 1-2 hours = (approx) 5000 seconds. So this is
>>> >> > >> >
>>> >> > >> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
>>> >> > >> >
>>> >> > >> > Yes. This is a reasonable speed. I think you can get a small factor faster than this with SGD. I have seen 100 million records with more non-zero values than you describe with a training time of 3 hours. I would not expect even as much as a factor of 10 speedup here.
>>> >> > >> >
>>> >> > >> > > And from the PPT you provided, http://www.slideshare.net/tdunning/sdforum-11042010, you said it would take less than an hour for 20M data records with mixed numeric/categorical dimensions. I am wondering, how many dimensions per record?
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > These are sparse records with about a thousand non-zero elements per record.
>>> >> > >> >
>>> >> > >> > But let's step back to your data for a moment. Where do these thousand dimensions come from? Do you really have a thousand hand-built features? Do you not have any sparse, text-like features?
>>> >> > >> >
>>> >> > >> > If you really only have a thousand-dimensional problem, then I think your model might exhibit early convergence.
>>> >> > >> >
>>> >> > >> > If not, it is quite possible to parallelize SGD, but this is only likely to help with sparse inputs that exhibit a long-tail frequency distribution.
>>> >> > >> >
>>> >> >
>>> >> > --
>>> >> > Lance Norskog
>>> >> > [email protected]
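
Finally, a minimal sketch of the maximum-delay rule Ted describes upthread for building CTR training labels. The class name, timestamps and the few-minute window below are illustrative; the point is only the two filters: skip impressions that are still inside the window, and ignore clicks that arrive after the window has closed.

    // Illustrative only: label an impression for CTR training using a fixed
    // maximum click delay. Impressions younger than the window are skipped
    // (they may still get a click); clicks later than the window are ignored
    // (counting them would favor old impressions that had more time).
    public class ClickLabeler {

      enum Label { CLICKED, NOT_CLICKED, TOO_RECENT_SKIP }

      static Label label(long impressionMillis, Long clickMillis,
                         long nowMillis, long maxDelayMillis) {
        if (nowMillis - impressionMillis < maxDelayMillis) {
          return Label.TOO_RECENT_SKIP;
        }
        if (clickMillis != null && clickMillis - impressionMillis <= maxDelayMillis) {
          return Label.CLICKED;
        }
        // either no click at all, or a click that came in after the window closed
        return Label.NOT_CLICKED;
      }

      public static void main(String[] args) {
        long maxDelay = 5L * 60L * 1000L;  // a few minutes, per the discussion above
        long now = 10000000L;
        System.out.println(label(9900000L, null, now, maxDelay));       // TOO_RECENT_SKIP
        System.out.println(label(9000000L, 9100000L, now, maxDelay));   // CLICKED
        System.out.println(label(9000000L, 9800000L, now, maxDelay));   // NOT_CLICKED (late click)
        System.out.println(label(9000000L, null, now, maxDelay));       // NOT_CLICKED
      }
    }
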
