Re: Which exact algorithm is used in the Mahout SGD?

Ted Dunning Thu, 28 Apr 2011 23:09:17 -0700

Yes.

Apologies for jargon and TLA<http://en.wikipedia.org/wiki/Three-letter_acronym>
's


On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <[email protected]> wrote:

> CTR == Clickthrough Rate
>
> On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <[email protected]>
> wrote:
> > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]> wrote:
> >
> >> ... I understood as the algorithm, the time in training only relies on
> the
> >> non-zero records, but per our test, there would be some overhead we
> could
> >> not ignore for thoso non-zero records, though the cost is sub-linear or
> >> logit to the length of the hashed vector.
> >>
> >
> > This is pretty close if we say "non-zero values".  A record usually
> refers
> > to an entire training
> > example.
> >
> > The extra work refers mostly to deferred regularization that eventually
> has
> > to be
> > applied.  My guess is that it is even less than log in the feature vector
> > size.
> >
> >
> >> And in CTR prediction, I am not pretty sure it will converge very
> quickly.
> >>
> >
> > I was saying this purely based on the number of features.
> >
> >
> >> Because we will very possibly see some records has the almost same
> feature
> >> but different result in display ads.
> >
> >
> > The algorithm can still converge to an estimate of the probability here.
> >
> >
> >> But we will see the result in the
> >> future.
> >
> >
> > You have to be *very* careful about this to avoid prejudicing the model
> > against
> > recent impressions.  If you have a fast feedback to the ad targeting
> system,
> > you
> > can have severely instability.
> >
> > The key thing that you have to do to avoid these biases is to define a
> > maximum
> > delay before click for the purposes of modeling.  You need to ignore all
> > impressions
> > younger than this delay (because they may still get a click) and you need
> to
> > ignore
> > all clicks after this delay (to avoid bias in favor of old impressions).
> >  For on-line ads
> > you can probably use a maximum delay of a few minutes because most clicks
> > will
> > happen by then.
> >
> > To find a good value for maximum delay, you should plot the CTR for a
> bunch
> > of
> > ads versus delay.  This will increase rapidly shortly after zero delay,
> but
> > then will
> > level off.  The ordering of ads by CTR is what you care about so you can
> > follow the
> > curves back and find the shortest delay where the ordering is clearly
> > preserved.  Use
> > that as your maximum delay.  Typically this is roughly where your CTR is
> at
> > about
> > 80-90% of the final value.
> >
> >
> >
> >
> >> (We were still working on creating a framework to digg all the
> >> features we need from the log, I would like to share our experience by
> >> using
> >> Mahout SGD once we got our CTR prediction model release.)
> >>
> >> And for parallelize SGD, what do you mean for help with sparse inputs
> that
> >> exhibit long-tail frequency distribution? Would you like to share some
> of
> >> your ideas, Ted?
> >>
> >> Currently, what I could think about is split the data to multiple mapper
> >> randomly and let every mapper to learn from the local data and get an
> >> average on the whole model, or let multiple model to vote for every
> >> feature's weight. A little like the idea of AdaBoost or RandomForest.
> But I
> >> am not a scientist or mathematician, so no idea if it is correct or not.
> >>
> >>
> >> Thanks so much.
> >> Stanley Xu
> >>
> >>
> >>
> >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]>
> >> wrote:
> >>
> >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]>
> >> wrote:
> >> >
> >> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
> >> mean
> >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
> >> training
> >> > > set(900M training example.). If every training data has 1000
> dimension,
> >> > it
> >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
> collected
> >> > to
> >> > > 14 days, it would be still 2-3TB data.
> >> > >
> >> >
> >> > Oops.  Forgot that last multiplier.
> >> >
> >> >
> >> > > Per our simple test, for 1000 dimension, 10M lines of record, it
> will
> >> > take
> >> > > about 1-2 hours to do the training, so 90M lines of data will cost
> at
> >> > least
> >> > > 90 hours, is that correct?
> >> > >
> >> >
> >> > 10M x 1000 x 8 = 80 GB.
> >> >
> >> > 1-2 hours = (approx) 5000 seconds.  So this is
> >> >
> >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
> >> >
> >> > Yes.  This is reasonable speed.  I think you can get a small factor
> >> faster
> >> > than this with SGD.  I have seen 100 million records with more
> non-zero
> >> > values than you describe with a training time of 3 hours.  I would not
> >> > expect even as much as a factor of 10 speedup here.
> >> >
> >> >
> >> > >
> >> > > And from the PPT you provided
> >> > > http://www.slideshare.net/tdunning/sdforum-11042010
> >> > > You said it would take less than an hour for 20M data records for
> >> > > numeric/category mixed dimensions. I am wondering, how many
> dimensions
> >> > per
> >> > > record?
> >> > >
> >> >
> >> > These are sparse records records with about a thousand non-zero
> elements
> >> > per
> >> > record.
> >> >
> >> >
> >> > But let's step back to your data for a moment.  Where do these
> thousand
> >> > dimensions come from?  Do you really have a thousand hand-built
> features?
> >> >  Do you not have any sparse, text-like features?
> >> >
> >> > If you really only have a thousand dimensional problem, then I think
> your
> >> > model might exhibit early convergence.
> >> >
> >> > If not, it is quite possible to parallelize SGD, but this is only
> likely
> >> to
> >> > help with sparse inputs that exhibit long-tail frequency distribution.
> >> >
> >>
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>

Re: Which exact algorithm is used in the Mahout SGD?

Reply via email to