Re: Which exact algorithm is used in the Mahout SGD?

Stanley Xu Thu, 28 Apr 2011 23:31:48 -0700

Thanks Ted and Lance. And sorry for the jargon.

For the delay Ted mentioned, we have already considered that, still thanks a
lot for all the detail ideas, they were pretty helpful.
For the parallelized SGD, just found a new paper about using DSGD in matrix
factorization, it's different from logistic regression, but might helpful as
well. Put the title here "Large-Scale Matrix Factorization with Distributed
Stochastic Gradient Descent" if anyone is interested.


Best wishes,
Stanley Xu
On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <[email protected]> wrote:

> Yes.
>
> Apologies for jargon and TLA<
> http://en.wikipedia.org/wiki/Three-letter_acronym>
> 's
>
> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <[email protected]> wrote:
>
> > CTR == Clickthrough Rate
> >
> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <[email protected]>
> > wrote:
> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]>
> wrote:
> > >
> > >> ... I understood as the algorithm, the time in training only relies on
> > the
> > >> non-zero records, but per our test, there would be some overhead we
> > could
> > >> not ignore for thoso non-zero records, though the cost is sub-linear
> or
> > >> logit to the length of the hashed vector.
> > >>
> > >
> > > This is pretty close if we say "non-zero values".  A record usually
> > refers
> > > to an entire training
> > > example.
> > >
> > > The extra work refers mostly to deferred regularization that eventually
> > has
> > > to be
> > > applied.  My guess is that it is even less than log in the feature
> vector
> > > size.
> > >
> > >
> > >> And in CTR prediction, I am not pretty sure it will converge very
> > quickly.
> > >>
> > >
> > > I was saying this purely based on the number of features.
> > >
> > >
> > >> Because we will very possibly see some records has the almost same
> > feature
> > >> but different result in display ads.
> > >
> > >
> > > The algorithm can still converge to an estimate of the probability
> here.
> > >
> > >
> > >> But we will see the result in the
> > >> future.
> > >
> > >
> > > You have to be *very* careful about this to avoid prejudicing the model
> > > against
> > > recent impressions.  If you have a fast feedback to the ad targeting
> > system,
> > > you
> > > can have severely instability.
> > >
> > > The key thing that you have to do to avoid these biases is to define a
> > > maximum
> > > delay before click for the purposes of modeling.  You need to ignore
> all
> > > impressions
> > > younger than this delay (because they may still get a click) and you
> need
> > to
> > > ignore
> > > all clicks after this delay (to avoid bias in favor of old
> impressions).
> > >  For on-line ads
> > > you can probably use a maximum delay of a few minutes because most
> clicks
> > > will
> > > happen by then.
> > >
> > > To find a good value for maximum delay, you should plot the CTR for a
> > bunch
> > > of
> > > ads versus delay.  This will increase rapidly shortly after zero delay,
> > but
> > > then will
> > > level off.  The ordering of ads by CTR is what you care about so you
> can
> > > follow the
> > > curves back and find the shortest delay where the ordering is clearly
> > > preserved.  Use
> > > that as your maximum delay.  Typically this is roughly where your CTR
> is
> > at
> > > about
> > > 80-90% of the final value.
> > >
> > >
> > >
> > >
> > >> (We were still working on creating a framework to digg all the
> > >> features we need from the log, I would like to share our experience by
> > >> using
> > >> Mahout SGD once we got our CTR prediction model release.)
> > >>
> > >> And for parallelize SGD, what do you mean for help with sparse inputs
> > that
> > >> exhibit long-tail frequency distribution? Would you like to share some
> > of
> > >> your ideas, Ted?
> > >>
> > >> Currently, what I could think about is split the data to multiple
> mapper
> > >> randomly and let every mapper to learn from the local data and get an
> > >> average on the whole model, or let multiple model to vote for every
> > >> feature's weight. A little like the idea of AdaBoost or RandomForest.
> > But I
> > >> am not a scientist or mathematician, so no idea if it is correct or
> not.
> > >>
> > >>
> > >> Thanks so much.
> > >> Stanley Xu
> > >>
> > >>
> > >>
> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]>
> > >> wrote:
> > >>
> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]>
> > >> wrote:
> > >> >
> > >> > > 1 hour is acceptable, but I guess you misunderstand the data scale
> I
> > >> mean
> > >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
> > >> training
> > >> > > set(900M training example.). If every training data has 1000
> > dimension,
> > >> > it
> > >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
> > collected
> > >> > to
> > >> > > 14 days, it would be still 2-3TB data.
> > >> > >
> > >> >
> > >> > Oops.  Forgot that last multiplier.
> > >> >
> > >> >
> > >> > > Per our simple test, for 1000 dimension, 10M lines of record, it
> > will
> > >> > take
> > >> > > about 1-2 hours to do the training, so 90M lines of data will cost
> > at
> > >> > least
> > >> > > 90 hours, is that correct?
> > >> > >
> > >> >
> > >> > 10M x 1000 x 8 = 80 GB.
> > >> >
> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
> > >> >
> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
> > >> >
> > >> > Yes.  This is reasonable speed.  I think you can get a small factor
> > >> faster
> > >> > than this with SGD.  I have seen 100 million records with more
> > non-zero
> > >> > values than you describe with a training time of 3 hours.  I would
> not
> > >> > expect even as much as a factor of 10 speedup here.
> > >> >
> > >> >
> > >> > >
> > >> > > And from the PPT you provided
> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
> > >> > > You said it would take less than an hour for 20M data records for
> > >> > > numeric/category mixed dimensions. I am wondering, how many
> > dimensions
> > >> > per
> > >> > > record?
> > >> > >
> > >> >
> > >> > These are sparse records records with about a thousand non-zero
> > elements
> > >> > per
> > >> > record.
> > >> >
> > >> >
> > >> > But let's step back to your data for a moment.  Where do these
> > thousand
> > >> > dimensions come from?  Do you really have a thousand hand-built
> > features?
> > >> >  Do you not have any sparse, text-like features?
> > >> >
> > >> > If you really only have a thousand dimensional problem, then I think
> > your
> > >> > model might exhibit early convergence.
> > >> >
> > >> > If not, it is quite possible to parallelize SGD, but this is only
> > likely
> > >> to
> > >> > help with sparse inputs that exhibit long-tail frequency
> distribution.
> > >> >
> > >>
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > [email protected]
> >
>

Re: Which exact algorithm is used in the Mahout SGD?

Reply via email to