CTR == Clickthrough Rate

On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <[email protected]> wrote:

> On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <[email protected]> wrote:
>
>> ... As I understood the algorithm, the time in training only relies on the
>> non-zero records, but per our test, there would be some overhead we could
>> not ignore for those non-zero records, though the cost is sub-linear or
>> logarithmic in the length of the hashed vector.
>>
>
> This is pretty close if we say "non-zero values". A record usually refers
> to an entire training example.
>
> The extra work refers mostly to deferred regularization that eventually
> has to be applied. My guess is that it is even less than log in the
> feature vector size.
>
>
>> And in CTR prediction, I am not quite sure it will converge very quickly.
>>
>
> I was saying this purely based on the number of features.
>
>
>> Because we will very possibly see some records that have almost the same
>> features but different results in display ads.
>
>
> The algorithm can still converge to an estimate of the probability here.
>
>
>> But we will see the result in the future.
>
>
> You have to be *very* careful about this to avoid prejudicing the model
> against recent impressions. If you have fast feedback to the ad targeting
> system, you can have severe instability.
>
> The key thing that you have to do to avoid these biases is to define a
> maximum delay before click for the purposes of modeling. You need to
> ignore all impressions younger than this delay (because they may still get
> a click) and you need to ignore all clicks after this delay (to avoid bias
> in favor of old impressions). For on-line ads you can probably use a
> maximum delay of a few minutes because most clicks will happen by then.
>
> To find a good value for maximum delay, you should plot the CTR for a
> bunch of ads versus delay. This will increase rapidly shortly after zero
> delay, but then will level off. The ordering of ads by CTR is what you
> care about, so you can follow the curves back and find the shortest delay
> where the ordering is clearly preserved. Use that as your maximum delay.
> Typically this is roughly where your CTR is at about 80-90% of the final
> value.
>
>
>> (We were still working on creating a framework to dig all the features we
>> need from the logs. I would like to share our experience with using
>> Mahout SGD once we get our CTR prediction model released.)
>>
>> And for parallelizing SGD, what do you mean by "help with sparse inputs
>> that exhibit a long-tail frequency distribution"? Would you like to share
>> some of your ideas, Ted?
>>
>> Currently, what I can think of is to split the data randomly across
>> multiple mappers, let every mapper learn from its local data, and average
>> over the whole model, or let multiple models vote for every feature's
>> weight. A little like the idea of AdaBoost or RandomForest. But I am not
>> a scientist or mathematician, so no idea if it is correct or not.
>>
>>
>> Thanks so much.
>> Stanley Xu
>>
>>
>>
>> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]>
>> > wrote:
>> >
>> > > 1 hour is acceptable, but I guess you misunderstood the data scale I
>> > > mean here. The 900M records didn't mean 900M bytes, but 900M lines of
>> > > training data (900M training examples). If every training example has
>> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we
>> > > reduce the logs collected to 14 days, it would still be 2-3 TB of
>> > > data.
>> > >
>> >
>> > Oops. Forgot that last multiplier.
>> >
>> >
>> > > Per our simple test, for 1000 dimensions and 10M lines of records, it
>> > > will take about 1-2 hours to do the training, so 900M lines of data
>> > > will cost at least 90 hours. Is that correct?
>> > >
>> >
>> > 10M x 1000 x 8 = 80 GB.
>> >
>> > 1-2 hours = (approx) 5000 seconds. So this is
>> >
>> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
>> >
>> > Yes. This is reasonable speed. I think you can get a small factor
>> > faster than this with SGD. I have seen 100 million records with more
>> > non-zero values than you describe with a training time of 3 hours. I
>> > would not expect even as much as a factor of 10 speedup here.
>> >
>> >
>> > > And from the PPT you provided
>> > > http://www.slideshare.net/tdunning/sdforum-11042010
>> > > you said it would take less than an hour for 20M data records with
>> > > mixed numeric/categorical dimensions. I am wondering, how many
>> > > dimensions per record?
>> > >
>> >
>> > These are sparse records with about a thousand non-zero elements per
>> > record.
>> >
>> >
>> > But let's step back to your data for a moment. Where do these thousand
>> > dimensions come from? Do you really have a thousand hand-built
>> > features? Do you not have any sparse, text-like features?
>> >
>> > If you really only have a thousand dimensional problem, then I think
>> > your model might exhibit early convergence.
>> >
>> > If not, it is quite possible to parallelize SGD, but this is only
>> > likely to help with sparse inputs that exhibit a long-tail frequency
>> > distribution.
>> >
>>
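
A minimal sketch (plain Java, not from the thread) of the maximum-delay
labeling Ted describes above; the 5-minute window, field layout, and names
are assumptions for illustration only:

// Label impressions using a fixed maximum delay before click.
public class DelayLabeler {
  // Chosen from the CTR-vs-delay curves; a few minutes per Ted's suggestion.
  static final long MAX_DELAY_MS = 5 * 60 * 1000L;

  // Returns 1 (click) or 0 (no click), or -1 if the impression must be
  // skipped because it is still younger than the maximum delay.
  static int label(long impressionMillis, long clickMillis, long nowMillis) {
    if (nowMillis - impressionMillis < MAX_DELAY_MS) {
      return -1;                 // too young: it may still get a click
    }
    if (clickMillis >= 0 && clickMillis - impressionMillis <= MAX_DELAY_MS) {
      return 1;                  // click arrived within the window
    }
    return 0;                    // no click, or the click came too late
  }
}

And a rough sketch of the model-averaging idea Stanley mentions above: each
mapper trains its own SGD model on a random shard and emits a dense weight
vector, which a single reducer averages. The double[] representation is an
assumption, and whether plain averaging works well here is exactly the open
question in the thread:

import java.util.List;

public class ModelAverager {
  // Element-wise average of the per-mapper weight vectors.
  public static double[] average(List<double[]> perMapperWeights) {
    int dim = perMapperWeights.get(0).length;
    double[] avg = new double[dim];
    for (double[] weights : perMapperWeights) {
      for (int i = 0; i < dim; i++) {
        avg[i] += weights[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      avg[i] /= perMapperWeights.size();
    }
    return avg;
  }
}

The per-feature "voting" variant Stanley mentions would replace the average
with, say, a median over the per-mapper weights.
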
--
Lance Norskog
[email protected]
