Hi Ted,

For the data: currently we dig through the logs for a specific cookie. For
example, we check how many times that cookie has seen the advertiser's
banner in the last 7 days. We don't have 1000 non-zero values yet; I think
we only have 100-200 now, and I expect 1000 at most. We won't have many
text-like features, because what we are predicting is for display ads
rather than search ads. We might introduce a tagging system for ad
management in the future, to see if adding such features gives a better
result.
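
(Just to illustrate what I mean by digging the logs: the counting is
roughly like the sketch below. The log record fields here are made-up
placeholders, not our real log format.)

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical impression record: (cookieId, advertiserId, timestampMillis).
class Impression {
    final String cookieId;
    final String advertiserId;
    final long timestampMillis;

    Impression(String cookieId, String advertiserId, long timestampMillis) {
        this.cookieId = cookieId;
        this.advertiserId = advertiserId;
        this.timestampMillis = timestampMillis;
    }
}

class ImpressionCounter {
    private static final long SEVEN_DAYS_MS = 7L * 24 * 60 * 60 * 1000;

    // How many times each cookie saw this advertiser's banner in the last 7 days.
    static Map<String, Integer> countLast7Days(List<Impression> log,
                                               String advertiserId,
                                               long now) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Impression imp : log) {
            if (imp.advertiserId.equals(advertiserId)
                    && now - imp.timestampMillis <= SEVEN_DAYS_MS) {
                Integer c = counts.get(imp.cookieId);
                counts.put(imp.cookieId, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }
}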

As I understand the algorithm, the training time depends only on the
non-zero entries, but per our tests there is some overhead beyond those
non-zero entries that we could not ignore, though that cost is sub-linear
(roughly logarithmic) in the length of the hashed vector.
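
(Something like this is how we are exercising it -- just a sketch against
Mahout's OnlineLogisticRegression; the vector size, indices, values and
hyperparameters below are made-up placeholders, not our real features.)

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseTrainSketch {
    public static void main(String[] args) {
        int hashedVectorSize = 100000;  // length of the hashed feature vector

        OnlineLogisticRegression learner =
                new OnlineLogisticRegression(2, hashedVectorSize, new L1())
                        .learningRate(0.1)
                        .lambda(1.0e-5);

        // One training example: only a few entries are non-zero, so the update
        // cost should be driven by these, not by hashedVectorSize.
        Vector instance = new RandomAccessSparseVector(hashedVectorSize);
        instance.set(0, 1.0);      // bias term
        instance.set(42, 3.0);     // e.g. "saw this banner 3 times in 7 days"
        instance.set(90123, 1.0);  // e.g. some hashed categorical feature

        int clicked = 1;           // label: 1 = click, 0 = no click
        learner.train(clicked, instance);

        System.out.println("predicted CTR = " + learner.classifyScalar(instance));
    }
}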

And for CTR prediction, I am not so sure it will converge very quickly,
because in display ads we will very likely see records with almost the same
features but different outcomes. We will see the results in the future. (We
are still working on a framework to dig all the features we need out of the
logs; I would like to share our experience with Mahout SGD once our CTR
prediction model is released.)

And about parallelizing SGD: what do you mean by helping with sparse inputs
that exhibit a long-tail frequency distribution? Would you like to share
some of your ideas, Ted?

Currently, what I can think of is to split the data randomly across
multiple mappers, let every mapper learn from its local data, and then
average the local models into one, or let the multiple models vote on every
feature's weight, a little like the idea behind AdaBoost or random forests.
But I am not a scientist or mathematician, so I have no idea whether that is
correct.
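
(To make it concrete, the averaging variant I have in mind is roughly like
this -- a plain Java sketch, no Hadoop; each partition stands for one
mapper's local data, and the local training step is just a toy
logistic-regression SGD update, not Mahout's code.)

public class AveragedSgdSketch {

    // One pass of logistic-regression SGD over a single partition.
    static double[] trainLocal(double[][] x, int[] y, double learningRate) {
        double[] w = new double[x[0].length];
        for (int i = 0; i < x.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < w.length; j++) {
                dot += w[j] * x[i][j];
            }
            double p = 1.0 / (1.0 + Math.exp(-dot));      // predicted CTR
            double gradScale = learningRate * (y[i] - p); // gradient step
            for (int j = 0; j < w.length; j++) {
                w[j] += gradScale * x[i][j];
            }
        }
        return w;
    }

    // Average the per-partition models into a single weight vector.
    static double[] average(double[][] localModels) {
        double[] avg = new double[localModels[0].length];
        for (double[] w : localModels) {
            for (int j = 0; j < avg.length; j++) {
                avg[j] += w[j] / localModels.length;
            }
        }
        return avg;
    }
}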


Thanks so much.
Stanley Xu



On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]> wrote:
>
> > 1 hour is acceptable, but I guess you misunderstand the data scale I mean
> > here. The 900M records didn't mean 900M Bytes, but 900M lines of training
> > set(900M training example.). If every training data has 1000 dimension,
> it
> > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs collected
> to
> > 14 days, it would be still 2-3TB data.
> >
>
> Oops.  Forgot that last multiplier.
>
>
> > Per our simple test, for 1000 dimension, 10M lines of record, it will
> take
> > about 1-2 hours to do the training, so 90M lines of data will cost at
> least
> > 90 hours, is that correct?
> >
>
> 10M x 1000 x 8 = 80 GB.
>
> 1-2 hours = (approx) 5000 seconds.  So this is
>
> 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>
> Yes.  This is reasonable speed.  I think you can get a small factor faster
> than this with SGD.  I have seen 100 million records with more non-zero
> values than you describe with a training time of 3 hours.  I would not
> expect even as much as a factor of 10 speedup here.
>
>
> >
> > And from the PPT you provided
> > http://www.slideshare.net/tdunning/sdforum-11042010
> > You said it would take less than an hour for 20M data records for
> > numeric/category mixed dimensions. I am wondering, how many dimensions
> per
> > record?
> >
>
> These are sparse records with about a thousand non-zero elements
> per
> record.
>
>
> But let's step back to your data for a moment.  Where do these thousand
> dimensions come from?  Do you really have a thousand hand-built features?
>  Do you not have any sparse, text-like features?
>
> If you really only have a thousand dimensional problem, then I think your
> model might exhibit early convergence.
>
> If not, it is quite possible to parallelize SGD, but this is only likely to
> help with sparse inputs that exhibit long-tail frequency distribution.
>
