On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <[email protected]> wrote:

> 1 hour is acceptable, but I guess you misunderstand the data scale I mean
> here. The 900M records didn't mean 900M bytes, but 900M lines of training
> data (900M training examples). If every training example has 1000
> dimensions, it means 900 million x 1000 x 16 B = 14TB. If we reduce the
> logs collected to 14 days, it would still be 2-3TB of data.
>

Oops.  Forgot that last multiplier.


> Per our simple test, with 1000 dimensions and 10M lines of records, it
> takes about 1-2 hours to do the training, so 900M lines of data would cost
> at least 90 hours. Is that correct?
>

10M x 1000 x 8 B = 80 GB.

1-2 hours is approximately 5000 seconds.  So this is

80 GB / 5000 s = 16 MB/s

Yes.  This is a reasonable speed.  I think you can get a small factor faster
than this with SGD.  I have seen 100 million records, with more non-zero
values per record than you describe, train in about 3 hours.  I would not
expect even as much as a factor of 10 speedup here.
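
To make the arithmetic explicit, here is a tiny back-of-envelope
calculation.  The inputs are the rough assumptions from above (8 bytes per
value, dense vectors, a 1-2 hour run rounded to about 5000 seconds), not
measurements:

    public class ThroughputEstimate {
        public static void main(String[] args) {
            long records = 10000000L;     // 10M training examples
            long dims = 1000L;            // values per example
            long bytesPerValue = 8L;      // double precision, dense
            double seconds = 5000.0;      // roughly 1-2 hours

            double gigabytes = records * dims * bytesPerValue / 1e9;
            double mbPerSecond = gigabytes * 1000.0 / seconds;

            System.out.printf("data size:  %.0f GB%n", gigabytes);     // ~80 GB
            System.out.printf("throughput: %.0f MB/s%n", mbPerSecond); // ~16 MB/s
        }
    }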


>
> And from the PPT you provided
> http://www.slideshare.net/tdunning/sdforum-11042010
> You said it would take less than an hour for 20M data records for
> numeric/category mixed dimensions. I am wondering, how many dimensions per
> record?
>

These are sparse records with about a thousand non-zero elements per
record.


But let's step back to your data for a moment.  Where do these thousand
dimensions come from?  Do you really have a thousand hand-built features?
 Do you not have any sparse, text-like features?
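
To be concrete about what I mean by sparse, text-like features: something
like hashed token counts, where any one record only touches a handful of
the available dimensions.  This is just an illustrative sketch of the plain
hashing trick (the class name, bucket count, and input are made up, and it
is not how Mahout's encoders are written):

    import java.util.HashMap;
    import java.util.Map;

    public class HashedFeatures {
        static final int BUCKETS = 1000;   // illustrative dimensionality

        // encode one record as a sparse map from feature index to count
        static Map<Integer, Double> encode(String text) {
            Map<Integer, Double> features = new HashMap<Integer, Double>();
            for (String token : text.toLowerCase().split("\\s+")) {
                int index = (token.hashCode() & 0x7fffffff) % BUCKETS;
                Double old = features.get(index);
                features.put(index, old == null ? 1.0 : old + 1.0);
            }
            return features;
        }

        public static void main(String[] args) {
            // a five-token record fills at most 5 of the 1000 dimensions
            System.out.println(encode("user clicked sports article yesterday"));
        }
    }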

If you really only have a thousand-dimensional problem, then I think your
model might exhibit early convergence.

If not, it is quite possible to parallelize SGD, but this is only likely to
help with sparse inputs that exhibit a long-tailed frequency distribution.
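
As a rough sketch of why the sparsity matters: with sparse, long-tailed
inputs you can let several threads update a shared weight vector without
locking (Hogwild-style), because two concurrent updates rarely touch the
same weights.  This is only a toy with synthetic data and a made-up
learning rate, not how Mahout's trainer is implemented:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    public class ParallelSgdSketch {
        static final int DIMS = 1000;
        static final double[] weights = new double[DIMS];  // shared, deliberately unsynchronized

        // one lock-free SGD step for logistic regression on a sparse example
        static void sgdStep(int[] idx, double[] val, int label, double rate) {
            double dot = 0.0;
            for (int k = 0; k < idx.length; k++) dot += weights[idx[k]] * val[k];
            double p = 1.0 / (1.0 + Math.exp(-dot));   // predicted probability
            double scale = rate * (label - p);         // gradient of the log-likelihood
            for (int k = 0; k < idx.length; k++) weights[idx[k]] += scale * val[k];
        }

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                pool.submit(new Runnable() {
                    public void run() {
                        ThreadLocalRandom rnd = ThreadLocalRandom.current();
                        for (int i = 0; i < 100000; i++) {
                            // synthetic sparse example: 3 of 1000 dimensions are non-zero
                            int[] idx = { rnd.nextInt(DIMS), rnd.nextInt(DIMS), rnd.nextInt(DIMS) };
                            double[] val = { 1.0, 1.0, 1.0 };
                            int label = idx[0] < DIMS / 2 ? 1 : 0;  // arbitrary target
                            sgdStep(idx, val, label, 0.01);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println("sample weight w[0] = " + weights[0]);
        }
    }

The point is just that with only a few non-zero indices per example, the
unsynchronized writes almost never collide; with dense 1000-dimensional
inputs every step would touch every weight and this approach falls apart.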
