Synthetic datasets in that size range are fairly straightforward to
generate from random data.

They are not particularly realistic, but they are highly parametrizable.
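
As a rough sketch of what I mean by generating from random data, something
like the following would do.  It draws labels from a random logistic model,
so an SGD learner has real signal to find.  The function names and the
parameters (n_rows, n_features, density, noise, seed) are my own
illustrative choices, not part of any library, and the CSV layout is just
one plausible format.

```python
import csv
import math
import random


def make_synthetic_logistic(n_rows, n_features, density=1.0, noise=0.0, seed=0):
    """Yield (features, label) pairs drawn from a random logistic model.

    Hypothetical helper for illustration; all parameter names are assumptions.
    """
    rng = random.Random(seed)
    # Hidden "true" weights that the learner should approximately recover.
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    for _ in range(n_rows):
        # Each feature is nonzero with probability `density`.
        x = [rng.gauss(0.0, 1.0) if rng.random() < density else 0.0
             for _ in range(n_features)]
        score = sum(w * v for w, v in zip(weights, x))
        # Numerically stable sigmoid.
        if score >= 0:
            p = 1.0 / (1.0 + math.exp(-score))
        else:
            e = math.exp(score)
            p = e / (1.0 + e)
        label = 1 if rng.random() < p else 0
        # Flip a fraction of the labels to make the problem harder.
        if rng.random() < noise:
            label = 1 - label
        yield x, label


def write_csv(rows, out):
    """Write rows to an open text file, label first, then the features."""
    writer = csv.writer(out)
    count = 0
    for x, label in rows:
        writer.writerow([label] + x)
        count += 1
    return count
```

To hit a target file size, write a few thousand rows, measure the bytes per
row, and scale n_rows accordingly.  Lowering density gives sparser data and
raising noise makes the problem less separable.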

RCV1 should be almost in that range.  I think that the recent KDD music
exercise would also be in that range if viewed as a classification
problem.  See
http://jmlr.csail.mit.edu/proceedings/papers/v18/xie12a/xie12a.pdf for an
example of how this can be done.

On Tue, Aug 28, 2012 at 11:07 AM, Josh Patterson <[email protected]> wrote:

> Does anyone have any great suggestions for open datasets to run/test
> SGD on that are in the 500MB - 1GB range?
>
> Just looking for nice benchmarking datasets, wondered what the
> community thought here.
>
> Thanks,
>
> Josh
>
> --
> Twitter: @jpatanooga
> Principal Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com
>
