These are fairly straightforward to generate from random data. Not particularly realistic, but highly parametrizable.
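
As a rough sketch of what generating from random data could look like: the snippet below draws sparse rows from a hidden random linear model and flips some labels so the problem is not perfectly separable. The model, the function names (`synthetic_rows`, `to_svmlight`), and every parameter are assumptions made for illustration, not something specified in this thread.

```python
import random


def synthetic_rows(n_rows, n_features, nnz_per_row, noise=0.1, seed=0):
    """Yield (label, {feature_index: value}) pairs from a random linear model.

    Hypothetical generator: all names and defaults are illustrative.
    """
    rng = random.Random(seed)
    # Hidden weight vector that defines the "true" decision boundary.
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    for _ in range(n_rows):
        # Pick a sparse set of active features for this row.
        active = rng.sample(range(n_features), nnz_per_row)
        features = {j: rng.gauss(0.0, 1.0) for j in active}
        score = sum(weights[j] * v for j, v in features.items())
        label = 1 if score > 0 else -1
        # Flip the label with probability `noise`.
        if rng.random() < noise:
            label = -label
        yield label, features


def to_svmlight(label, features):
    """Format one row as a line of SVMlight-style sparse text (1-based indices)."""
    body = " ".join(f"{j + 1}:{v:.6f}" for j, v in sorted(features.items()))
    return f"{label} {body}"
```

Writing `to_svmlight(*row)` for each row to a file and raising `n_rows` until the file reaches the target size gives a dataset whose dimensionality, sparsity, and noise level can all be varied independently.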
RCV1 should be almost in that range. I think that the recent KDD music classification exercise would be in that range if viewed as a classification exercise. See http://jmlr.csail.mit.edu/proceedings/papers/v18/xie12a/xie12a.pdf for an example of how this can be done.

On Tue, Aug 28, 2012 at 11:07 AM, Josh Patterson <[email protected]> wrote:

> Does anyone have any great suggestions for open datasets to run/test
> SGD on that are in the 500MB - 1GB range?
>
> Just looking for nice benchmarking datasets, wondered what the
> community thought here.
>
> Thanks,
>
> Josh
>
> --
> Twitter: @jpatanooga
> Principal Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com
