Dataset announcement

2015-04-15 Thread Olivier Chapelle
Dear Spark users,

I would like to draw your attention to a dataset that we recently released,
which is as of now the largest machine learning dataset ever released; see
the following blog announcements:
 - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
 - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx

The characteristics of this dataset are:
 - 1 TB of data
 - binary classification
 - 13 integer features
 - 26 categorical features, some of them taking millions of values.
 - 4B rows
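
As a rough illustration of the characteristics listed above, here is a minimal
Python sketch that splits one record into its label, 13 integer features, and
26 categorical features. The file layout is an assumption, not something stated
in this announcement: tab-separated fields, label first, and an empty field
meaning a missing value. The name parse_row is hypothetical. The last line
derives the average row size implied by the announced 1 TB and 4B rows.

```python
from typing import List, Optional, Tuple

NUM_INT = 13   # integer features, per the announcement
NUM_CAT = 26   # categorical features, per the announcement


def parse_row(line: str) -> Tuple[int, List[Optional[int]], List[Optional[str]]]:
    """Split one record into (label, integer features, categorical features).

    Assumed layout (not stated in the announcement): tab-separated fields,
    label first, then 13 integers, then 26 categorical tokens; an empty
    field means the value is missing.
    """
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])
    # Missing values are kept as None rather than being imputed here.
    ints = [int(f) if f else None for f in fields[1:1 + NUM_INT]]
    cats = [f if f else None for f in fields[1 + NUM_INT:]]
    return label, ints, cats


# Rough average row size implied by the announced figures (1 TB, 4B rows).
AVG_BYTES_PER_ROW = 10**12 / (4 * 10**9)  # 250.0 bytes
```

Keeping the categorical tokens as opaque strings leaves the choice of encoding
(indexing, hashing, or something else) to whatever pipeline consumes the rows.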

Hopefully this dataset will be useful to assess and push further the
scalability of Spark and MLlib.

Cheers,
Olivier



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-tp22507.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Dataset announcement

2015-04-15 Thread Simon Edelhaus
Greetings!

How about medical data sets, specifically longitudinal vital signs?

Can people send good pointers?

Thanks in advance,


-- ttfn
Simon Edelhaus
California 2015





Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this.

Matei




Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work.
Interesting in more ways than one, including training, benchmarking,
and testing new releases.
One quick question: do you plan to make it available as an S3 bucket?

Cheers
k/
