[ https://issues.apache.org/jira/browse/SPARK-10870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Rudenko updated SPARK-10870:
----------------------------------
    Description: 
Very useful dataset for testing pipelines because:
# "Big data" dataset - the original Kaggle competition dataset is 12 GB, but there is also a [1 TB|http://labs.criteo.com/downloads/download-terabyte-click-logs/] dataset with the same schema.
# Sparse models - the categorical features have high cardinality.
# Reproducible results - because the dataset is public, many other distributed machine learning libraries (e.g. [wormhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst], [parameter server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md], [azure ml|https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-process-hive-criteo-walkthrough/#mltasks], etc.) have established baseline benchmarks against which we could compare.

I have some baseline results with custom models (GBDT encoders and a tuned LR) on spark-1.4, and I will build pipelines using public Spark models. The [winning solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] used a GBDT encoder (not available in Spark, but not difficult to build from the GBT implementation in MLlib) + feature hashing + a factorization machine (planned for spark-1.6).

> Criteo Display Advertising Challenge dataset
> --------------------------------------------
>
>                 Key: SPARK-10870
>                 URL: https://issues.apache.org/jira/browse/SPARK-10870
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Peter Rudenko
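The feature-hashing step mentioned above can be illustrated with a minimal, hypothetical sketch in plain Python (no Spark dependency; the field names `C1`..`C3`, the bucket count, and the `hash_features` helper are all illustrative, not part of any Spark API). The hashing trick maps the high-cardinality categorical values into a fixed-size sparse index space:

```python
import hashlib

def hash_features(categorical_values, num_buckets=2**18):
    """Map high-cardinality categorical values to a fixed sparse index
    space via the hashing trick. Each (field, value) pair is hashed and
    bucketed, so the model dimensionality stays bounded regardless of
    how many distinct categorical values appear in the data."""
    indices = set()
    for field, value in categorical_values.items():
        key = f"{field}={value}".encode("utf-8")
        h = int(hashlib.md5(key).hexdigest(), 16)
        indices.add(h % num_buckets)
    return sorted(indices)

# Example Criteo-style row: anonymized categorical fields (hypothetical values).
row = {"C1": "68fd1e64", "C2": "80e26c9b", "C3": "fb936136"}
idx = hash_features(row)
```

A linear model (or factorization machine) would then be trained over the `num_buckets`-dimensional sparse vectors; collisions are tolerated as noise, which is the usual trade-off of the hashing trick.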
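The GBDT-encoder step can likewise be sketched without Spark. In this hypothetical illustration (toy hand-written trees, made-up feature names `I1`/`I2`; a real pipeline would take the trees from a trained GBT model, e.g. MLlib's GradientBoostedTrees), each sample is encoded as the leaf it falls into in every boosted tree, and those leaf IDs become new categorical features for the downstream linear model or FM:

```python
def leaf_index(tree, x):
    """Walk one binary tree (nested dicts) and return the leaf id."""
    node = tree
    while "leaf" not in node:
        feature, threshold = node["split"]
        node = node["left"] if x[feature] <= threshold else node["right"]
    return node["leaf"]

def gbdt_encode(trees, x):
    """Encode a sample as one categorical value per tree: 'tree<i>_leaf<id>'."""
    return [f"tree{i}_leaf{leaf_index(t, x)}" for i, t in enumerate(trees)]

# Two toy "trees" splitting on hypothetical integer features I1 and I2.
trees = [
    {"split": ("I1", 5), "left": {"leaf": 0}, "right": {"leaf": 1}},
    {"split": ("I2", 100), "left": {"leaf": 0}, "right": {"leaf": 1}},
]
features = gbdt_encode(trees, {"I1": 3, "I2": 250})
# features == ["tree0_leaf0", "tree1_leaf1"]
```

The resulting leaf-ID strings would then be fed through the hashing step above before training the final model, which is the structure the winning solution describes.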
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org