Hello,
I am interested in benchmarking Mahout on different hardware/software platforms, and I am looking for real or synthetic datasets, ideally between tens of GBs and a couple of TBs. I am particularly interested in the K-means, (naive) Bayesian Network, and Collaborative Filtering (ALS-WR) implementations.

I found some potentially interesting benchmarks (synthetic and real), but I have never actually tried any of them. I would like to hear recommendations on which one is better to use, in terms of ease of use and validity, or whether there are other alternatives.

(1) BigDataBench from ICT, Chinese Academy of Sciences
    http://prof.ict.ac.cn/BigDataBench/
    It has benchmarks for all three applications that I am interested in (real/synthetic).

(2) HiBench from Intel
    https://github.com/intel-hadoop/HiBench/wiki
    It has data for K-means (synthetic).

(3) SNAP from Stanford
    http://snap.stanford.edu/
    It has data for collaborative filtering (real).

Thank you very much!
Wei
