Hi Burak,

Thank you for the pointer, it really helped. I do have some follow-up questions, though.
After looking at the Big Data Benchmark page <https://amplab.cs.berkeley.edu/benchmark/> (section "Run this benchmark yourself"), I was expecting the following combination of files:

Sets: Uservisits, Rankings, Crawl
Sizes: tiny, 1node, 5nodes
Formats: both text and sequence file

When looking at http://s3.amazonaws.com/big-data-benchmark/, I only see:

sequence-snappy/5nodes/_distcp_logs_44js2v part 0 to 103
sequence-snappy/5nodes/_distcp_logs_nclxhd part 0 to 102
sequence-snappy/5nodes/_distcp_logs_vnuhym part 0 to 24
sequence-snappy/5nodes/crawl part 0 to 743

As "Crawl" is the name of a set I am looking for, I started to download it. Since it was the end of the day and I was going to download it overnight, I just wrote a for loop from 0 to 999 with wget, expecting it to download up to part 7-something and return 404 errors for the rest. When I looked at it this morning, I noticed that everything had completed downloading. The total Crawl set for 5 nodes should be ~30 GB; I am currently at part 1020 with a total set size of 40 GB.

This leads to my (sub)questions. Does anybody know what exactly is still hosted:

- Are the tiny and 1node sets still available?
- Are the Uservisits and Rankings sets still available?
- Why is the crawl set bigger than expected, and how big is it?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-dataset-of-Big-Data-Benchmark-tp9821p9938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
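P.S. For reference, the overnight loop I described was essentially the following. This is a minimal sketch: the zero-padded "part-%05d" object naming under the crawl/ prefix is an assumption on my side, not something confirmed by the bucket listing.

```shell
#!/bin/sh
# Sketch of the overnight download loop: fetch crawl parts 0..999 and
# let requests past the last part fail with 404. The "part-%05d"
# key naming is an assumption.
base="http://s3.amazonaws.com/big-data-benchmark/sequence-snappy/5nodes/crawl"
for i in $(seq 0 999); do
  wget -q "${base}/part-$(printf '%05d' "$i")"
done
```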