Hi all,

As we all know, Spark has set the record for sorting data, as published at https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
Here at our group, we would love to verify these results and compare machines using this benchmark. We've spent quite some time trying to find the TeraSort source code that was used, but cannot find it anywhere. We did find two candidates:

- A version posted by Reynold [1], the author of the blog post above. This version is stuck at "// TODO: Add partition-local (external) sorting using TeraSortRecordOrdering" and only generates data. Ewan noticed that "it didn't appear to be similar to Hadoop TeraSort" [2].
- A version Ewan created on his own after that [3]. With this version, we ran into problems with TeraValidate on datasets above ~10 GB (as mentioned by others at [4]). When examining the raw input and output files, the input data actually appears to be sorted and the output data unsorted, in both cases.

Because of this, we believe we have not yet found the source code that was actually used. I've searched the Spark user forum archives and seen requests from other people, indicating a demand, but did not succeed in finding the actual source code.

My question: could you please make the source code of the TeraSort program that was used available, preferably together with the settings? If not, what are the reasons that this seems to be withheld?

Thanks for any help,

Tom Hubregtsen

[1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
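P.S. For concreteness, the kind of sortedness check we mean when examining raw files is roughly the following sketch. It assumes the standard TeraGen/TeraSort layout (100-byte records, with the first 10 bytes as the key, compared as unsigned bytes); the helper names are our own, not taken from any of the repositories linked above:

```python
# Sketch of a sortedness check for TeraSort-format data.
# Assumption: 100-byte records whose first 10 bytes are the key,
# ordered by unsigned byte comparison (memcmp order).

RECORD_LEN = 100
KEY_LEN = 10

def keys(data: bytes):
    """Yield the 10-byte key of each 100-byte record."""
    for off in range(0, len(data), RECORD_LEN):
        yield data[off:off + KEY_LEN]

def is_sorted(data: bytes) -> bool:
    """Return True if records appear in non-decreasing key order."""
    prev = None
    for k in keys(data):
        # Python compares bytes objects as unsigned bytes, matching memcmp.
        if prev is not None and k < prev:
            return False
        prev = k
    return True
```

Running this over the output files of [3] is what led us to the observation above that the output does not appear to be sorted.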