+1. I would love to have the code for this as well.

Pramod

On Fri, Apr 3, 2015 at 12:47 PM, Tom <thubregt...@gmail.com> wrote:

> Hi all,
>
> As we all know, Spark has set the record for sorting data, as published on:
> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
>
> Here at our group, we would love to verify these results and compare
> machines using this benchmark. We've spent quite some time trying to find
> the TeraSort source code that was used, but cannot find it anywhere.
>
> We did find two candidates:
>
> A version posted by Reynold [1], the poster of the message above. This
> version is stuck at "// TODO: Add partition-local (external) sorting
> using TeraSortRecordOrdering", only generating data.
>
> Here, Ewan noticed that "it didn't appear to be similar to Hadoop
> TeraSort." [2] After this he created a version of his own [3]. With this
> version, we noticed problems with TeraValidate with datasets above ~10G
> (as mentioned by others at [4]). When examining the raw input and output
> files, it actually appears that the input data is sorted and the output
> data unsorted in both cases.
>
> Because of this, we believe we have not yet found the source code that
> was actually used. I've searched the Spark User forum archives and saw
> requests from other people, indicating a demand, but did not succeed in
> finding the actual source code.
>
> My question:
> Could you guys please make the source code of the TeraSort program that
> was used, preferably with settings, available? If not, what are the
> reasons that this seems to be withheld?
>
> Thanks for any help,
>
> Tom Hubregtsen
>
> [1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
> [2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
> [3] https://github.com/ehiggs/spark-terasort
> [4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E