+1. I would love to have the code for this as well.

Pramod

On Fri, Apr 3, 2015 at 12:47 PM, Tom <thubregt...@gmail.com> wrote:

> Hi all,
>
> As we all know, Spark has set the record for sorting data, as published on:
> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
>
> Here at our group, we would love to verify these results and compare
> machines using this benchmark. We've spent quite some time trying to find
> the TeraSort source code that was used, but cannot find it anywhere.
>
> We did find two candidates:
>
> A version posted by Reynold [1], the author of the post above. This
> version is stuck at "    // TODO: Add partition-local (external) sorting
> using TeraSortRecordOrdering", only generating data.
>
> Here, Ewan noticed that "it didn't appear to be similar to Hadoop
> TeraSort" [2]. After this he created a version of his own [3]. With this
> version, we noticed problems with TeraValidate on datasets above ~10G (as
> mentioned by others at [4]). When examining the raw input and output
> files, it actually appears that the input data is sorted and the output
> data unsorted in both cases.
>
> Because of this, we believe we have not yet found the source code that was
> actually used. I've searched the Spark user forum archives and seen
> requests from other people, indicating a demand, but did not succeed in
> finding the actual source code.
>
> My question:
> Could you please make the source code of the TeraSort program that was
> used, preferably with its settings, available? If not, what are the
> reasons that this seems to be withheld?
>
> Thanks for any help,
>
> Tom Hubregtsen
>
> [1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
> [2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
> [3] https://github.com/ehiggs/spark-terasort
> [4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.