Re: Spark TeraSort source request

Ewan Higgs Sun, 12 Apr 2015 04:59:47 -0700

Hi all.
The code is linked from my repo:

https://github.com/ehiggs/spark-terasort
"This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch <https://github.com/rxin/spark/tree/terasort>, but it is not the same TeraSort program that currently holds the record <http://sortbenchmark.org/>. That program is here <https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort>."


"That program is here" links to:
https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort

I've been working on other projects at the moment so I haven't returned to the spark-terasort stuff. If you have any pull requests, I would be very grateful.


Yours,
Ewan

On 08/04/15 03:26, Pramod Biligiri wrote:

+1. I would love to have the code for this as well.

Pramod

On Fri, Apr 3, 2015 at 12:47 PM, Tom <thubregt...@gmail.com> wrote:


    Hi all,

    As we all know, Spark has set the record for sorting data, as
    published on:
    https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.

    Here at our group, we would love to verify these results, and compare
    machines using this benchmark. We've spent quite some time trying to
    find the TeraSort source code that was used, but cannot find it
    anywhere.

    We did find two candidates:

    A version posted by Reynold [1], the poster of the message above. This
    version is stuck at "// TODO: Add partition-local (external) sorting
    using TeraSortRecordOrdering" and only generates data.

    Here, Ewan noticed that "it didn't appear to be similar to Hadoop
    TeraSort." [2] After this he created a version of his own [3]. With
    this version, we noticed problems with TeraValidate on datasets above
    ~10G (as mentioned by others at [4]). When examining the raw input and
    output files, it actually appears that the input data is sorted and
    the output data unsorted in both cases.

    Because of this, we believe we have not yet found the source code that
    was actually used. I've searched the Spark User forum archives and saw
    requests from other people, indicating a demand, but did not succeed
    in finding the actual source code.

    My question:
    Could you guys please make the source code of the TeraSort program
    that was used, preferably with settings, available? If not, what are
    the reasons that this seems to be withheld?

    Thanks for any help,

    Tom Hubregtsen

    [1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
    [2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
    [3] https://github.com/ehiggs/spark-terasort
    [4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E



    --
    View this message in context:
    http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
    Sent from the Apache Spark User List mailing list archive at Nabble.com.

