Re: Spark TeraSort source request
Tom,

According to GitHub's public activity log, Reynold Xin (in CC) deleted his sort-benchmark branch yesterday. I didn't have a local copy aside from the Daytona Partitioner (attached).

Reynold, is it possible to reinstate your branch?

-Ewan

On 13/04/15 16:41, Tom Hubregtsen wrote:

Thank you for your response Ewan. I quickly looked yesterday and it was there, but today at work I tried to open it again to start working on it, but it appears to be removed. Is this correct?

Thanks,
Tom
Re: Spark TeraSort source request
Thank you for your response, Ewan. I took a quick look yesterday and it was there, but when I tried to open it again at work today to start working on it, it appeared to have been removed. Is this correct?

Thanks,
Tom

On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote:

Hi all.

The code is linked from my repo: https://github.com/ehiggs/spark-terasort

This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch https://github.com/rxin/spark/tree/terasort, but it is not the same TeraSort program that currently holds the record http://sortbenchmark.org/. That program is here: https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort

I've been working on other projects at the moment so I haven't returned to the spark-terasort stuff. If you have any pull requests, I would be very grateful.

Yours,
Ewan
Re: Spark TeraSort source request
Hi all.

The code is linked from my repo: https://github.com/ehiggs/spark-terasort

This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch https://github.com/rxin/spark/tree/terasort, but it is not the same TeraSort program that currently holds the record http://sortbenchmark.org/. That program is here: https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort

I've been working on other projects lately, so I haven't returned to the spark-terasort stuff. If you have any pull requests, I would be very grateful.

Yours,
Ewan

On 08/04/15 03:26, Pramod Biligiri wrote:

+1. I would love to have the code for this as well.

Pramod
Re: Spark TeraSort source request
+1. I would love to have the code for this as well.

Pramod

On Fri, Apr 3, 2015 at 12:47 PM, Tom thubregt...@gmail.com wrote:

Hi all,

As we all know, Spark has set the record for sorting data, as published at https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html. Here in our group, we would love to verify these results and compare machines using this benchmark. We've spent quite some time trying to find the TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates:

A version posted by Reynold [1], the poster of the message above. This version is stuck at "// TODO: Add partition-local (external) sorting using TeraSortRecordOrdering" and only generates data. Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort [2].

After this he created a version of his own [3]. With this version, we noticed problems with TeraValidate on datasets above ~10G (as mentioned by others at [4]).

When examining the raw input and output files, it actually appears that the input data is sorted and the output data unsorted in both cases. Because of this, we believe we have not yet found the source code that was actually used. I've searched the Spark User forum archives and seen requests from other people, indicating a demand, but did not succeed in finding the actual source code.

My question: Could you guys please make the source code of the TeraSort program that was used, preferably with settings, available? If not, what are the reasons that this seems to be withheld?

Thanks for any help,
Tom Hubregtsen

[1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E
Spark TeraSort source request
Hi all,

As we all know, Spark has set the record for sorting data, as published at https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html. Here in our group, we would love to verify these results and compare machines using this benchmark. We've spent quite some time trying to find the TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates:

A version posted by Reynold [1], the poster of the message above. This version is stuck at "// TODO: Add partition-local (external) sorting using TeraSortRecordOrdering" and only generates data. Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort [2].

After this he created a version of his own [3]. With this version, we noticed problems with TeraValidate on datasets above ~10G (as mentioned by others at [4]).

When examining the raw input and output files, it actually appears that the input data is sorted and the output data unsorted in both cases. Because of this, we believe we have not yet found the source code that was actually used. I've searched the Spark User forum archives and seen requests from other people, indicating a demand, but did not succeed in finding the actual source code.

My question: Could you guys please make the source code of the TeraSort program that was used, preferably with settings, available? If not, what are the reasons that this seems to be withheld?

Thanks for any help,
Tom Hubregtsen

[1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
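
For anyone wanting to repeat the raw-file check described in the original message (input appearing sorted, output appearing unsorted), a minimal sketch follows. It assumes the usual sort-benchmark record layout of 100-byte records whose first 10 bytes are the key, compared as unsigned bytes; the function names are hypothetical and are not taken from any of the repositories linked in this thread.

```python
from typing import BinaryIO, Optional

RECORD_LEN = 100  # assumption: fixed 100-byte records
KEY_LEN = 10      # assumption: the first 10 bytes of each record are the key


def first_unsorted_record(stream: BinaryIO) -> Optional[int]:
    """Return the index of the first record whose key is smaller than the
    previous record's key, or None if keys are in non-decreasing order."""
    prev_key = None
    index = 0
    while True:
        record = stream.read(RECORD_LEN)
        if not record:
            return None  # reached end of data without finding a violation
        if len(record) != RECORD_LEN:
            raise ValueError(f"truncated record at index {index}")
        key = record[:KEY_LEN]
        # Python compares bytes objects lexicographically by unsigned value.
        if prev_key is not None and key < prev_key:
            return index
        prev_key = key
        index += 1


def check_file(path: str) -> Optional[int]:
    """Run the check on one local file (hypothetical helper)."""
    with open(path, "rb") as f:
        return first_unsorted_record(f)
```

This only checks the order within a single file. A full validation would also need to confirm that the last key of each part file does not exceed the first key of the next one, so a clean result here does not by itself show that the whole output is globally sorted.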