Re: Spark TeraSort source request

2015-04-13 Thread Ewan Higgs

Tom,
According to GitHub's public activity log, Reynold Xin (in CC) deleted 
his sort-benchmark branch yesterday. I didn't have a local copy aside 
from the Daytona Partitioner (attached).


Reynold, is it possible to reinstate your branch?

-Ewan


Re: Spark TeraSort source request

2015-04-13 Thread Tom Hubregtsen
Thank you for your response, Ewan. I had a quick look yesterday and it was
there, but when I tried to open it again at work today to start working on
it, it appeared to have been removed. Is this correct?

Thanks,

Tom

Re: Spark TeraSort source request

2015-04-12 Thread Ewan Higgs

Hi all.
The code is linked from my repo:

https://github.com/ehiggs/spark-terasort

This is an example Spark program for running TeraSort benchmarks. It is 
based on work from Reynold Xin's branch 
(https://github.com/rxin/spark/tree/terasort), but it is not the same 
TeraSort program that currently holds the record 
(http://sortbenchmark.org/). That program is here:

https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort

I've been working on other projects at the moment so I haven't returned 
to the spark-terasort stuff. If you have any pull requests, I would be 
very grateful.


Yours,
Ewan

Re: Spark TeraSort source request

2015-04-07 Thread Pramod Biligiri
+1. I would love to have the code for this as well.

Pramod

Spark TeraSort source request

2015-04-03 Thread Tom
Hi all,

As we all know, Spark has set the record for sorting data, as published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.

Here at our group, we would love to verify these results and compare
machines using this benchmark. We've spent quite some time trying to find
the TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates: 

A version posted by Reynold [1], the author of the blog post above. This
version stops at "// TODO: Add partition-local (external) sorting
using TeraSortRecordOrdering", so it only generates data.

Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort
[2]. After this he created a version of his own [3]. With this version, we
noticed problems with TeraValidate on datasets above ~10G (as mentioned by
others at [4]). When examining the raw input and output files, it actually
appears that the input data is sorted and the output data unsorted in both
cases.
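
One way to check an observation like this independently of TeraValidate is
to read the raw files directly. The following is a minimal Python sketch,
not the code used by anyone in this thread. It assumes the TeraGen record
layout (100-byte records whose first 10 bytes are the key) and compares keys
as unsigned bytes. The function names and the make_record helper are
hypothetical and exist only for illustration. It checks that every partition
is internally sorted and that each partition starts at or after the last key
of the previous one.

```python
from typing import Iterable, List, Optional, Tuple

RECORD_LEN = 100  # assumed TeraGen-style record size in bytes
KEY_LEN = 10      # assumed number of leading bytes that form the sort key


def make_record(key: bytes) -> bytes:
    """Hypothetical helper: pad a 10-byte key to a full 100-byte record."""
    if len(key) != KEY_LEN:
        raise ValueError("key must be exactly KEY_LEN bytes")
    return key + b"." * (RECORD_LEN - KEY_LEN)


def read_keys(data: bytes) -> List[bytes]:
    """Split a raw buffer into fixed-size records and return their keys."""
    if len(data) % RECORD_LEN != 0:
        raise ValueError("buffer length is not a multiple of the record size")
    return [data[i:i + KEY_LEN] for i in range(0, len(data), RECORD_LEN)]


def first_unsorted_index(keys: List[bytes]) -> Optional[int]:
    """Return the index of the first key smaller than its predecessor."""
    for i in range(1, len(keys)):
        # Python compares bytes lexicographically by unsigned byte value.
        if keys[i] < keys[i - 1]:
            return i
    return None


def check_partitions(partitions: Iterable[bytes]) -> Tuple[bool, str]:
    """Check each partition is sorted and partition boundaries are ordered."""
    previous_last: Optional[bytes] = None
    for number, data in enumerate(partitions):
        keys = read_keys(data)
        bad = first_unsorted_index(keys)
        if bad is not None:
            return False, f"partition {number}: record {bad} is out of order"
        if keys:
            if previous_last is not None and keys[0] < previous_last:
                return False, f"partition {number}: starts before previous partition ends"
            previous_last = keys[-1]
    return True, "all partitions sorted"
```

Each element passed to check_partitions would be the full contents of one
output file, in partition order. A comparator that treats key bytes as
signed would place 0x80 before 0x7f, so a check like this is only meaningful
if it uses the same unsigned ordering as the sort being verified.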

Because of this, we believe we have not yet found the source code that was
actually used. I've searched the Spark user list archives and saw requests
from other people, which indicates demand, but did not succeed in finding
the actual source code.

My question:
Could you guys please make the source code of the TeraSort program that was
used, preferably with its settings, available? If not, what are the reasons
that this seems to be withheld?

Thanks for any help,

Tom Hubregtsen 

[1]
https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2]
http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4]
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org