Tom,
According to GitHub's public activity log, Reynold Xin (in CC) deleted
his sort-benchmark branch yesterday. I didn't have a local copy aside
from the Daytona Partitioner (attached).
Reynold, is it possible to reinstate your branch?
-Ewan
On 13/04/15 16:41, Tom Hubregtsen wrote:
Thank you for your response, Ewan. I had a quick look yesterday and the
branch was there, but today at work, when I tried to open it again to
start working on it, it appears to have been removed. Is that right?
Thanks,
Tom
On 12 April 2015 at 06:58, Ewan Higgs <ewan.hi...@ugent.be> wrote:
Hi all.
The code is linked from my repo:
https://github.com/ehiggs/spark-terasort
"
This is an example Spark program for running TeraSort benchmarks.
It is based on work from Reynold Xin's branch
<https://github.com/rxin/spark/tree/terasort>, but it is not the
same TeraSort program that currently holds the record
<http://sortbenchmark.org/>. That program is here
<https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort>.
"
"That program is here" links to:
https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort
I've been busy with other projects, so I haven't returned to the
spark-terasort work recently. If you have any pull requests, I would
be very grateful.
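
For anyone who just wants the shape of the thing, a TeraSort-style
Spark job boils down to "read the records, sort by key under an
unsigned byte-wise ordering, write them back out". Here is a minimal
sketch; it is not the actual spark-terasort source, and the object
name and the sequence-file I/O are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object TeraSortSketch {
  // TeraSort order: keys compare as unsigned bytes, lexicographically.
  implicit val unsignedOrdering: Ordering[Array[Byte]] =
    new Ordering[Array[Byte]] {
      override def compare(a: Array[Byte], b: Array[Byte]): Int = {
        var i = 0
        val n = math.min(a.length, b.length)
        while (i < n) {
          val cmp = (a(i) & 0xff) - (b(i) & 0xff)
          if (cmp != 0) return cmp
          i += 1
        }
        a.length - b.length
      }
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TeraSortSketch"))
    // Assumes (key, value) byte-array pairs; the real programs use a
    // Hadoop input format to parse the 100-byte TeraSort records.
    val records = sc.sequenceFile[Array[Byte], Array[Byte]](args(0))
    records.sortByKey().saveAsSequenceFile(args(1))
    sc.stop()
  }
}

A record-aware input format and a custom range partitioner are where
the actual engineering lives; sortByKey is just the outline.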
Yours,
Ewan
On 08/04/15 03:26, Pramod Biligiri wrote:
+1. I would love to have the code for this as well.
Pramod
On Fri, Apr 3, 2015 at 12:47 PM, Tom <thubregt...@gmail.com> wrote:
Hi all,
As we all know, Spark has set the record for sorting data, as
published at https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Here at our group, we would love to verify these results and to
compare machines using this benchmark. We've spent quite some time
trying to find the TeraSort source code that was used, but cannot
find it anywhere.
We did find two candidates:

1. A version posted by Reynold [1], the poster of the message above.
This version is stuck at "// TODO: Add partition-local (external)
sorting using TeraSortRecordOrdering" and only generates data. Ewan
noticed that it "didn't appear to be similar to Hadoop TeraSort" [2].

2. The version Ewan subsequently created on his own [3]. With this
version we ran into problems with TeraValidate on datasets above ~10G
(as mentioned by others at [4]). When examining the raw input and
output files, the input data actually appears to be sorted and the
output data unsorted, in both cases; a sketch of such a sortedness
check follows below.
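
To make "examining the raw input and output files" concrete, here is
a rough sketch of the kind of check involved, assuming the standard
100-byte TeraSort records (10-byte key, 90-byte payload); the
CheckSorted name is illustrative only:

import java.io.{BufferedInputStream, FileInputStream, InputStream}

object CheckSorted {
  // Fill buf completely; false on clean EOF (a trailing partial
  // record is ignored for brevity).
  private def readFully(in: InputStream, buf: Array[Byte]): Boolean = {
    var off = 0
    while (off < buf.length) {
      val n = in.read(buf, off, buf.length - off)
      if (n < 0) return false
      off += n
    }
    true
  }

  // Compare the leading 10-byte keys as unsigned bytes.
  private def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < 10) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff)
      if (cmp != 0) return cmp
      i += 1
    }
    0
  }

  def main(args: Array[String]): Unit = {
    val in = new BufferedInputStream(new FileInputStream(args(0)))
    val record = new Array[Byte](100) // 10-byte key + 90-byte payload
    val prev = new Array[Byte](10)
    var count = 0L
    var sorted = true
    try {
      while (sorted && readFully(in, record)) {
        if (count > 0 && compareKeys(prev, record) > 0) sorted = false
        else {
          System.arraycopy(record, 0, prev, 0, 10)
          count += 1
        }
      }
    } finally in.close()
    println(if (sorted) s"$count records, keys non-decreasing"
            else s"order violation at record $count")
  }
}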
Because of this, we believe we have not yet found the source code
that was actually used. I've searched the Spark user forum archives
and seen requests from other people, indicating demand, but I did not
succeed in finding the actual source code.
My question: could you please make the source code of the TeraSort
program that was used available, preferably along with the settings?
If not, what are the reasons that it seems to be withheld?
Thanks for any help,
Tom Hubregtsen
[1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.github.ehiggs.spark.terasort
import org.apache.spark.Partitioner
/**
 * Range partitioner for keys represented as two Longs (hi, lo).
 * rangeBounds holds the flattened partition boundaries, two Longs per
 * boundary. The partitioner is stateful: setKeys() must be called
 * before the first lookup, and keys must then be looked up in
 * non-decreasing order, since the cursor below only ever advances.
 */
final class DaytonaPartitioner(rangeBounds: Array[Long]) extends Partitioner {
  private[this] var currentPart: Int = 0
  private[this] var currentHiKey: Long = 0L
  private[this] var currentLoKey: Long = 0L
  private[this] val lastPart: Int = rangeBounds.length / 2

  /** Reset the cursor to the first boundary. */
  def setKeys(): Unit = {
    currentPart = 0
    currentHiKey = rangeBounds(0)
    currentLoKey = rangeBounds(1)
  }

  override def numPartitions: Int = rangeBounds.length / 2 + 1

  // Only the specialized two-Long path below is ever called; the
  // generic entry point is deliberately unimplemented.
  override def getPartition(key: Any): Int = ???

  def getPartitionSpecialized(key1: Long, key2: Long): Int = {
    while (currentPart <= lastPart) {
      if (currentPart == lastPart) {
        // Past the last boundary: everything lands in the final partition.
        return lastPart
      } else {
        val c1 = java.lang.Long.compare(key1, currentHiKey)
        if (c1 < 0) {
          return currentPart
        } else if (c1 == 0) {
          val c2 = java.lang.Long.compare(key2, currentLoKey)
          if (c2 <= 0) {
            // Boundaries are inclusive: key == bound stays in this partition.
            return currentPart
          }
        }
      }
      // Key is beyond the current boundary: advance the cursor.
      currentPart += 1
      if (currentPart < lastPart) {
        currentHiKey = rangeBounds(currentPart * 2)
        currentLoKey = rangeBounds(currentPart * 2 + 1)
      }
    }
    assert(false, s"key ($key1, $key2) fell outside all range bounds")
    currentPart
  }
}
/**
 * Variant of DaytonaPartitioner for skewed inputs, where a heavily
 * repeated key can appear as several identical consecutive boundaries.
 * The previous key's partition is cached so runs of identical keys
 * resolve without rescanning, and duplicated boundaries are stepped
 * across explicitly. The same caveats apply: call setKeys() first and
 * look keys up in non-decreasing order.
 */
final class DaytonaPartitionerSkew(rangeBounds: Array[Long]) extends Partitioner {
  // Declared in the original source but never read.
  private[this] val inclusive = new Array[Boolean](rangeBounds.length / 2)
  private[this] var currentPart: Int = 0
  private[this] var prevHiKey: Long = -1L
  private[this] var prevLoKey: Long = -1L
  private[this] var currentBoundHiKey: Long = 0L
  private[this] var currentBoundLoKey: Long = 0L
  private[this] val lastPart: Int = rangeBounds.length / 2

  /** Reset the cursor to the first boundary. */
  def setKeys(): Unit = {
    currentPart = 0
    currentBoundHiKey = rangeBounds(0)
    currentBoundLoKey = rangeBounds(1)
  }

  override def numPartitions: Int = rangeBounds.length / 2 + 1

  // As above, only the specialized path is used.
  override def getPartition(key: Any): Int = ???

  def getPartitionSpecialized(key1: Long, key2: Long): Int = {
    if (key1 == prevHiKey && key2 == prevLoKey) {
      // Same key as the previous lookup: reuse the cached partition.
      currentPart
    } else {
      prevHiKey = key1
      prevLoKey = key2
      while (currentPart <= lastPart) {
        if (currentPart == lastPart) {
          return lastPart
        } else {
          val c1 = java.lang.Long.compare(key1, currentBoundHiKey)
          if (c1 < 0) {
            return currentPart
          } else if (c1 == 0) {
            val c2 = java.lang.Long.compare(key2, currentBoundLoKey)
            if (c2 < 0) {
              return currentPart
            } else if (c2 == 0 && currentPart + 1 < lastPart) {
              // Key equals this boundary exactly. If the next boundary is
              // the same (hi, lo) pair (duplicated because of skew), step
              // into the next partition and send the key there. The extra
              // condition keeps the lookahead inside rangeBounds.
              val nextBoundHiKey = rangeBounds((currentPart + 1) * 2)
              val nextBoundLoKey = rangeBounds((currentPart + 1) * 2 + 1)
              if (nextBoundHiKey == currentBoundHiKey && nextBoundLoKey == currentBoundLoKey) {
                currentPart += 1
                currentBoundHiKey = rangeBounds(currentPart * 2)
                currentBoundLoKey = rangeBounds(currentPart * 2 + 1)
                return currentPart
              }
            }
          }
        }
        currentPart += 1
        if (currentPart < lastPart) {
          currentBoundHiKey = rangeBounds(currentPart * 2)
          currentBoundLoKey = rangeBounds(currentPart * 2 + 1)
        }
      }
      assert(false, s"key ($key1, $key2) fell outside all range bounds")
      currentPart
    }
  }
}
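
A quick usage note on the attachment: the partitioners are stateful
and assume keys arrive in non-decreasing order, since the internal
cursor only ever advances. A minimal sketch of how the basic
partitioner is driven (the demo object and the boundary values are
illustrative, not part of the original code):

object DaytonaPartitionerDemo {
  def main(args: Array[String]): Unit = {
    // Two (hi, lo) boundaries flattened into one array -> 3 partitions.
    val bounds = Array[Long](100L, 50L, 200L, 10L)
    val partitioner = new DaytonaPartitioner(bounds)
    partitioner.setKeys() // reset the cursor before the first lookup

    println(partitioner.getPartitionSpecialized(50L, 0L))   // 0: below (100, 50)
    println(partitioner.getPartitionSpecialized(100L, 50L)) // 0: bound is inclusive
    println(partitioner.getPartitionSpecialized(150L, 0L))  // 1: between the bounds
    println(partitioner.getPartitionSpecialized(999L, 0L))  // 2: final partition
  }
}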
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org