Could this be a build issue (i.e., sbt package)?

  If I run the same jar built for 1.4.1 on 1.5, I am seeing a large
regression in queries too (all other things identical)...

  I am curious: to build against 1.5 (since it isn't released yet), what do
I need to do in the build.sbt file?

  Are there any special parameters I should be using to make sure I load the
latest Hive dependencies?
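
  (A minimal build.sbt sketch for compiling against a pre-release 1.5; the
snapshot version string and the Apache snapshots resolver are assumptions,
not a confirmed recipe:)

  // build.sbt -- hypothetical setup for a not-yet-released Spark 1.5
  scalaVersion := "2.10.4"

  resolvers += "Apache Snapshots" at "https://repository.apache.org/snapshots/"

  libraryDependencies ++= Seq(
    // "provided" keeps Spark itself out of the application assembly jar
    "org.apache.spark" %% "spark-sql"  % "1.5.0-SNAPSHOT" % "provided",
    // spark-hive pulls in the Hive dependencies mentioned above
    "org.apache.spark" %% "spark-hive" % "1.5.0-SNAPSHOT" % "provided"
  )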



From:   Michael Armbrust <mich...@databricks.com>
To:     Todd <bit1...@163.com>
Cc:     "user@spark.apache.org" <user@spark.apache.org>
Date:   09/10/2015 11:07 AM
Subject:        Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL



I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3,
so this is surprising.  In my experiments Spark 1.5 is either the same as or
faster than 1.4, with only small exceptions.  A few thoughts:

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever
reporting performance changes; a sketch of gathering both follows this list.
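
(A quick sketch in the spark-shell; sqlContext and the sql string are from
the snippet quoted below, and the partition count of 200 is just an
illustrative guess, not a recommendation:)

  // Try a shuffle partition count sized to ~6G of data instead of 600
  sqlContext.setConf("spark.sql.shuffle.partitions", "200")

  // Print the physical plan; extended = true also shows the analyzed
  // and optimized logical plans
  sqlContext.sql(sql).explain(true)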

On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:
  Hi,

  I am using data generated with spark-sql-perf (
  https://github.com/databricks/spark-sql-perf) to test Spark SQL
  performance (Spark on YARN, with 10 nodes) with the following code. (The
  table store_sales is about 90 million records, 6G in size.)

  val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
  val name = "store_sales"

  sqlContext.sql(
    s"""
       |CREATE TEMPORARY TABLE ${name}
       |USING org.apache.spark.sql.parquet
       |OPTIONS (
       |  path '${outputDir}'
       |)
     """.stripMargin)

  val sql = """
             |select
             |  t1.ss_quantity,
             |  t1.ss_list_price,
             |  t1.ss_coupon_amt,
             |  t1.ss_cdemo_sk,
             |  t1.ss_item_sk,
             |  t1.ss_promo_sk,
             |  t1.ss_sold_date_sk
             |from store_sales t1 join store_sales t2
             |  on t1.ss_item_sk = t2.ss_item_sk
             |where
             |  t1.ss_sold_date_sk between 2450815 and 2451179
           """.stripMargin

  val df = sqlContext.sql(sql)
  // Force full evaluation of the query without collecting results to the driver
  df.rdd.foreach(_ => ())
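
  (For an apples-to-apples comparison of the two versions, the run can be
  timed with a plain wall-clock wrapper; the helper below is hypothetical,
  plain Scala, and not part of the spark-sql-perf harness:)

  // Hypothetical timing helper; nothing Spark-specific assumed
  def timed[T](body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  timed { df.rdd.foreach(_ => ()) }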

  With 1.4.1, I can finish the query in 6 minutes, but I need 10+ minutes
  with 1.5.

  The configurations are basically the same, since I copied the
  configuration from 1.4.1 to 1.5:

  Setting                                         1.4.1    1.5.0
  ----------------------------------------------  -------  -------
  scaleFactor                                     30       30
  spark.sql.shuffle.partitions                    600      600
  spark.sql.sources.partitionDiscovery.enabled    true     true
  spark.default.parallelism                       200      200
  spark.driver.memory                             4G       4G
  spark.executor.memory                           4G       4G
  spark.executor.instances                        10       10
  spark.shuffle.consolidateFiles                  true     true
  spark.storage.memoryFraction                    0.4      0.4
  spark.executor.cores                            3        3
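
  (One way to rule out configuration drift between the two deployments is to
  dump the effective settings at runtime and diff the output; a sketch using
  the standard SparkConf API:)

  // Print the configuration as the running job actually sees it, sorted
  // so the 1.4.1 and 1.5.0 outputs can be diffed line by line
  sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }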

  I am not sure where it is going wrong. Any ideas?

