Hi,

I am using data generated with spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark SQL performance (Spark on YARN, with 10 nodes) with the following code. The table store_sales has about 90 million records and is about 6 GB in size.
 
val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name = "store_sales"

sqlContext.sql(
  s"""
     |CREATE TEMPORARY TABLE ${name}
     |USING org.apache.spark.sql.parquet
     |OPTIONS (
     |  path '${outputDir}'
     |)
   """.stripMargin)
        
val sql = """
           |select
           |  t1.ss_quantity,
           |  t1.ss_list_price,
           |  t1.ss_coupon_amt,
           |  t1.ss_cdemo_sk,
           |  t1.ss_item_sk,
           |  t1.ss_promo_sk,
           |  t1.ss_sold_date_sk
           |from store_sales t1 join store_sales t2 on t1.ss_item_sk = t2.ss_item_sk
           |where
           |  t1.ss_sold_date_sk between 2450815 and 2451179
           """.stripMargin
        
val df = sqlContext.sql(sql)
df.rdd.foreach(_ => ())  // force full evaluation without collecting results to the driver
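
(If it helps, the plans produced by the two versions could be compared and the run timed with something like the following rough sketch, using the same df:)

// Print the parsed/analyzed/optimized/physical plans to compare 1.4.1 vs 1.5.0
df.explain(true)

// Time the full evaluation of the query
val start = System.nanoTime()
df.rdd.foreach(_ => ())
println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")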

With 1.4.1 I can finish the query in about 6 minutes, but with 1.5.0 I need more than 10 minutes.

The configurations are basically the same, since I copied the configuration from 1.4.1 to 1.5:

sparkVersion                                     1.4.1    1.5.0
scaleFactor                                      30       30
spark.sql.shuffle.partitions                     600      600
spark.sql.sources.partitionDiscovery.enabled     true     true
spark.default.parallelism                        200      200
spark.driver.memory                              4G       4G
spark.executor.memory                            4G       4G
spark.executor.instances                         10       10
spark.shuffle.consolidateFiles                   true     true
spark.storage.memoryFraction                     0.4      0.4
spark.executor.cores                             3        3
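
(In case the listing above is hard to read, this is roughly how the same settings would look when set programmatically on a SparkConf; a sketch only, in my runs they actually come from the spark-defaults.conf copied from 1.4.1:)

import org.apache.spark.SparkConf

// Sketch of the settings above, identical for both versions
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "600")
  .set("spark.sql.sources.partitionDiscovery.enabled", "true")
  .set("spark.default.parallelism", "200")
  .set("spark.driver.memory", "4G")
  .set("spark.executor.memory", "4G")
  .set("spark.executor.instances", "10")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.storage.memoryFraction", "0.4")
  .set("spark.executor.cores", "3")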

I am not sure what is going wrong. Any ideas?

