I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so this is surprising. In my experiments, Spark 1.5 is either the same as or faster than 1.4, with only small exceptions. A few thoughts:
- 600 partitions is probably way too many for 6G of data.
- Providing the output of explain for both runs would be helpful whenever
reporting performance changes. A sketch of both suggestions follows below
the quoted mail.

On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:

> Hi,
>
> I am using data generated with spark-sql-perf
> (https://github.com/databricks/spark-sql-perf) to test Spark SQL
> performance (Spark on YARN, with 10 nodes) with the following code.
> The table store_sales has about 90 million records and is 6G in size.
>
> val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
> val name = "store_sales"
> sqlContext.sql(
>   s"""
>     |CREATE TEMPORARY TABLE ${name}
>     |USING org.apache.spark.sql.parquet
>     |OPTIONS (
>     |  path '${outputDir}'
>     |)
>   """.stripMargin)
>
> val sql = """
>   |select
>   |  t1.ss_quantity,
>   |  t1.ss_list_price,
>   |  t1.ss_coupon_amt,
>   |  t1.ss_cdemo_sk,
>   |  t1.ss_item_sk,
>   |  t1.ss_promo_sk,
>   |  t1.ss_sold_date_sk
>   |from store_sales t1 join store_sales t2 on t1.ss_item_sk = t2.ss_item_sk
>   |where
>   |  t1.ss_sold_date_sk between 2450815 and 2451179
> """.stripMargin
>
> val df = sqlContext.sql(sql)
> df.rdd.foreach(row => ())
>
> With 1.4.1 I can finish the query in 6 minutes, but I need 10+ minutes
> with 1.5.
>
> The configurations are basically the same, since I copied the
> configuration from 1.4.1 to 1.5:
>
> sparkVersion                                    1.4.1   1.5.0
> scaleFactor                                     30      30
> spark.sql.shuffle.partitions                    600     600
> spark.sql.sources.partitionDiscovery.enabled    true    true
> spark.default.parallelism                       200     200
> spark.driver.memory                             4G      4G
> spark.executor.memory                           4G      4G
> spark.executor.instances                        10      10
> spark.shuffle.consolidateFiles                  true    true
> spark.storage.memoryFraction                    0.4     0.4
> spark.executor.cores                            3       3
>
> I am not sure where it is going wrong. Any ideas?
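
For concreteness, here is a minimal sketch of both suggestions against the
setup quoted above. The sqlContext, table, and query string (sql) are taken
from the quoted mail; the partition count of 100 is only an illustrative
guess for ~6G of data, not a tuned value:

  // Lower the shuffle partition count before running the join; 100 is an
  // illustrative guess for ~6G of data, not a tuned value.
  sqlContext.setConf("spark.sql.shuffle.partitions", "100")

  // Same self-join query as in the quoted mail.
  val df = sqlContext.sql(sql)

  // Print the logical and physical plans to stdout; capture this on both
  // 1.4.1 and 1.5.0 and include the diff when reporting the regression.
  df.explain(true)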