I do not know your data, but it looks like you have too many partitions for such a small data set.
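
A quick way to check is the partition count of the cached DataFrame:

print(data.rdd.getNumPartitions())

Your log points the same way: a block id like rdd_13136_2856 means partition index 2856, so there are thousands of partitions, and at ~2.5 MB per block each task does very little work. Scheduling overhead is then likely what dominates the 3 minutes. A minimal sketch of coalescing before caching (the target of 200 partitions is my guess, not something from your setup; aim for partitions of very roughly 100 MB):

data = sqlContext.sql("SELECT * FROM raw WHERE dt_year=2015 OR dt_year=2016")
# coalesce() merges partitions without a full shuffle; 200 is an
# arbitrary starting point, tune it to your data size
data = data.coalesce(200)
data.cache()
data.registerTempTable("data")
data.count()  # any action works to materialize the cache

repartition() would do the same with a shuffle, which can be worth it if the coalesced partitions end up skewed.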

> On 26 Apr 2016, at 00:47, Imran Akbar <skunkw...@gmail.com> wrote:
> 
> Hi,
> 
> I'm running a simple query like this through Spark SQL:
> 
> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND 
> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN ('cereal')").show()
> 
> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
> 
> The data was 100% cached in memory before I ran the query (see screenshot 1).
> The data was cached like this:
> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR 
> dt_year=2016)")
> data.cache()
> data.registerTempTable("data")
> and then I ran an action query to load the data into the cache.
> 
> I see lots of rows of logs like this:
> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as values in 
> memory (estimated size 2.5 MB, free 9.7 GB)
> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from memory
> 
> Screenshot 2 shows the job page of the longest job.
> 
> The data was partitioned in Parquet by month, country, and product before I 
> cached it.
> 
> Any ideas what the issue could be?  This is running on localhost.
> 
> regards,
> imran
> <Screen Shot 2016-04-25 at 3.43.03 PM.png>
> <Screen Shot 2016-04-25 at 3.42.15 PM.png>
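
One more thing worth checking, given the "6 blocks selected for dropping" and "Dropping block ... from memory" lines: cached blocks are being evicted while the query runs, which also costs time. Since this runs on localhost, the driver heap is all the memory Spark has; if the machine has the RAM, launching with more of it may help, e.g. (16g is an arbitrary example value):

pyspark --driver-memory 16g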

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
