Are you sure it is not spilling to disk? How many rows are cached in your result set from sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)")?
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 25 April 2016 at 23:47, Imran Akbar <skunkw...@gmail.com> wrote:
> Hi,
>
> I'm running a simple query like this through Spark SQL:
>
> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
> ('cereal')").show()
>
> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>
> The data was 100% cached in memory before I ran the query (see screenshot 1).
> The data was cached like this:
>
> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
> data.cache()
> data.registerTempTable("data")
>
> and then I ran an action query to load the data into the cache.
>
> I see lots of rows of logs like this:
>
> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as values
> in memory (estimated size 2.5 MB, free 9.7 GB)
> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from memory
>
> Screenshot 2 shows the job page of the longest job.
>
> The data was partitioned in Parquet by month, country, and product before
> I cached it.
>
> Any ideas what the issue could be? This is running on localhost.
>
> regards,
> imran
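A note on the "blocks selected for dropping" lines: they indicate that cached partitions are being evicted from memory, so parts of the DataFrame are recomputed on each query. A minimal sketch of how one might confirm this and spill to disk instead (assumes a live sqlContext and the "raw" table from the thread; the snippet is not runnable standalone):

```python
# Sketch only: requires a running Spark 1.x sqlContext with the "raw"
# table registered, as in the original post.
from pyspark import StorageLevel

data = sqlContext.sql(
    "SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)")

# Persist with an explicit storage level so evicted partitions spill to
# local disk rather than being dropped and recomputed.
data.persist(StorageLevel.MEMORY_AND_DISK)
data.registerTempTable("data")

n = data.count()          # action that materializes the cache
print("rows cached:", n)  # answers "how many rows are cached"
print(data.is_cached)     # True once the persist has been requested

# The Storage tab of the Spark UI (http://localhost:4040/storage/ when
# running on localhost) then shows the fraction held in memory vs. on
# disk; repeated "Dropping block ... from memory" log lines mean the
# cache does not fit at the current storage level.
```

If the cache genuinely fits, the Storage tab should report 100% in memory with zero bytes on disk; otherwise the 3-minute query time is consistent with evicted partitions being re-read from Parquet.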