Hi Experts,

I have a Parquet dataset of 550 MB (9 blocks) in HDFS, and I want to run SQL queries against it repeatedly.
A few questions:

1. When I do the following (persist to memory after reading from disk), it takes a lot of time to persist to memory. Any suggestions on how to tune this?

    import org.apache.spark.storage.StorageLevel

    val inputP = sqlContext.parquetFile("some HDFS path")
    inputP.registerTempTable("sample_table")
    inputP.persist(StorageLevel.MEMORY_ONLY)
    val result = sqlContext.sql("some sql query")
    result.count

Note: once the data is persisted to memory, queries return in a fraction of a second from the second query onwards. So my concern is how to reduce the time taken when the data is first loaded into the cache.

2. I have observed that if I omit the line

    inputP.persist(StorageLevel.MEMORY_ONLY)

the first query execution is comparatively quick (say it takes 1 min), since the load-to-memory time is saved. But to my surprise, the second time I run the same query it takes only 30 sec, and inputP is not reconstructed from disk (checked in the UI). So my question is: does Spark use some kind of internal caching for inputP in this scenario?

Thanks in advance.

Regards,
Sam

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
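P.S. For completeness, here is the whole sequence as a self-contained sketch (assuming the Spark 1.x SQLContext API; the HDFS path and query are placeholders). It also shows sqlContext.cacheTable as an alternative to persist(MEMORY_ONLY): cacheTable stores the table in Spark SQL's in-memory columnar format, which is typically more compact than caching the row-based representation, and a COUNT(*) up front forces the cache to materialize eagerly instead of on the first real query.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val sc: SparkContext = ???
val sqlContext = new SQLContext(sc)

// Read the Parquet dataset and register it as a temp table.
val inputP = sqlContext.parquetFile("some HDFS path")
inputP.registerTempTable("sample_table")

// Cache via Spark SQL's columnar cache instead of RDD persist.
sqlContext.cacheTable("sample_table")

// Trigger materialization of the cache eagerly, so the first
// user-facing query does not pay the load-to-memory cost.
sqlContext.sql("SELECT COUNT(*) FROM sample_table").collect()

// Subsequent queries now read from the in-memory columnar cache.
val result = sqlContext.sql("some sql query")
result.count
```

This is only a sketch of one tuning direction, not a measured fix; whether it reduces the initial caching time on this dataset would need to be checked against the UI.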