Hi Experts,

I have a 550 MB Parquet dataset (9 HDFS blocks) in HDFS, and I want to run SQL
queries against it repeatedly.

A few questions:

1. When I do the below (persist to memory after reading from disk), it takes a
lot of time to persist to memory. Any suggestions on how to tune this?

     import org.apache.spark.storage.StorageLevel.MEMORY_ONLY

     val inputP = sqlContext.parquetFile("some HDFS path")
     inputP.registerTempTable("sample_table")
     inputP.persist(MEMORY_ONLY)   // lazy: nothing is cached yet
     val result = sqlContext.sql("some sql query")
     result.count                  // first action reads from disk and fills the cache

Note: once the data is persisted to memory, queries return in a fraction of a
second from the second query onwards. So my concern is how to reduce the time
taken when the data is first loaded into the cache.
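For reference, here is a sketch of the same flow using Spark SQL's table-level cache instead of RDD-level persist (assuming Spark 1.x APIs, an existing SparkContext `sc`, and the same placeholder path and query as above; `cacheTable` stores the table in Spark SQL's compressed in-memory columnar format, which is usually a better fit for repeated SQL queries than a plain MEMORY_ONLY persist of the row objects):

```scala
import org.apache.spark.sql.SQLContext

// assumes an existing SparkContext `sc`
val sqlContext = new SQLContext(sc)

val inputP = sqlContext.parquetFile("some HDFS path")
inputP.registerTempTable("sample_table")

// caches the table in Spark SQL's columnar in-memory store;
// this is also lazy -- the first query materializes the cache
sqlContext.cacheTable("sample_table")

val result = sqlContext.sql("some sql query")
result.count  // first run loads the cache; later runs hit memory
```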


2. I have observed that if I omit the line

     inputP.persist(MEMORY_ONLY)

the first query execution is comparatively quick (say 1 min), since the
load-into-memory time is saved. But to my surprise, the second run of the same
query takes only 30 sec, as inputP is not rebuilt from disk (checked in the UI).

So my question is: does Spark use some kind of internal caching for inputP in
this scenario?
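One way I can think of to rule out Spark SQL's own cache from code (assuming Spark 1.x, where SQLContext exposes an isCached method for registered tables) would be:

```scala
// assumes the sqlContext and registered "sample_table" from question 1;
// isCached only reports Spark SQL's columnar cache, so if it prints
// false here, any speedup must come from elsewhere (e.g. the OS page
// cache, or Parquet footer/metadata caching)
println(sqlContext.isCached("sample_table"))
```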

Thanks in advance

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
