I believe the in-memory solution lacks the storage indexes (per-row-group min/max statistics) that Parquet / ORC have.
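For illustration (a sketch run from spark-shell in the same application; the exact plan text differs between versions), the pushed-down filter shows up in the physical plan of a query against your parquet table:

// The FileScan parquet node should list the predicate under "PushedFilters";
// that is what lets the reader skip row groups whose min/max statistics
// exclude the value.
spark.sql("SELECT * FROM event_jan_01_par WHERE day_registered = 20190102").explain()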
The in-memory solution is more suitable if you iterate over the whole data set frequently.

On 15.01.2019 at 19:20, Tomas Bartalos <tomas.barta...@gmail.com> wrote:
>
> Hello,
>
> I'm using spark-thrift server and I'm searching for the best performing solution to query a hot set of data. I'm processing records with a nested structure, containing subtypes and arrays. One record takes up several KB.
>
> I tried to make some improvement with cache table:
> cache table event_jan_01 as select * from events where day_registered = 20190102;
>
> If I understood correctly, the data should be stored in in-memory columnar format with storage level MEMORY_AND_DISK, so data which doesn't fit into memory will be spilled to disk (I assume also in columnar format?).
> I cached 1 day of data (1 M records), and according to the Spark UI storage tab none of the data was cached in memory and everything was spilled to disk. The size of the data was 5.7 GB.
> Typical queries took ~20 sec.
>
> Then I tried to store the data in parquet format:
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as select * from event_jan_01;
>
> The whole parquet table took up only 178 MB,
> and typical queries took 5-10 sec.
>
> Is it possible to tune Spark to spill the cached data in parquet format?
> Why was the whole cached table spilled to disk with nothing staying in memory?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
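PS: As far as I know, the cache always spills in Spark's own columnar block format, never as Parquet. If you want to control the storage level itself, one option is the catalog API (a sketch; I believe this overload takes a storage level since Spark 2.3, and it has to run inside the same application as the thrift server):

import org.apache.spark.storage.StorageLevel

// Cache with an explicit storage level instead of the MEMORY_AND_DISK
// default of CACHE TABLE. With MEMORY_ONLY, partitions that don't fit
// are dropped and recomputed instead of being written to disk.
spark.catalog.cacheTable("event_jan_01", StorageLevel.MEMORY_ONLY)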