I believe the in-memory solution lacks the storage indexes that Parquet / ORC 
have.

The in-memory solution is more suitable if you frequently iterate over the 
whole data set.
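
To illustrate what I mean by a storage index, here is a toy sketch in plain 
Python. It is not Spark, Parquet or ORC code; the function names, the group 
size and the sample values are all made up for illustration. It only shows the 
general idea: if each group of rows carries min/max statistics, a selective 
filter can skip groups without reading them.

```python
# Toy model of min/max "storage index" pruning; not Spark or Parquet code.
# All names and data below are hypothetical.

def build_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equal(groups, target):
    """Return the matching rows and the number of groups actually read."""
    matches, groups_read = [], 0
    for g in groups:
        if target < g["min"] or target > g["max"]:
            continue  # statistics prove that no row in this group can match
        groups_read += 1
        matches.extend(v for v in g["rows"] if v == target)
    return matches, groups_read

# Sorted sample data, so each group covers a narrow value range.
days = [20190101] * 4 + [20190102] * 4 + [20190103] * 4
groups = build_row_groups(days, group_size=4)
matches, groups_read = scan_equal(groups, 20190102)
print(len(matches), groups_read, len(groups))  # prints: 4 1 3
```

Only one of the three groups is read for the selective filter. A query that 
touches every row gains nothing from such statistics, which is the case where 
the in-memory solution fits better.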

> On 15.01.2019 at 19:20, Tomas Bartalos <tomas.barta...@gmail.com> wrote:
> 
> Hello,
> 
> I'm using the Spark Thrift server and I'm looking for the best-performing 
> way to query a hot set of data. I'm processing records with a nested 
> structure, containing subtypes and arrays. One record takes up several KB.
> 
> I tried to make some improvement with cache table:
> cache table event_jan_01 as select * from events where day_registered = 
> 20190102;
> 
> If I understood correctly, the data should be stored in in-memory columnar 
> format with storage level MEMORY_AND_DISK, so data which doesn't fit in 
> memory will be spilled to disk (I assume also in columnar format (?)).
> I cached 1 day of data (1 M records) and according to the Spark UI Storage 
> tab none of the data was cached in memory and everything was spilled to 
> disk. The size of the data was 5.7 GB.
> Typical queries took ~ 20 sec.
> 
> Then I tried to store the data to parquet format:
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as 
> select * from event_jan_01;
> 
> The whole parquet took up only 178MB.
> And typical queries took 5-10 sec.
> 
> Is it possible to tune Spark to spill the cached data in Parquet format?
> Why was the whole cached table spilled to disk, with nothing left in memory?
> 
> Spark version: 2.4.0
> 
> Best regards,
> Tomas
> 
