So I think caching large data is not a best practice.

At 2019-01-16 12:24:34, "大啊" <belie...@163.com> wrote:

Hi, Tomas.
Thanks, your question gave me something to think about. But the best practice with caching is usually to cache smaller data.
I think caching large data will consume too much memory or disk space.
Spilling the cached data in Parquet format might be a good improvement.
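A minimal sketch of the "cache smaller data" idea in the DataFrame API (Scala, spark-shell), assuming the same events table and day_registered column as in the mail below; the projected columns and the view name are made up:

import org.apache.spark.storage.StorageLevel

// Cache only the columns the hot queries actually need instead of the full
// nested records, which should shrink the cached footprint considerably.
val hotSlim = spark.table("events")
  .where("day_registered = 20190102")
  .select("event_id", "event_type", "day_registered")  // hypothetical projection
hotSlim.persist(StorageLevel.MEMORY_AND_DISK)
hotSlim.createOrReplaceTempView("event_jan_01_slim")   // hypothetical view name
hotSlim.count()                                        // caching is lazy; force materialization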


At 2019-01-16 02:20:56, "Tomas Bartalos" <tomas.barta...@gmail.com> wrote:

Hello,


I'm using the spark-thrift server and I'm looking for the best-performing way to 
query a hot set of data. I'm processing records with a nested structure, containing 
subtypes and arrays. One record takes up several KB.


I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered = 20190102;
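For reference, a sketch of how that cache can be inspected or dropped from a Scala session (spark-shell); these catalog calls exist in Spark 2.4:

// Check whether the temp view created by CACHE TABLE ... AS SELECT is cached,
// and drop the cached data when it is no longer needed.
spark.catalog.isCached("event_jan_01")
spark.catalog.uncacheTable("event_jan_01")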




If I understood correctly, the data should be stored in an in-memory columnar 
format with storage level MEMORY_AND_DISK, so data which doesn't fit in memory 
will be spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records) and according to the Spark UI storage tab none 
of the data was cached to memory and everything was spilled to disk. The size 
of the data was 5.7 GB.
Typical queries took ~20 sec.
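For what it's worth, a couple of settings that control how the in-memory columnar cache is built (they exist in Spark 2.4; the values below are the defaults). They only affect the cached representation, not the format used when spilling to disk:

// Compress cached columns (default true) and set how many rows go into each
// cached column batch (default 10000).
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")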


Then I tried to store the data in Parquet format:

CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as
select * from event_jan_01;
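As a side note, a sketch of reading the written files back directly by path (Scala, using the location from the statement above; the view name is made up):

// Read the Parquet files and expose them to SQL under a temporary view.
val parDf = spark.read.parquet("/tmp/events/jan/02")
parDf.createOrReplaceTempView("event_jan_01_par_view")  // hypothetical name
spark.sql("select count(*) from event_jan_01_par_view").show()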




The whole Parquet table took up only 178 MB.
And typical queries took 5-10 sec.


Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk with nothing staying in memory?


Spark version: 2.4.0


Best regards,
Tomas






 
