Size of RDD larger than Size of data on disk

Suraj Satishkumar Sheth Tue, 25 Feb 2014 06:47:57 -0800

Hi All,
I have a folder in HDFS containing files that total 47GB. I am loading it into 
Spark as an RDD[String] and caching it. Spark uses around 97GB of RAM to cache 
it. Why does the cached RDD take up so much more space than the data on disk? 
Can we reduce the size of the RDD in Spark so that it is close to its size on 
disk?
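
As a back-of-the-envelope check on the numbers above (97GB / 47GB is roughly 2.06x), here is a minimal sketch of one possible contributor, not a confirmed diagnosis. It assumes the files are mostly ASCII text, that the JVM holds each String as UTF-16 (2 bytes per character), and that each String carries roughly 40 bytes of object overhead. The 40-byte figure, the 1000-character sample record, and all function names are assumptions made for illustration only.

```python
# Rough estimate of how much memory one line of text needs as a JVM String
# compared with its size on disk. Every constant here is an assumption.

STRING_OVERHEAD_BYTES = 40  # assumed approximate per-String object overhead


def on_disk_bytes(line: str) -> int:
    """Size of the line stored as UTF-8 text, plus one newline byte."""
    return len(line.encode("utf-8")) + 1


def in_memory_bytes(line: str) -> int:
    """Estimated size as a JVM String: UTF-16 payload plus assumed overhead."""
    return len(line.encode("utf-16-be")) + STRING_OVERHEAD_BYTES


def expansion_ratio(line: str) -> float:
    """How many times larger the line is in memory than on disk."""
    return in_memory_bytes(line) / on_disk_bytes(line)


sample = "x" * 1000  # hypothetical 1000-character ASCII record
print(round(expansion_ratio(sample), 2))  # prints 2.04
```

Under these assumptions the estimate lands close to the observed ratio. If that is indeed the cause, persisting with a serialized storage level such as StorageLevel.MEMORY_ONLY_SER may bring the cached size closer to the on-disk size, at the cost of extra CPU when the data is read.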


Thanks and Regards,
Suraj Sheth
