The problem is that Java objects can take more space than the underlying data, 
but there are options in Spark to store data in serialized form to get around 
this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html.
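
For example, a minimal sketch (assuming an existing SparkContext named sc; the
HDFS path is a placeholder):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///path/to/data")   // placeholder path
// Cache partitions as serialized byte arrays instead of deserialized Java objects
lines.persist(StorageLevel.MEMORY_ONLY_SER)
lines.count()   // force materialization so the cached size shows up in the Storage tab

Registering Kryo via spark.serializer can shrink the serialized form further, at
the cost of some CPU when the data is read back.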

Matei

On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar Sheth <suraj...@adobe.com> 
wrote:

> Hi Mayur,
> Thanks for replying. Is it usually double the size of the data on disk?
> I have observed this many times. The Storage section of Spark is telling me that 
> 100% of the RDD is cached, using 97 GB of RAM, while the data in HDFS is only 47 GB.
>  
> Thanks and Regards,
> Suraj Sheth
>  
> From: Mayur Rustagi [mailto:mayur.rust...@gmail.com] 
> Sent: Tuesday, February 25, 2014 11:19 PM
> To: user@spark.apache.org
> Cc: u...@spark.incubator.apache.org
> Subject: Re: Size of RDD larger than Size of data on disk
>  
> Spark may take more RAM than required by the RDD. Can you look at the Storage section 
> of Spark and see how much space the RDD is taking in memory? It may still take more 
> storage than on disk, as Java objects have some overhead. 
> Consider enabling compression for the RDD (see the sketch below). 
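> 
> For example, something like this (a rough sketch; note that spark.rdd.compress 
> only applies to serialized storage levels such as MEMORY_ONLY_SER):
> 
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.storage.StorageLevel
> 
> // Compress serialized cached partitions (trades some CPU for memory)
> val conf = new SparkConf().set("spark.rdd.compress", "true")
> val sc = new SparkContext(conf)
> 
> // Use a serialized storage level so the compression setting takes effect
> val rdd = sc.textFile("hdfs:///path/to/data")   // placeholder path
> rdd.persist(StorageLevel.MEMORY_ONLY_SER)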
> 
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>  
>  
> 
> On Tue, Feb 25, 2014 at 6:47 AM, Suraj Satishkumar Sheth <suraj...@adobe.com> 
> wrote:
> Hi All,
> I have a folder in HDFS containing files that total 47 GB. I am loading it in 
> Spark as an RDD[String] and caching it. The total amount of RAM that Spark 
> uses to cache it is around 97 GB. I want to know why Spark is taking up so 
> much space for the RDD. Can we reduce the RDD size in Spark and make it 
> similar to its size on disk?
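> 
> For reference, a minimal sketch of the setup described here (the path is a placeholder):
> 
> val rdd = sc.textFile("hdfs:///path/to/folder")   // ~47 GB of files on HDFS
> rdd.cache()    // default MEMORY_ONLY: deserialized Java objects in memory
> rdd.count()    // materializes the cache; the Storage tab then reports ~97 GB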
>  
> Thanks and Regards,
> Suraj Sheth
