Re: Understanding and optimizing Spark disk usage during a job.

2014-11-29 Thread Vikas Agarwal
I may not be correct (in fact, I may be completely the opposite), but here is my guess: assuming 8 bytes per double, 4,000 vectors of dimension 400 for 12k images would require 153.6 GB (12k * 4000 * 400 * 8 bytes) of data, which may justify the amount of data being written to disk. Without compression, it
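The estimate above can be sanity-checked with a few lines of Scala. The image, vector, and dimension counts come from the message; note that 153.6 GB here is a decimal (base-1000) gigabyte:

```scala
object DiskEstimate extends App {
  val images         = 12000L // 12k images
  val vectorsPerImg  = 4000L  // vectors per image
  val dims           = 400L   // dimension of each vector
  val bytesPerDouble = 8L

  val totalBytes = images * vectorsPerImg * dims * bytesPerDouble
  val gb = totalBytes / 1e9   // decimal GB, matching the 153.6 GB figure

  println(f"$gb%.1f GB")      // prints 153.6 GB
}
```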

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistence level is set on an RDD, that RDD is saved to memory/disk/elsewhere when it is evaluated, so that it can be re-used. The level is applied to that RDD, so that subsequent uses of the RDD can use the
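A minimal sketch of the pattern being described, using the 1.x-era RDD API (the app name and data are illustrative, not from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[*]", "persist-example")
val rdd = sc.parallelize(1 to 1000000)

// Mark the RDD for persistence; nothing is computed yet.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// The first action evaluates the RDD and materializes it
// to memory, spilling partitions that don't fit to disk.
println(rdd.count())

// Subsequent actions reuse the persisted copy instead of
// recomputing the lineage from scratch.
println(rdd.sum())
```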

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
Which persistence level are you talking about? MEMORY_AND_DISK ? Sent from my mobile phone On Apr 9, 2014 2:28 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Thanks, Andrew. That helps. For 1, it sounds like the data for the RDD is held in memory and then only written to disk after

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
The groupByKey would be aware of the subsequent persist -- that's part of the reason operations are lazy. As for whether it's materialized in memory first and then flushed to disk, versus streamed to disk, I'm not sure of the exact behavior. What I'd expect to happen is that the RDD is
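The laziness point above can be sketched as follows (the data is illustrative; `sc` is an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Building the lineage: no shuffle runs yet.
val grouped = pairs.groupByKey()

// persist() is also lazy -- it only annotates `grouped` with a
// storage level. So by the time the shuffle actually executes
// (at the first action), Spark already knows the grouped result
// should be stored at this level.
grouped.persist(StorageLevel.MEMORY_AND_DISK)

grouped.collect() // triggers the shuffle and the persistence
```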

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi, Any thoughts on this? Thanks. -Suren On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-) 1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transform's processing is complete? In the case of things like GroupBy, is the Seq backed by disk as it is being created? We're

Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/Sequences, like GroupBy and shuffling. Basically, one part of the question is when is
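For context, the explicit persist-to-disk the question refers to looks like the sketch below (the input path and pipeline are illustrative). The open question in the thread is what happens to the intermediate shuffle data that operations like groupByKey produce, independent of any such call:

```scala
import org.apache.spark.storage.StorageLevel

val words  = sc.textFile("hdfs:///data/input").flatMap(_.split(" "))
val byWord = words.map(w => (w, 1)).groupByKey()

// Explicitly ask Spark to keep the grouped result on disk only.
byWord.persist(StorageLevel.DISK_ONLY)
byWord.count()

// Separately from persist(), the shuffle that groupByKey performs
// writes intermediate map-output files to local disk on each node;
// that disk usage happens with or without a persist() call.
```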