I may not be correct (in fact I may have it completely backwards), but here is
my guess:
Assuming 8 bytes per double, 4000 vectors of dimension 400 for 12k images
would require 153.6 GB (12k * 4000 * 400 * 8) of data, which may justify the
amount of data being written to the disk. Without compression, it
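As a quick sanity check on the estimate above, the arithmetic works out as follows (decimal gigabytes, uncompressed doubles):

```python
# Rough size estimate from the thread: 12k images, each with 4000
# vectors of dimension 400, stored as 8-byte doubles, uncompressed.
num_images = 12_000
vectors_per_image = 4_000
dimension = 400
bytes_per_double = 8

total_bytes = num_images * vectors_per_image * dimension * bytes_per_double
print(total_bytes / 1e9)  # 153.6 (GB)
```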
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD and that RDD is
evaluated, it's saved to memory/disk/elsewhere so that it can be re-used. It's
applied to that RDD, so that subsequent uses of the RDD can use the
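The mark-then-cache-on-first-evaluation behavior described above can be sketched in plain Python. This is only an illustration of the semantics, not Spark's API or implementation; the `LazyDataset` class and its methods are invented for the sketch:

```python
class LazyDataset:
    """Sketch of Spark-style lazy persistence (not Spark's actual API)."""

    def __init__(self, compute):
        self._compute = compute   # deferred computation, like an RDD's lineage
        self._persisted = False
        self._cached = None

    def persist(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._persisted = True
        return self

    def collect(self):
        # First evaluation after persist() stores the result for reuse;
        # later evaluations hit the cache instead of recomputing.
        if self._persisted and self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persisted:
            self._cached = result
        return result

calls = []
ds = LazyDataset(lambda: calls.append(1) or [1, 2, 3]).persist()
ds.collect()
ds.collect()
print(len(calls))  # 1 -- computed once; the second collect() reuses the cache
```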
Which persistence level are you talking about? MEMORY_AND_DISK?
On Apr 9, 2014 2:28 PM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Thanks, Andrew. That helps.
For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after
The groupByKey would be aware of the subsequent persist -- that's part of
the reason why operations are lazy. As far as whether it's materialized in
memory first and then flushed to disk vs. streamed to disk, I'm not sure of
the exact behavior.
What I'd expect to happen would be that the RDD is
Hi,
Any thoughts on this? Thanks.
-Suren
On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Hi,
I know if we call persist with the right options, we can have Spark
persist an RDD's data on disk.
I am wondering what happens in intermediate operations
It might help if I clarify my questions. :-)
1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're
Hi,
I know if we call persist with the right options, we can have Spark persist
an RDD's data on disk.
I am wondering what happens in intermediate operations that could
conceivably create large collections/Sequences, like GroupBy and shuffling.
Basically, one part of the question is when is
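The concern about GroupBy materializing large collections can be illustrated in plain Python. This is a sketch of the general group-by-key pattern, not Spark's implementation: all values for a key end up collected into one in-memory list before anything downstream (such as a persist) can see them.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect (key, value) pairs into per-key lists.

    Each key's values are fully materialized in memory as one list,
    which is the behavior the question above is asking about.
    """
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

print(group_by_key([("a", 1), ("b", 2), ("a", 3)]))
# {'a': [1, 3], 'b': [2]}
```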