If you’re trying to measure performance assuming the dataset is already 
in memory, then doing cache() and count() would work. However, if you want to 
measure an end-to-end workflow, it might be better to let the data loading and 
the operations happen together, as Spark does by default. This gives the 
engine room to pipeline them and may result in a faster time than “loading” 
first (where you’re I/O-bound) and then computing afterward (where you’re CPU- 
or communication-bound).
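
(A minimal sketch of the cache()-and-count() approach, not from the original thread: the input path, transform, and step-2 operation are placeholders, and it assumes a SparkContext named `sc` as in the Spark shell.)

  // step 1: load and transform, then force execution and keep the result in memory
  val transformed = sc.textFile("hdfs:///path/to/input")   // placeholder input path
    .map(line => line.split('\t'))                         // placeholder transform
    .cache()
  transformed.count()   // runs the full lineage so the cached data is materialized

  // step 2: time only the operation of interest, now that step 1 is done
  val start = System.nanoTime()
  val result = transformed.map(_.length).reduce(_ + _)     // placeholder operation
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println("step 2 took " + elapsedMs + " ms")

For the end-to-end measurement, you would instead drop the cache()/count() and time the whole chain from textFile through the final action in one go.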

Matei

On Dec 27, 2013, at 7:29 AM, Sebastian Schelter <[email protected]> wrote:

> Hi,
> 
> I want to run some benchmarks using Spark for which I need to explicitly
> control the lazy execution. My benchmarks basically consist of two steps:
> 
> 1. loading and transforming a dataset
> 2. applying an operation to the transformed dataset, where I want to
> measure the runtime
> 
> How can I make sure that the operations for step 1 are fully executed
> before I start step 2, whose time I'd like to measure?
> 
> Would a solution be to invoke cache() and count() on the RDD holding my
> dataset after step 1?
> 
> --sebastian
