Hi Matei,

I want to get a feel for the pros and cons of executing a certain
computation in a certain way (rather than benchmark Spark itself). So
I'll go with the cache() and count() approach.
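
Roughly what I have in mind is sketched below (the input path, the
parsing step and the step-2 operation are just placeholders, not my
actual benchmark):

    import org.apache.spark.SparkContext

    object BenchmarkSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "BenchmarkSketch")

        // Step 1: load and transform the dataset, then force materialization
        // so that none of this work leaks into the timed part.
        val transformed = sc.textFile("data.csv")
          .map(line => line.split(",").map(_.toDouble))
          .cache()
        transformed.count() // action triggers execution; RDD is now cached

        // Step 2: time only the operation of interest, now that the
        // transformed data sits in memory.
        val start = System.nanoTime()
        val result = transformed.map(_.sum).reduce(_ + _)
        val elapsedMs = (System.nanoTime() - start) / 1e6
        println(s"result = $result, step 2 took $elapsedMs ms")

        sc.stop()
      }
    }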

--sebastian

On 27.12.2013 17:35, Matei Zaharia wrote:
> If you’re trying to measure the performance assuming that a dataset is 
> already in memory, then doing cache() and count() would work.

> However, if you want to measure an end-to-end workflow, it might be good
> to leave the operations and the data loading to happen together, as
> Spark does by default. This gives the engine room to pipeline these and
> might result in a faster time than “loading” first (where you’re
> IO-bound) and then computing after (where you’re CPU- or
> communication-bound).
> 
> Matei
> 
> On Dec 27, 2013, at 7:29 AM, Sebastian Schelter <[email protected]> wrote:
> 
>> Hi,
>>
>> I want to run some benchmarks using Spark for which I need to explicitly
>> control the lazy execution. My benchmarks basically consist of two steps:
>>
>> 1. loading and transforming a dataset
>> 2. applying an operation to the transformed dataset, where I want to
>> measure the runtime
>>
>> How can I make sure that the operations for step 1 are fully executed
>> before I start step 2, whose time I'd like to measure?
>>
>> Would a solution be to invoke cache() and count() on the RDD holding my
>> dataset after step 1?
>>
>> --sebastian
> 
