Hi,

I want to run some benchmarks using Spark, for which I need to explicitly
control Spark's lazy execution. My benchmarks basically consist of two steps:

1. loading and transforming a dataset
2. applying an operation to the transformed dataset, where I want to
measure the runtime

How can I make sure that the operations for step 1 are fully executed
before I start step 2, whose time I'd like to measure?

Would a solution be to invoke cache() and count() on the RDD holding my
dataset after step 1?
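
Something like the following is what I have in mind (a minimal sketch using
the Scala RDD API; the input path and the map transform are just placeholders
for my actual step 1, and the second count() stands in for the operation I
want to measure):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Benchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("benchmark"))

    // Step 1: load and transform. cache() only *marks* the RDD for storage;
    // count() is an action that forces the whole lineage to execute now.
    val data = sc.textFile("hdfs:///path/to/dataset") // placeholder input
      .map(_.split(","))                              // placeholder transform
      .cache()
    data.count() // materializes the cached RDD before timing starts

    // Step 2: time only the operation under test.
    val start = System.nanoTime()
    data.count() // stand-in for the measured operation
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"step 2 took $elapsedMs ms")

    sc.stop()
  }
}
```

One thing I'm unsure about: if the dataset doesn't fit in memory, cached
partitions could be evicted and step 2 would silently recompute part of
step 1, skewing the measurement.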

--sebastian
