Hi, I want to run some benchmarks using Spark, for which I need to explicitly control its lazy execution. My benchmarks consist of two steps:
1. loading and transforming a dataset
2. applying an operation to the transformed dataset, whose runtime I want to measure

How can I make sure that the operations of step 1 are fully executed before I start step 2, so that only step 2's time is measured? Would invoking cache() and then count() on the RDD holding my dataset after step 1 be a solution? --sebastian
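A minimal sketch of the cache()-then-count() approach described above, assuming a local SparkContext; the input path, transformation, and measured operation are hypothetical placeholders. cache() only marks the RDD for in-memory storage and is itself lazy; count() is an action that forces the whole lineage to execute, so the transformed data is materialized before the timer starts:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Benchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("benchmark").setMaster("local[*]"))

    // Step 1: load and transform. cache() marks the RDD for in-memory
    // storage; count() is an action that forces full materialization now.
    val data = sc.textFile("data.txt")   // hypothetical input path
      .map(_.split(","))                 // hypothetical transformation
      .cache()
    data.count()                         // materialize before timing

    // Step 2: time only the operation under test.
    val start = System.nanoTime()
    val result = data.map(_.length).reduce(_ + _)  // hypothetical measured op
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"step 2 took $elapsedMs ms (result = $result)")

    sc.stop()
  }
}
```

One caveat: cached partitions can be evicted under memory pressure, in which case step 2 would silently re-run part of the lineage; checking the Storage tab of the Spark UI (or persisting with MEMORY_AND_DISK) can guard against that.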
