Hi everyone,

I have a question about RDD.takeSample(). This is an action, not a 
transformation – but is any optimization made to reduce the amount of 
computation that's done, for example only running the transformations over a 
smaller subset of the data since only a sample will be returned as a result?

For context, I'm trying to measure how long a set of transformations takes 
on our dataset without persisting anything to disk. So I want to stack the 
operations on the RDD and then invoke an action that doesn't save the result 
to disk but still gives me a good idea of how long transforming the whole 
dataset takes.
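
To make the measurement concrete, here is a rough sketch of the kind of timing I have in mind. It is in Python, and the helper name time_action is hypothetical. The commented usage assumes a PySpark RDD called transformed (also hypothetical) with all the transformations already stacked on it, and uses count() as one candidate action, on the assumption that it has to evaluate every record. A plain-Python stand-in workload is included so the sketch runs without a cluster.

```python
import time


def time_action(action):
    """Run a zero-argument callable and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return elapsed, result


# Hypothetical usage with PySpark, where `transformed` is the RDD with all
# transformations stacked on it (nothing runs until an action is invoked):
#   elapsed, n = time_action(lambda: transformed.count())

# Stand-in workload so the sketch runs without a cluster: a lazy map over
# 1000 records that is only evaluated when the records are counted.
elapsed, n = time_action(lambda: sum(1 for _ in map(lambda x: x * 2, range(1000))))
print(f"processed {n} records in {elapsed:.6f}s")
```

The point of wrapping only the action is that the transformations themselves are lazy, so the elapsed time covers the whole stacked pipeline rather than just the final step.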

Thanks,

-Matt Cheah
