Actually, we want the opposite – we want as much data to be computed as possible.
It's only for benchmarking purposes, of course.

-Matt Cheah

From: Matei Zaharia <[email protected]>
Reply-To: [email protected]
Date: Thursday, December 5, 2013 10:31 AM
To: [email protected]
Cc: Mingyu Kim <[email protected]>
Subject: Re: takeSample() computation

Hi Matt,

Try using take() instead, which will only begin computing from the start of the RDD (first partition) if the number of elements you ask for is small.

Note that if you’re doing any shuffle operations, like groupBy or sort, then the stages before that do have to be computed fully.

Matei

On Dec 5, 2013, at 10:13 AM, Matt Cheah <[email protected]> wrote:

Hi everyone,

I have a question about RDD.takeSample(). This is an action, not a transformation – but is any optimization made to reduce the amount of computation that's done, for example only running the transformations over a smaller subset of the data since only a sample will be returned as a result?

The context is, I'm trying to measure the amount of time a set of transformations takes on our dataset without persisting to disk. So I want to stack the operations on the RDD and then invoke an action that doesn't save the result to disk but can still give me a good idea of how long transforming the whole dataset takes.

Thanks,

-Matt Cheah
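
For the benchmarking goal described in this thread (run every transformation over the whole dataset, write nothing to disk), one possible approach is to end the chain with an action that has to visit every element, such as count() or a no-op foreach(). The following is only a rough sketch under assumptions: the input data, the two transformations, the `timed` helper, and the "local[4]" master are hypothetical placeholders and not taken from the thread.

```scala
import org.apache.spark.SparkContext

object TransformBenchmark {
  // Hypothetical helper: evaluates `body` once and returns its result
  // together with the elapsed wall-clock time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 4 threads; a real benchmark would
    // point at the cluster and at the real dataset instead.
    val sc = new SparkContext("local[4]", "transform-benchmark")
    try {
      // Placeholder input and transformations standing in for the real ones.
      // Nothing is computed yet: transformations are lazy.
      val transformed = sc.parallelize(1 to 1000000, 8)
        .map(x => x.toLong * x)
        .filter(_ % 2 == 0)

      // count() iterates over every partition, so the full chain runs,
      // and only a single number is returned to the driver.
      val (n, countSecs) = timed(transformed.count())
      println("count() saw " + n + " elements in " + countSecs + " s")

      // A no-op foreach() also touches every element and returns nothing.
      // The RDD is not cached, so the chain is recomputed from scratch here.
      val foreachSecs = timed(transformed.foreach(_ => ()))._2
      println("foreach() took " + foreachSecs + " s")
    } finally {
      sc.stop()
    }
  }
}
```

Neither action writes the result anywhere, so the measured time should mostly reflect the transformations themselves plus scheduling overhead. If the RDD were cached between the two runs, the second timing would no longer include the recomputation.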
