Actually, we want the opposite – we want as much data to be computed as 
possible.

It's only for benchmarking purposes, of course.

-Matt Cheah

From: Matei Zaharia <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, December 5, 2013 10:31 AM
To: "[email protected]" <[email protected]>
Cc: Mingyu Kim <[email protected]>
Subject: Re: takeSample() computation

Hi Matt,

Try using take() instead. It starts computing from the first partition of the 
RDD and stops early if the number of elements you ask for is small.

Note that if you’re doing any shuffle operations, like groupBy or sort, then 
the stages before that do have to be computed fully.

Matei
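
The difference between an action that can stop early and one that has to touch 
every element can be illustrated with a small toy model. This is a sketch in 
plain Python, not Spark code: ToyRDD, its computed counter, and the partition 
sizes are hypothetical names and values made up for illustration. The toy also 
evaluates a whole partition at a time, which is a simplification.

```python
class ToyRDD:
    """Toy stand-in for a partitioned, lazily evaluated dataset (not Spark)."""

    def __init__(self, partitions, fn=lambda x: x):
        self.partitions = partitions  # list of lists, one list per partition
        self.fn = fn                  # composed transformation, applied lazily
        self.computed = 0             # how many elements have been transformed

    def map(self, g):
        # Stack another transformation; nothing is evaluated here.
        f = self.fn
        return ToyRDD(self.partitions, lambda x: g(f(x)))

    def _compute(self, partition):
        out = []
        for x in partition:
            self.computed += 1
            out.append(self.fn(x))
        return out

    def take(self, n):
        # Evaluate partitions in order and stop once n elements are available.
        result = []
        for p in self.partitions:
            if len(result) >= n:
                break
            result.extend(self._compute(p))
        return result[:n]

    def count(self):
        # Every partition has to be evaluated to count the elements.
        return sum(len(self._compute(p)) for p in self.partitions)


# Four partitions of 100 elements each (arbitrary illustrative sizes).
data = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
rdd = ToyRDD(data).map(lambda x: x * 2)

first = rdd.take(5)
touched_by_take = rdd.computed   # only the first partition was evaluated

rdd.computed = 0
total = rdd.count()
touched_by_count = rdd.computed  # all partitions were evaluated
```

In this model take(5) evaluates a single partition, while count() evaluates all 
of them and returns only a number. For timing a full pass over the data, an 
action of the second kind is the relevant one.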

On Dec 5, 2013, at 10:13 AM, Matt Cheah <[email protected]> wrote:

Hi everyone,

I have a question about RDD.takeSample(). This is an action, not a 
transformation – but is any optimization made to reduce the amount of 
computation that's done, for example only running the transformations over a 
smaller subset of the data since only a sample will be returned as a result?

The context is, I'm trying to measure the amount of time a set of 
transformations takes on our dataset without persisting to disk. So I want to 
stack the operations on the RDD and then invoke an action that doesn't save the 
result to disk but can still give me a good idea of how long transforming the 
whole dataset takes.

Thanks,

-Matt Cheah
