Thanks everyone – I think we're going to go with collect() and reject requests 
that attempt to retrieve overly large result sets.
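
For concreteness, one way such a guard might look is sketched below. The 
threshold, the count()-before-collect() check, and all the names here are 
purely illustrative, not necessarily what we'll actually do:

import org.apache.spark.{SparkConf, SparkContext}

object CollectWithGuard {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(
      new SparkConf().setAppName("collect-guard").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 10000)

    val maxElements = 1000000L   // arbitrary cutoff for this sketch
    val n = rdd.count()          // size the result before pulling it back
    if (n > maxElements)
      sys.error(s"Result has $n elements; refusing to collect more than $maxElements")

    val rows = rdd.collect()     // small enough to materialize on the driver
    println(rows.length)
    sc.stop()
  }
}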

However, I think my original concern still stands. Some reading online shows 
that Microsoft Excel, for example, supports spreadsheets on the order of 
2-4 GB 
(http://social.technet.microsoft.com/Forums/office/en-US/60bf34fb-5f02-483a-a54b-645cc810b30f/excel-2013-file-size-limits-powerpivot?forum=officeitpro).
 If there is a 2 GB RDD, however, streaming all of it back to the driver seems 
wasteful, when in reality we could fetch chunks of it at a time and load only 
part of it into driver memory, as opposed to using 2 GB of RAM on the driver. 
I also don't know what the maximum frame size that can be set via 
spark.akka.frameSize actually is.
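
To make the chunked alternative concrete, something along these lines, pulling 
back one partition at a time with SparkContext.runJob, is the sort of thing I 
mean. This is an untested sketch; the exact runJob overload (e.g. whether an 
extra allowLocal argument is required) varies across Spark versions, and the 
names are only for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionAtATime {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(
      new SparkConf().setAppName("chunked-fetch").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000, 100)

    // Pull back one partition at a time so the driver only ever holds one
    // partition's worth of data instead of the whole result at once.
    for (p <- 0 until rdd.partitions.length) {
      val chunk: Array[Int] =
        sc.runJob(rdd, (it: Iterator[Int]) => it.toArray, Seq(p)).head
      chunk.foreach(println)   // stand-in for whatever the visitor does with the chunk
    }
    sc.stop()
  }
}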

-Matt Cheah

From: Mark Hamstra <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, October 22, 2013 3:32 PM
To: user <[email protected]>
Subject: Re: Visitor function to RDD elements

Correct; that's the completely degenerate case where you can't do anything in 
parallel. Often you'll also want your iterator function to send back some 
information to an accumulator (perhaps just the result calculated with the last 
element of the partition), which is then fed back into the operation on the 
next partition as either a broadcast variable or part of the closure.
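
A rough, untested sketch of that pattern is below. It uses the old-style 
sc.accumulator interface (the accumulator API has been renamed in newer Spark 
releases), and the running-sum "visit" is just a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object SequentialVisit {
  def main(args: Array[String]): Unit = {
    val sc   = new SparkContext(
      new SparkConf().setAppName("sequential-visit").setMaster("local[*]"))
    val data = sc.parallelize(1 to 100, 4)

    var carry = 0L                        // value carried from one partition's job into the next
    for (p <- 0 until data.partitions.length) {
      val acc     = sc.accumulator(0L)    // receives whatever partition p computes
      val carried = carry                 // captured in this pass's closure
      data.mapPartitionsWithIndex[Int] { (i, iter) =>
        if (i == p) {
          var running = carried
          iter.foreach(x => running += x) // the "visit" for this partition
          acc += running                  // send the result back to the driver
        }
        Iterator.empty
      }.count()                           // an action is needed to run each job
      carry = acc.value                   // fed into the next pass's closure
    }
    println("Value carried through all partitions: " + carry)
    sc.stop()
  }
}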



On Tue, Oct 22, 2013 at 3:25 PM, Nathan Kronenfeld 
<[email protected]> wrote:
You shouldn't have to fly data around.

You can just run it first on partition 0, then on partition 1, etc...  I may 
have the name slightly off, but something approximately like:

for (p <- 0 until data.partitions.length)
  data.mapPartitionsWithIndex((i, iter) =>
    if (i == p) iter.map(fcn) else Iterator.empty  // only partition p does any work on this pass
  ).foreach(_ => ())  // some action is needed to actually run each job

should work... BUT that being said, you've now really lost the point of using 
Spark to begin with.

