Ah, this is a slightly different problem statement, in that you may still gain in taking advantage of Spark's parallelization for the data transformation.
If you want to avoid the serdes+network overhead of sending the results back to the driver, and instead have a consumer/sink to send the result set to, you might consider having a single reducer that streams the rows out to a local temporary file(s), then have the final reduce send (or trigger an external send) of that result set to your consumer. 2GB files are fairly small relative to TB disk sizes, and can easily stream within 10+ seconds for 100MB/s local disk or network bandwidths. If the original transformation would have taken minutes or longer sequentially then this approach is still a win in performance. -- Christopher T. Nguyen Co-founder & CEO, Adatao <http://adatao.com> linkedin.com/in/ctnguyen On Tue, Oct 22, 2013 at 3:48 PM, Matt Cheah <[email protected]> wrote: > Thanks everyone – I think we're going to go with collect() and kick out > things that attempt to obtain overly large sets. > > However, I think my original concern still stands. Some reading online > shows that Microsoft Excel, for example, supports displaying something on > the order of 2-4 GB sized spreadsheets ( > http://social.technet.microsoft.com/Forums/office/en-US/60bf34fb-5f02-483a-a54b-645cc810b30f/excel-2013-file-size-limits-powerpivot?forum=officeitpro). > If there is a 2GB RDD however streaming it all back to the driver seems > wasteful where in reality we could fetch chunks of it at a time and load > only parts in driver memory, as opposed to using 2GB of RAM on the driver. > In fact I don't know what the maximum frame size that can be set would be > via spark.akka.framesize. > > -Matt Cheah > > From: Mark Hamstra <[email protected]> > Reply-To: "[email protected]" < > [email protected]> > Date: Tuesday, October 22, 2013 3:32 PM > To: user <[email protected]> > > Subject: Re: Visitor function to RDD elements > > Correct; that's the completely degenerate case where you can't do > anything in parallel. Often you'll also want your iterator function to > send back some information to an accumulator (perhaps just the result > calculated with the last element of the partition) which is then fed back > into the operation on the next partition as either a broadcast variable or > part of the closure. > > > > On Tue, Oct 22, 2013 at 3:25 PM, Nathan Kronenfeld < > [email protected]> wrote: > >> You shouldn't have to fly data around >> >> You can just run it first on partition 0, then on partition 1, etc... >> I may have the name slightly off, but something approximately like: >> >> for (p <- 0 until numPartitions) >> data.mapPartitionsWithIndex((i, iter) => if (0 == p) iter.map(fcn) else >> List().iterator) >> >> should work... BUT that being said, you've now really lost the point of >> using Spark to begin with. >> >> >
