Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups.
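A rough sketch of what I mean (untested; the page size, the sleep-based throttle, and the push function are placeholders for your own logic):

    import org.apache.spark.rdd.RDD

    // Pushes an RDD to a downstream system in small pages at the driver.
    def pushInPages[T](rdd: RDD[T], pageSize: Int)(push: Seq[T] => Unit): Unit = {
      // toLocalIterator pulls one partition at a time back to the driver,
      // so only a single partition ever needs to fit in driver memory.
      rdd.toLocalIterator
        .grouped(pageSize)       // Scala iterator method: fixed-size chunks
        .foreach { page =>
          push(page)             // send this chunk to the downstream system
          Thread.sleep(1000L)    // crude throttle between pages
        }
    }

Note that toLocalIterator runs one Spark job per partition as you consume the iterator, so this trades some scheduling overhead for the bounded memory footprint.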
Matei

On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) wrote:

I have an RDD on the cluster that I'd like to iterate over and perform some operation on each element (push data from each element to another downstream system outside of Spark). I'd like to do this at the driver so I can throttle the rate at which I push to the downstream system (as opposed to submitting a job to the Spark cluster and parallelizing the work, which would likely flood the downstream system). The RDD is too big to collect() all at once back into the memory space of the driver.

Ideally I'd like to be able to "page" through the dataset, iterating through a chunk of n RDD elements at a time back at the driver. It doesn't have to be _exactly_ n elements, just a reasonably small set of elements at a time. Is there a simple way to do this?

It looks like I could use RDD.filter() or RDD.collect[U](f: PartialFunction[T, U]). Either of those techniques requires defining a function that filters the RDD. But the shape of the data in the RDD could be such that, for a given function (say, splitting by timestamp into hour-of-day buckets), it won't reliably split up into reasonably sized "pages". It also requires some analysis of the data to determine the filter boundaries. The point is, it's extra logic.

Any thoughts on whether there's a simpler way to page through RDD elements back at the driver?