Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups.
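A rough sketch of what I mean (untested; the page size, the sleep-based throttle, and the push function are placeholders for your own logic):

    import org.apache.spark.rdd.RDD

    // Pushes an RDD to a downstream system in small pages at the driver.
    def pushInPages[T](rdd: RDD[T], pageSize: Int)(push: Seq[T] => Unit): Unit = {
      // toLocalIterator pulls one partition at a time back to the driver,
      // so only a single partition ever needs to fit in driver memory.
      rdd.toLocalIterator
        .grouped(pageSize)       // Scala iterator method: fixed-size chunks
        .foreach { page =>
          push(page)             // send this chunk to the downstream system
          Thread.sleep(1000L)    // crude throttle between pages
        }
    }

Note that toLocalIterator runs one Spark job per partition as you consume the iterator, so this trades some scheduling overhead for the bounded memory footprint.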
Matei

On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) wrote:

I have an RDD on the cluster that I'd like to iterate over and perform some operation on each element (push data from each element to another downstream system outside of Spark). I'd like to do this at the driver so I can throttle the rate at which I push to the downstream system (as opposed to submitting a job to the Spark cluster and parallelizing the work, which would likely flood the downstream system). The RDD is too big to collect() all at once back into the memory space of the driver.

Ideally I'd like to be able to "page" through the dataset, iterating through a chunk of n RDD elements at a time back at the driver. It doesn't have to be _exactly_ n elements, just a reasonably small set of elements at a time. Is there a simple way to do this?

It looks like I could use RDD.filter() or RDD.collect[U](f: PartialFunction[T, U]). Either of those techniques requires defining a function that filters the RDD. But the shape of the data in the RDD could be such that, for a given function (say, splitting by timestamp into hour-of-day buckets), it won't reliably split up into reasonably sized "pages". It also requires some analysis of the data to determine the filter boundaries. The point is, it's extra logic.

Any thoughts on whether there's a simpler way to page through RDD elements back at the driver?