Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-24 Thread Andrew Ash
I think a cheap way to repartition to a higher partition count without a shuffle would be valuable too. Right now you can choose whether to execute a shuffle when going down in partition count, but going up in partition count always requires a shuffle. For the need of having smaller partitions
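A rough sketch of the asymmetry being described, against the standard RDD API (coalesce/repartition); the local SparkContext setup is only for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc  = new SparkContext(new SparkConf().setAppName("repartition-sketch").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000000, 8)

    // Going down in partition count: the shuffle is optional.
    val fewerNoShuffle = rdd.coalesce(2, shuffle = false)   // merges partitions locally, no shuffle
    val fewerShuffled  = rdd.coalesce(2, shuffle = true)

    // Going up in partition count: coalesce without a shuffle is a no-op,
    // so today a full shuffle is the only way to raise the count.
    val stillEight     = rdd.coalesce(32, shuffle = false)  // stays at 8 partitions
    val thirtyTwo      = rdd.repartition(32)                // same as coalesce(32, shuffle = true)

    println(Seq(fewerNoShuffle, fewerShuffled, stillEight, thirtyTwo)
      .map(_.partitions.length).mkString(", "))             // 2, 2, 8, 32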

Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Imran Rashid
This would be pretty tricky to do -- the issue is that right now sparkContext.runJob has you pass in a function from a partition to *one* result object that gets serialized and sent back: Iterator[T] => U, and that idea is baked pretty deep into a lot of the internals, DAGScheduler, Task,
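For reference, a sketch of the shape being described, using the public SparkContext API: the per-partition function consumes an Iterator[T] and must hand back a single result object per partition.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // runJob gives each task one partition's Iterator[T] and expects a single
    // result U back, which is serialized and shipped to the driver:
    //   def runJob[T, U](rdd: RDD[T], func: Iterator[T] => U): Array[U]
    def countPerPartition(sc: SparkContext, rdd: RDD[Int]): Array[Int] =
      sc.runJob(rdd, (it: Iterator[Int]) => it.length)   // one Int per partition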

Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Andrew Ash
Hi Spark devs, I'm creating a streaming export functionality for RDDs and am having some trouble with large partitions. The RDD.toLocalIterator() call pulls over a partition at a time to the driver, and then streams the RDD out from that partition before pulling in the next one. When you have
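A minimal sketch of the behavior being described (not the actual implementation, and using a recent three-argument runJob overload): one job per partition, with that whole partition materialized on the driver before the next one is fetched.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Each runJob call collects exactly one partition to the driver, so the
    // driver must hold a full partition in memory at a time.
    def localIterator[T: ClassTag](rdd: RDD[T]): Iterator[T] =
      rdd.partitions.indices.iterator.flatMap { i =>
        rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => it.toArray, Seq(i)).head.iterator
      }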

Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Mingyu Kim
Another alternative would be to compress the partition in memory in a streaming fashion instead of calling .toArray on the iterator. Would that be an easier mitigation for the problem? Or, is it hard to compress the rows one by one without materializing the full partition in memory using the
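A rough sketch of that idea, assuming rows can be serialized one at a time (here naively as UTF-8 lines through a GZIPOutputStream): the compressed bytes grow incrementally and the uncompressed partition is never held via .toArray.

    import java.io.ByteArrayOutputStream
    import java.util.zip.GZIPOutputStream

    // Compress a partition row by row; only the compressed bytes are returned
    // to the driver, never a materialized Array of rows.
    def compressPartition(rows: Iterator[String]): Array[Byte] = {
      val buf  = new ByteArrayOutputStream()
      val gzip = new GZIPOutputStream(buf)
      rows.foreach { row =>
        gzip.write(row.getBytes("UTF-8"))
        gzip.write('\n')
      }
      gzip.close()          // flushes and writes the GZIP trailer
      buf.toByteArray
    }

    // e.g. sc.runJob(rdd, (it: Iterator[String]) => compressPartition(it))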