I think a cheap way to repartition to a higher partition count without a
shuffle would be valuable too. Right now you can choose whether to execute
a shuffle when going down in partition count, but going up in partition
count always requires a shuffle, even when all you need is smaller
partitions.
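The shuffle-free split being asked for here can be illustrated outside Spark. The sketch below is plain Python with a made-up `split_partitions` helper, not anything in Spark's API: each new, smaller partition is derived from exactly one parent partition, so no data would need to move between executors.

```python
def split_partitions(partitions, k):
    """Turn each partition into k smaller slices, preserving element
    order. Every output slice comes from a single parent partition,
    which is what makes the operation shuffle-free."""
    result = []
    for part in partitions:
        size = len(part)
        # Slice boundaries chosen so the k pieces differ by at most one element.
        bounds = [round(i * size / k) for i in range(k + 1)]
        result.extend(part[bounds[i]:bounds[i + 1]] for i in range(k))
    return result

# Two partitions of 10 and 7 elements become six smaller ones.
smaller = split_partitions([list(range(10)), list(range(10, 17))], 3)
```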
This would be pretty tricky to do -- the issue is that right now
sparkContext.runJob has you pass in a function from a partition to *one*
result object that gets serialized and sent back: Iterator[T] => U, and
that idea is baked pretty deeply into a lot of the internals: DAGScheduler,
Task,
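The constraint being described, that each partition collapses to exactly one serialized result object, can be mimicked in a few lines. This is a toy Python analogue for illustration only (the `run_job` name is made up; it is not Spark's actual implementation):

```python
def run_job(partitions, func):
    """Toy analogue of sparkContext.runJob: `func` maps a partition
    iterator to ONE result value, and only that single value per
    partition travels back to the 'driver'."""
    return [func(iter(part)) for part in partitions]

# Each partition collapses to a single result object.
sums = run_job([[1, 2, 3], [4, 5]], lambda it: sum(it))
```

Streaming many result objects per partition back to the driver would break this one-value-per-partition contract, which is why it reaches so deep into the scheduler internals.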
Hi Spark devs,
I'm building streaming export functionality for RDDs and am having some
trouble with large partitions. The RDD.toLocalIterator() call pulls one
partition at a time over to the driver, then streams that partition out
before pulling in the next one. When you have
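The partition-at-a-time behaviour described above can be modelled in a few lines of plain Python (a sketch of the semantics, not PySpark's implementation):

```python
def to_local_iterator(partitions):
    """Toy model of RDD.toLocalIterator(): materialize ONE partition at
    a time (list(...) stands in for the network fetch to the driver),
    then stream its rows before touching the next partition."""
    for part in partitions:
        fetched = list(part)  # the whole partition lands in driver memory here
        for row in fetched:
            yield row

# Rows come out in order, but peak memory is bounded by the largest partition.
rows = list(to_local_iterator([[1, 2], [3, 4, 5]]))
```

The trouble with large partitions follows directly: the `fetched = list(part)` step is all-or-nothing, so one oversized partition can exhaust driver memory even though the rest of the export streams fine.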
Another alternative would be to compress the partition in memory in a
streaming fashion instead of calling .toArray on the iterator. Would it be
an easier mitigation for the problem? Or is it hard to compress the rows
one by one, without materializing the full partition in memory, using the
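Incremental compression of this kind is straightforward with a stateful compressor object. The sketch below uses Python's zlib as a stand-in (Spark's own serializers and compression codecs differ; this only illustrates the row-by-row idea): each row is fed to the compressor as it arrives, so the uncompressed partition is never held in memory at once, only the much smaller compressed output accumulates.

```python
import zlib

def compress_rows(rows, level=6):
    """Compress an iterator of string rows incrementally. Each row is
    pushed through zlib's streaming compressor one at a time; nothing
    like .toArray is ever called on the full iterator."""
    comp = zlib.compressobj(level)
    chunks = []
    for row in rows:
        chunks.append(comp.compress(row.encode("utf-8") + b"\n"))
    chunks.append(comp.flush())  # emit any buffered compressed bytes
    return b"".join(chunks)

# A generator feeds rows lazily; only the compressed bytes accumulate.
data = compress_rows(f"row-{i}" for i in range(1000))
restored = zlib.decompress(data).decode("utf-8").splitlines()
```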