I don't have a solution for you (sorry), but do note that rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you set shuffle=true then it should repartition and redistribute the data. But it uses the hash partitioner according to the ScalaDoc - I don't know of any way to supply a custom partitioner.
On Mon, Jul 14, 2014 at 4:09 PM, Ravi Pandya <r...@iecommerce.com> wrote: > I'm trying to run a job that includes an invocation of a memory & > compute-intensive multithreaded C++ program, and so I'd like to run one > task per physical node. Using rdd.coalesce(# nodes) seems to just allocate > one task per core, and so runs out of memory on the node. Is there any way > to give the scheduler a hint that the task uses lots of memory and cores so > it spreads it out more evenly? > > Thanks, > > Ravi Pandya > Microsoft Research > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io