I don't have a solution for you (sorry), but do note that
rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you
set shuffle=true then it should repartition and redistribute the data. But
it uses the hash partitioner according to the ScalaDoc - I don't know of
any way to supply a custom partitioner.


On Mon, Jul 14, 2014 at 4:09 PM, Ravi Pandya <r...@iecommerce.com> wrote:

> I'm trying to run a job that includes an invocation of a memory &
> compute-intensive multithreaded C++ program, and so I'd like to run one
> task per physical node. Using rdd.coalesce(# nodes) seems to just allocate
> one task per core, and so runs out of memory on the node. Is there any way
> to give the scheduler a hint that the task uses lots of memory and cores so
> it spreads it out more evenly?
>
> Thanks,
>
> Ravi Pandya
> Microsoft Research
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io

Reply via email to