[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-26 Thread Craig Yoshioka
Not sure if my last reply made it to the list, but for us the root problem was that slurmd was being started during boot with limits that prevented the OpenMPI InfiniBand driver from addressing the memory it needed. I believe these limits came from some PAM configuration. One
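The usual fix for this class of problem, assuming slurmd runs under systemd, is to raise the daemon's locked-memory limit so the verbs stack can register the memory it needs; a minimal sketch (the override path is illustrative):

    # /etc/systemd/system/slurmd.service.d/memlock.conf
    # Let slurmd, and the job steps it spawns, lock unlimited memory;
    # OpenMPI's openib/verbs transport needs this for RDMA registration.
    [Service]
    LimitMEMLOCK=infinity

    # Then reload and restart:
    #   systemctl daemon-reload
    #   systemctl restart slurmd

    # On init-script systems where pam_limits applies at boot, the
    # equivalent knob is in /etc/security/limits.conf:
    #   *  soft  memlock  unlimited
    #   *  hard  memlock  unlimited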

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-26 Thread Owen LaGarde
Maybe a dumb question, but one possibility is that the requesting client (the rebooted node) is failing to complete its MGID join(), so have you verified with ib_ping and checked the SM log (wherever your fabric master sits) for IB partial-join refusals? On Thu, Aug 25, 2016 at 1:04 PM, Michael
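Checking the join from the node side can be done with the standard infiniband-diags tools; a quick sketch, assuming those tools are installed and the LID shown is a placeholder:

    # Confirm the rebooted node's HCA port is Active and holds a LID.
    ibstat

    # Ask the subnet manager for MCMemberRecords (-m) to see whether
    # the node's port GID actually completed its multicast join.
    saquery -m

    # IB-layer reachability test: ibping needs a responder on the peer.
    ibping -S              # on a known-good node
    ibping <peer-lid>      # on the rebooted node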

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-26 Thread Christopher Samuel
On 26/08/16 23:03, Michael Di Domenico wrote:
> is it off by default? we're running the default openib stack in RHEL
> 6.7. I'm not even sure where to check for it being on/off; I've
> never had to specifically enable/disable UD before, I thought it was
> always the program's choice whether to
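To see what the openib BTL is actually configured to do on a given build, ompi_info can dump its parameters; a sketch, assuming an OpenMPI with the openib BTL compiled in (newer releases may need --level 9 to show every parameter):

    # List the openib BTL's MCA parameters and current values,
    # including the receive-queue (RC/UD/XRC) configuration.
    ompi_info --param btl openib

    # Pin the transport selection explicitly at run time:
    mpirun --mca btl openib,self,sm -np 4 ./a.out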

[slurm-dev] fwd_tree_thread ... failed to forward the message

2016-08-26 Thread Ulf Markwardt
Hello there, after an upgrade from 15.08.12 to 16.05.04 we get these new messages:

    slurmctld error: fwd_tree_thread: taurusi4181 failed to forward the message, expecting 51 ret got only 1

Where do they come from? How can we get rid of them? Thanks, Ulf --
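The count in that message appears to line up with the hierarchical message forwarding tree: with the default TreeWidth of 50, a reply would cover a node plus its 50 children, i.e. 51. A sketch of where to look, assuming a stock slurm.conf; taurusi4181 is the node named in the log above:

    # slurm.conf (identical on all nodes): fan-out of the message
    # forwarding tree. The default is 50, which would account for the
    # "expecting 51" (the node itself plus 50 children).
    TreeWidth=50

    # A node that cannot reach its children returns only its own
    # acknowledgement ("ret got only 1"), so check the named node:
    #   scontrol show node taurusi4181
    #   scontrol ping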

[slurm-dev] Re: setup Slurm on Ubuntu 16.04 server

2016-08-26 Thread Bancal Samuel
Hi, Yes, it works: the slurmdbd and slurmctld services have to run on the our-slurm-master node, and the slurmd service has to run on each of the compute nodes. Cheers, Samuel On 25. 08. 16 11:42, Bancal Samuel wrote: Thank you, this helped me to fix one problem: root@our-slurm-master:~#
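In concrete terms, and assuming the unit names from the Ubuntu slurm-wlm packaging, that split looks like:

    # On our-slurm-master: accounting daemon plus controller.
    systemctl enable --now slurmdbd slurmctld

    # On every compute node: the node daemon only.
    systemctl enable --now slurmd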