Not sure if my last reply made it to the list, but for us the root problem was
that slurmd was being started during boot with limits that prevented the
Open MPI InfiniBand driver from addressing the memory it needed. I
believe these limits came from some PAM setting. One
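
For anyone who wants to check whether they are hitting the same thing, here is a minimal sketch in Python (standard library only). It assumes the limit in question is the locked-memory limit (RLIMIT_MEMLOCK), which the message above does not actually name, and the helper names are made up for illustration. It prints the limits of the current process and can parse the text of /proc/<pid>/limits, which is where you would look to see what a running slurmd was started with.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def parse_proc_limits(text, name="Max locked memory"):
    """Return (soft, hard) as strings for one row of /proc/<pid>/limits text.

    Returns None if the row is not present.
    """
    for line in text.splitlines():
        if line.startswith(name):
            fields = line[len(name):].split()
            return fields[0], fields[1]
    return None


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("memlock soft:", format_limit(soft), "hard:", format_limit(hard))
```

To inspect the daemon rather than your own shell, read /proc/<pid>/limits for the slurmd PID (as root if needed) and pass its contents to parse_proc_limits().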
Maybe a dumb question, but one possibility is that the requesting client
(the rebooted node) is failing to complete its MGID join(), so have you
verified with ibping and checked the SM log (wherever your fabric master
sits) for IB partial join refusals?
On Thu, Aug 25, 2016 at 1:04 PM, Michael
On 26/08/16 23:03, Michael Di Domenico wrote:
> is it off by default? we're running the default openib stack in rhel
> 6.7. i'm not even sure where to check for it being on/off, i've
> never had to specifically enable/disable UD before, i thought it was
> always the programs choice whether to
Hello there,
after an upgrade from 15.08.12 to 16.05.04 we get these new messages:
slurmctld error: fwd_tree_thread: taurusi4181 failed to forward the
message, expecting 51 ret got only 1
Where do they come from? - How can we get rid of them?
Thanks,
Ulf
--
Hi,
Yes, it works. The slurmdbd and slurmctld services have to run on the
our-slurm-master node, and the slurmd service has to run on each of the nodes.
Cheers,
Samuel
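
The layout described above can be written down as a small lookup. This is only a sketch: the function name and the node name "node01" are hypothetical, and only "our-slurm-master" comes from the thread.

```python
def expected_daemons(hostname, master="our-slurm-master"):
    """Return the Slurm daemons expected on a host under the layout above.

    The master runs slurmdbd and slurmctld; every other node runs slurmd.
    """
    if hostname == master:
        return ["slurmdbd", "slurmctld"]
    return ["slurmd"]


if __name__ == "__main__":
    # "node01" is a made-up compute node name used for illustration.
    for host in ("our-slurm-master", "node01"):
        print(host, "->", ", ".join(expected_daemons(host)))
```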
On 25. 08. 16 11:42, Bancal Samuel wrote:
Thank you, this helped me to fix one problem:
root@our-slurm-master:~#