Thanks for the responses.

I finally fixed it, though I don't completely understand what happened. I logged into all the slave nodes and killed all Slurm services/daemons with "kill" (some old-slurm processes kept resurrecting there too). Then I returned to the head node and killed everything Slurm-related there as well. Next, on the head node, I ran /..new-slurm/slurm start, and then did the same on the slave nodes.

Now, for the first time, slurmctld is working properly, without generating lots of errors in the log file, and no old-slurm processes are running. It seems as if the old-slurm processes running on the slave nodes were triggering old-slurm accounting processes on the head node.

It has been running well for about 24 hours, so I think it is fixed.

Thanks again,
Andrew
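
For anyone following the same path, the order of operations described above can be sketched roughly as below. This is only a minimal sketch, not the exact commands that were used: the host names, the use of ssh and pkill, the daemon names passed to pkill, and the init script placeholder are all assumptions (the real script path is elided in the message and is left that way here). The function only builds the list of commands; it does not run anything.

```python
from typing import List, Tuple

# Hypothetical placeholder: the real init script path is elided in the message,
# so it must be supplied by whoever uses this sketch.
NEW_SLURM_INIT = "<path-to-new-slurm-init-script>"


def restart_plan(
    head: str, nodes: List[str], init_script: str = NEW_SLURM_INIT
) -> List[Tuple[str, str]]:
    """Return (host, command) pairs in the order described in the message:
    stop Slurm on the compute nodes, then on the head node, then start the
    new Slurm on the head node first and on the compute nodes afterwards."""
    # Assumption: pkill is available and these are the daemons to stop.
    stop = "pkill -x slurmd; pkill -x slurmctld; pkill -x slurmdbd"
    start = f"{init_script} start"

    plan = [(node, stop) for node in nodes]        # 1. stop on every compute node
    plan.append((head, stop))                      # 2. stop on the head node
    plan.append((head, start))                     # 3. start on the head node
    plan.extend((node, start) for node in nodes)   # 4. start on every compute node
    return plan


if __name__ == "__main__":
    # Hypothetical host names, printed as ssh commands for review only.
    for host, cmd in restart_plan("head", ["node001", "node002"]):
        print(f"ssh {host} '{cmd}'")
```

Printing the plan rather than executing it keeps the sketch safe to try; the point is the ordering, so that nothing from the old installation is still running when the new slurmctld comes up.
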
On Fri, Jun 5, 2015 at 3:12 AM, Loris Bennett <[email protected]> wrote:
>
> Christopher Samuel <[email protected]> writes:
>
> > On 05/06/15 16:22, Loris Bennett wrote:
> >
> >> 'scontrol' and 'squeue' are not daemons, so they must be being run
> >> directly.
> >
> > It really sounds like they're being run from a cron job somewhere, do
> > you have some sort of health check or accounting checks that are getting
> > done regularly with the old paths hardcoded in them?
> >
> > What users are they running as?
> >
> > All the best,
> > Chris
>
> Bright Cluster Manager has some integration with the queuing system, so
> it could well be running 'scontrol' and 'squeue' periodically. You
> should have the tool 'wlm-setup', which enables you to choose the
> workload manager. However, I can't see a way to adjust the path.
>
> I'm very interested in how this pans out as we are also planning to move
> from our current, old version of Slurm to a more recent version. In
> this context we would also want to change the path for Slurm.
>
> Cheers,
>
> Loris
>
> --
> This signature is currently under construction.
>
