Changing this configuration parameter changes the contents of the
RPCs. Any previously running job steps are managed by a slurmstepd
daemon that will persist through the lifetime of that job step and not
change its RPCs. So you'll need to change this when there are no
running job steps. I'll add a note to that effect to the documentation.
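One way to confirm a node has no lingering job steps before changing the parameter is to look for slurmstepd processes on it. The following is only a minimal sketch, not part of Slurm: the helper names `find_stepd_pids` and `node_is_clear` are hypothetical, and it assumes the step daemon shows up in the process table with the command name `slurmstepd`.

```python
import subprocess


def find_stepd_pids(ps_output):
    """Return the PIDs of processes whose command name is slurmstepd.

    ps_output is text with one process per line: the PID, whitespace,
    then the command name (as printed by `ps -e -o pid= -o comm=`).
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        # Assumption: the step daemon's command name is exactly "slurmstepd".
        if parts[1].strip() == "slurmstepd":
            pids.append(int(parts[0]))
    return pids


def node_is_clear():
    """Return True if no slurmstepd process is running on this node."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return not find_stepd_pids(result.stdout)
```

If `node_is_clear()` returns False on any node, job steps started under the old configuration are still being managed there, so the change should wait.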
Quoting Michael Gutteridge <[email protected]>:
There were active jobs, so that is likely the case. Is there a way to
install this plugin without having jobs end up in the node fail state? Or
would I need to wait for a time when we are clear of jobs?
Thanks much
M
On Wed, Oct 9, 2013 at 10:46 AM, Moe Jette <[email protected]> wrote:
Perhaps you have some old slurmstepd processes running (started with the
older configuration, lacking the job accounting configuration)?
Quoting Michael Gutteridge <[email protected]>:
Hi all
We recently added job accounting to our cluster (Slurm 2.5.4/MWM 6.1.10)
and have run into a situation where some jobs don't complete successfully.
I've added the following to slurm.conf:
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
...and restarted slurmd and slurmctld. I don't know if it's related, but
we've also enabled accounting to mysql via slurmdbd:
AccountingStorageType=accounting_storage/slurmdbd
After this change, we see the controller spewing these messages:
[2013-10-09T08:27:30-07:00] error: Malformed RPC of type 5018 received
[2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are
longer than data received
[2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are
longer than data received
These messages seem to correspond to messages on the nodes in
/var/log/slurmd.log:
[2013-10-09T08:33:48-07:00] [3905520] slurm_receive_msg: Zero Bytes were
transmitted or received
[2013-10-09T08:33:49-07:00] [3905520] Retrying job complete RPC for
3905520.4294967294
These messages would appear to be coming from the stepds.
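To see which jobs are affected on a node, the slurmd log can be scanned for the retry message quoted above. This is a rough sketch that assumes the message format is exactly as shown in this thread; the function name `jobs_retrying_complete` is made up for illustration.

```python
import re

# Matches e.g. "Retrying job complete RPC for 3905520.4294967294".
# \s+ also tolerates a line wrap between "for" and the job.step pair.
RETRY_RE = re.compile(
    r"Retrying job complete RPC for\s+(?P<job>\d+)\.(?P<step>\d+)"
)


def jobs_retrying_complete(log_text):
    """Return the sorted, de-duplicated job IDs with a retry message."""
    jobs = {int(m.group("job")) for m in RETRY_RE.finditer(log_text)}
    return sorted(jobs)
```

Running it over the contents of /var/log/slurmd.log on each node gives the list of jobs whose step daemons are still retrying the completion RPC.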
Slurm indicates the job as running and the slurmstepd associated with the
job is still running, but the associated tasks have completed.
All the nodes seem to have the correct, identical, slurm.conf and are
running the same version of slurm and libslurm. I haven't been able to
reproduce the problem, and it doesn't seem to impact all jobs.
Have I left something out or misconfigured the gather plugin somehow?
Thanks much
Michael
--
Hey! Somebody punched the foley guy!
- Crow, MST3K ep. 508