There were active jobs, so that is likely the case. Is there a way to install this plugin without having jobs end up in the node fail state? Or would I need to wait for a time when we are clear of jobs?

Thanks much

M

On Wed, Oct 9, 2013 at 10:46 AM, Moe Jette <[email protected]> wrote:

> Perhaps you have some old slurmstepd processes running (started with the
> older configuration, lacking the job accounting configuration)?
>
> Quoting Michael Gutteridge <[email protected]>:
>
>> Hi all
>>
>> We recently added job accounting to our cluster (Slurm 2.5.4/MWM 6.1.10)
>> and have run into a situation where some jobs don't complete successfully.
>>
>> I've added the following to slurm.conf:
>>
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>>
>> ...and restarted slurmd and slurmctld. I don't know if it's related, but
>> we've also enabled accounting to mysql via slurmdbd:
>>
>> AccountingStorageType=accounting_storage/slurmdbd
>>
>> After this change, we see the controller spewing these messages:
>>
>> [2013-10-09T08:27:30-07:00] error: Malformed RPC of type 5018 received
>> [2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are longer than data received
>> [2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are longer than data received
>>
>> These messages seem to correspond to messages on the nodes in
>> /var/log/slurmd.log:
>>
>> [2013-10-09T08:33:48-07:00] [3905520] slurm_receive_msg: Zero Bytes were transmitted or received
>> [2013-10-09T08:33:49-07:00] [3905520] Retrying job complete RPC for 3905520.4294967294
>>
>> These messages would appear to be coming from the stepds.
>>
>> Slurm indicates the job as running and the slurmstepd associated with the
>> job is still running, but the associated tasks have completed.
>>
>> All the nodes seem to have the correct, identical, slurm.conf and are
>> running the same version of slurm and libslurm. I haven't been able to
>> reproduce the problem, and it doesn't seem to impact all jobs.
>>
>> Have I left something out or misconfigured the gather plugin somehow?
>>
>> Thanks much
>>
>> Michael

--
Hey! Somebody punched the foley guy!
- Crow, MST3K ep. 508
