Hi all,

We recently added job accounting to our cluster (Slurm 2.5.4/MWM 6.1.10) and have run into a situation where some jobs don't complete successfully.

I've added the following to slurm.conf:

    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/linux

...and restarted slurmd and slurmctld. I don't know if it's related, but we've also enabled accounting to MySQL via slurmdbd:

    AccountingStorageType=accounting_storage/slurmdbd

After this change, we see the controller spewing these messages:

    [2013-10-09T08:27:30-07:00] error: Malformed RPC of type 5018 received
    [2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are longer than data received
    [2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are longer than data received

These seem to correspond to messages on the nodes in /var/log/slurmd.log:

    [2013-10-09T08:33:48-07:00] [3905520] slurm_receive_msg: Zero Bytes were transmitted or received
    [2013-10-09T08:33:49-07:00] [3905520] Retrying job complete RPC for 3905520.4294967294

These messages appear to be coming from the slurmstepd processes. Slurm reports the job as running and the slurmstepd associated with the job is still running, but the associated tasks have completed.

All the nodes seem to have the correct, identical slurm.conf and are running the same version of Slurm and libslurm. I haven't been able to reproduce the problem, and it doesn't seem to affect all jobs.

Have I left something out or misconfigured the gather plugin somehow?

Thanks much,
Michael
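
P.S. In case it helps anyone compare against their own nodes, here is a rough sketch of how the affected job IDs could be pulled out of slurmd.log. It assumes every retry line looks like the two samples quoted from /var/log/slurmd.log in this message; the regex and the function name find_retrying_jobs are made up for illustration and are not part of Slurm.

```python
import re
import sys

# Assumed line layout, taken only from the sample in this message:
# [2013-10-09T08:33:49-07:00] [3905520] Retrying job complete RPC for 3905520.4294967294
RETRY_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\[\d+\]\s+Retrying job complete RPC for "
    r"(?P<jobid>\d+)\.(?P<stepid>\d+)"
)


def find_retrying_jobs(lines):
    """Return {job_id: [timestamps]} for each 'Retrying job complete RPC' line."""
    hits = {}
    for line in lines:
        match = RETRY_RE.match(line.strip())
        if match:
            hits.setdefault(int(match.group("jobid")), []).append(match.group("ts"))
    return hits


if __name__ == "__main__":
    # Default path is the one mentioned above; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/slurmd.log"
    with open(path) as fh:
        for job, stamps in sorted(find_retrying_jobs(fh).items()):
            # job id, number of retries, first and last retry timestamp
            print(job, len(stamps), stamps[0], stamps[-1])
```

Running this on each node gives a list of jobs whose completion RPC was retried, which can then be checked against the timestamps of the "Malformed RPC" errors on the controller.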
