[slurm-dev] Re: Adding JobAcctGather plugin causing RPC errors

Moe Jette Thu, 10 Oct 2013 09:12:54 -0700

Changing this configuration parameter changes the contents of the RPCs. Any previously running job steps are managed by a slurmstepd daemon that will persist through the lifetime of that job step and not change its RPCs. So you'll need to change this when there are no running job steps. I'll add a note to that effect to the documentation.
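
A minimal sketch of checking that precondition before editing slurm.conf, assuming `squeue` is on the PATH of the host where it runs. The helper names (`parse_job_ids`, `active_job_ids`, `safe_to_change_accounting`) are hypothetical and not part of Slurm.

```python
import subprocess


def parse_job_ids(squeue_output):
    """Return the job IDs found in `squeue -h -o %i` output (one ID per line)."""
    return [line.strip() for line in squeue_output.splitlines() if line.strip()]


def active_job_ids():
    """Ask squeue for jobs that may still have a slurmstepd attached."""
    # -h: no header, -t: filter by job state, -o %i: print only the job ID
    result = subprocess.run(
        ["squeue", "-h", "-t", "RUNNING,COMPLETING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    )
    return parse_job_ids(result.stdout)


def safe_to_change_accounting(job_ids):
    """True only when no job is active, so no old slurmstepd can outlive the change."""
    return len(job_ids) == 0


def main():
    jobs = active_job_ids()
    if safe_to_change_accounting(jobs):
        print("No active jobs; JobAcctGatherType can be changed now.")
    else:
        print("Active jobs, postpone the change:", ", ".join(jobs))
```

Calling `main()` on the controller host prints whether the change should wait. An empty squeue listing is only a first check; confirming on each node that no slurmstepd process is left (for example with `pgrep -x slurmstepd`) would be a sensible second one.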


Quoting Michael Gutteridge <[email protected]>:

There were active jobs, so that is likely the case.  Is there a way to
install this plugin without having jobs end up in the node fail state?  Or
would I need to wait for a time when we are clear of jobs?

Thanks much

M


On Wed, Oct 9, 2013 at 10:46 AM, Moe Jette <[email protected]> wrote:


Perhaps you have some old slurmstepd processes running (started with the
older configuration, lacking the job accounting configuration)?



Quoting Michael Gutteridge <[email protected]>:

 Hi all


We recently added job accounting to our cluster (Slurm 2.5.4/MWM 6.1.10)
and have run into a situation where some jobs don't complete successfully.

I've added the following to slurm.conf:

JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

...and restarted slurmd and slurmctld.  I don't know if it's related, but
we've also enabled accounting to mysql via slurmdbd:

AccountingStorageType=accounting_storage/slurmdbd

After this change, we see the controller spewing these messages:

[2013-10-09T08:27:30-07:00] error: Malformed RPC of type 5018 received
[2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are
longer than data received
[2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are
longer than data received

These messages seem to correspond to messages on the nodes in
/var/log/slurmd.log:

[2013-10-09T08:33:48-07:00] [3905520] slurm_receive_msg: Zero Bytes were
transmitted or received
[2013-10-09T08:33:49-07:00] [3905520] Retrying job complete RPC for
3905520.4294967294

These messages would appear to be coming from the stepds.
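
To see which job steps are stuck retrying, the slurmd.log lines above can be scanned with a short script. This is only a sketch: the function name `stuck_steps` is hypothetical, and the regular expression assumes the message wording shown in this thread. The step ID 4294967294 equals 0xfffffffe, which, as far as I know, is the value Slurm uses for the batch script step; treat that as an assumption.

```python
import re

# Assumption: 0xfffffffe is the step ID Slurm reserves for the batch script.
BATCH_STEP_ID = 0xFFFFFFFE

# \s+ tolerates the line wrap that mail clients insert before the job.step pair.
RETRY_RE = re.compile(r"Retrying job complete RPC for\s+(\d+)\.(\d+)")


def stuck_steps(log_text):
    """Return sorted, de-duplicated (job_id, step_id) pairs that are retrying."""
    found = set()
    for match in RETRY_RE.finditer(log_text):
        found.add((int(match.group(1)), int(match.group(2))))
    return sorted(found)
```

For example, passing the contents of /var/log/slurmd.log to `stuck_steps` yields the job IDs whose slurmstepd is still retrying its completion RPC, which can then be compared against the jobs started before the configuration change.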

Slurm indicates the job as running and the slurmstepd associated with the
job is still running, but the associated tasks have completed.

All the nodes seem to have the correct, identical, slurm.conf and are
running the same version of slurm and libslurm.  I haven't been able to
reproduce the problem, and it doesn't seem to impact all jobs.

Have I left something out or misconfigured the gather plugin somehow?

Thanks much

Michael



--
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
