There were active jobs, so that is likely the case.  Is there a way to
install this plugin without having jobs end up in the node fail state?  Or
would I need to wait for a time when we are clear of jobs?
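
For checking whether any of the old slurmstepd processes mentioned below
are still around, here is a rough sketch rather than a tested tool.  It
assumes a procps-style `ps` that supports the `etimes` (elapsed seconds)
column and that the daemons show up with the command names `slurmd` and
`slurmstepd`; the function names are hypothetical.  It lists step daemons
that were started before the currently running slurmd:

```python
import subprocess


def read_ps():
    """Return 'pid etimes comm' lines for all processes (assumes procps ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def stale_stepds(ps_output):
    """Return PIDs of slurmstepd processes older than the newest slurmd."""
    slurmd_age = None
    stepds = []
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 2)
        # Skip anything that is not 'pid elapsed_seconds command'.
        if len(fields) < 3:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, age, comm = int(fields[0]), int(fields[1]), fields[2].strip()
        if comm == "slurmd":
            # Smallest elapsed time = most recently started slurmd.
            slurmd_age = age if slurmd_age is None else min(slurmd_age, age)
        elif comm.startswith("slurmstepd"):
            stepds.append((pid, age))
    if slurmd_age is None:
        return []  # no slurmd found, nothing to compare against
    # A stepd that has been running longer than slurmd predates the restart.
    return [pid for pid, age in stepds if age > slurmd_age]
```

On a compute node this would be run as `print(stale_stepds(read_ps()))`.
An empty list only means that no step daemon predates the slurmd restart;
it does not by itself confirm that this is the cause of the errors.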

Thanks much

M


On Wed, Oct 9, 2013 at 10:46 AM, Moe Jette <[email protected]> wrote:

>
> Perhaps you have some old slurmstepd processes running (started with the
> older configuration, lacking the job accounting configuration)?
>
>
>
> Quoting Michael Gutteridge <[email protected]>:
>
>> Hi all
>>
>> We recently added job accounting to our cluster (Slurm 2.5.4/MWM 6.1.10)
>> and have run into a situation where some jobs don't complete successfully.
>>
>> I've added the following to slurm.conf:
>>
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>>
>> ...and restarted slurmd and slurmctld.  I don't know if it's related, but
>> we've also enabled accounting to mysql via slurmdbd:
>>
>> AccountingStorageType=accounting_storage/slurmdbd
>>
>> After this change, we see the controller spewing these messages:
>>
>> [2013-10-09T08:27:30-07:00] error: Malformed RPC of type 5018 received
>> [2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are
>> longer than data received
>> [2013-10-09T08:27:30-07:00] error: slurm_receive_msg: Header lengths are
>> longer than data received
>>
>> These messages seem to correspond to messages on the nodes in
>> /var/log/slurmd.log:
>>
>> [2013-10-09T08:33:48-07:00] [3905520] slurm_receive_msg: Zero Bytes were
>> transmitted or received
>> [2013-10-09T08:33:49-07:00] [3905520] Retrying job complete RPC for
>> 3905520.4294967294
>>
>> These messages would appear to be coming from the stepds.
>>
>> Slurm indicates the job as running and the slurmstepd associated with the
>> job is still running, but the associated tasks have completed.
>>
>> All the nodes seem to have the correct, identical, slurm.conf and are
>> running the same version of slurm and libslurm.  I haven't been able to
>> reproduce the problem, and it doesn't seem to impact all jobs.
>>
>> Have I left something out or misconfigured the gather plugin somehow?
>>
>> Thanks much
>>
>> Michael
>>
>>
>


-- 
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
