It looks like either the slurm.conf files are inconsistent across the nodes, or incompatible versions of the Slurm daemons and/or commands are running.
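
If it helps, here is a minimal sketch of one way to check for both kinds of mismatch once you have gathered, for each node, the version string (for example the output of `slurmd -V`) and the text of its slurm.conf. How you collect that data is up to you. The function names, the node names and the sample values are all made up for illustration, and treating everything after a `#` as a comment is a simplifying assumption.

```python
import hashlib
from collections import Counter


def conf_digest(conf_text):
    """Hash a slurm.conf body, ignoring blank lines and '#' comments.

    Assumption: '#' always starts a comment, which is a simplification.
    """
    lines = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def find_mismatches(reports):
    """Return {node: [reasons]} for nodes that differ from the majority.

    reports maps a node name to (version_string, conf_text).
    """
    versions = Counter(v for v, _ in reports.values())
    digests = Counter(conf_digest(c) for _, c in reports.values())
    common_version = versions.most_common(1)[0][0]
    common_digest = digests.most_common(1)[0][0]

    bad = {}
    for node, (version, conf) in reports.items():
        reasons = []
        if version != common_version:
            reasons.append(f"version {version!r} != {common_version!r}")
        if conf_digest(conf) != common_digest:
            reasons.append("slurm.conf differs from the majority")
        if reasons:
            bad[node] = reasons
    return bad


if __name__ == "__main__":
    # Hypothetical data: node names, versions and config lines are invented.
    sample = {
        "node01": ("slurm 2.6.0", "ControlMachine=head\n"),
        "node02": ("slurm 2.6.0", "ControlMachine=head\n# local note\n"),
        "node03": ("slurm 2.5.7", "ControlMachine=oldhead\n"),
    }
    for node, reasons in find_mismatches(sample).items():
        print(node, "; ".join(reasons))
```

Comparing against the majority is only a heuristic; the controller's own version and slurm.conf are what the compute nodes actually need to agree with.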

Quoting Andy Riebs <[email protected]>:

> It looks like we've got some kind of mismatch on our cluster, similar to
> the recent CG problem report?
>
> Nothing is reported by slurmctld; but following any job, the errant
> compute nodes report:
>
> [2013-07-23T17:10:17.148] error: Malformed RPC of type 5016 received
> [2013-07-23T17:10:17.148] error: slurm_receive_msg_and_forward: Header
> lengths are longer than data received
> [2013-07-23T17:10:17.152] error: service_connection: slurm_receive_msg:
> Header lengths are longer than data received
> [2013-07-23T17:10:17.158] error: service_connection: slurm_receive_msg:
> Header lengths are longer than data received
> [2013-07-23T17:11:13.131] [294.0] Abandoning IO 60 secs after job
> shutdown initiated
>
> (Lots of the "Malformed RPC 5016" messages appear per job.)
>
> Any thoughts?
> Andy
>
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1 404 648 9024
> My opinions are not necessarily those of HP
