It looks like we've got some kind of mismatch on our cluster, similar to the recent CG problem report?
Nothing is reported by slurmctld; but following any job, the errant compute nodes report:

[2013-07-23T17:10:17.148] error: Malformed RPC of type 5016 received
[2013-07-23T17:10:17.148] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-07-23T17:10:17.152] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-07-23T17:10:17.158] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-07-23T17:11:13.131] [294.0] Abandoning IO 60 secs after job shutdown initiated

(Lots of the "Malformed RPC 5016" messages appear per job.)

Any thoughts?

Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
