Hi,

I have a few nodes in the cluster on which every job hangs in the
completing state, and the nodes will not return to idle.
I cannot find out why.
All nodes run the same OS (diskless image).
In the log I only see:
--
[2016-10-12T08:44:08.133] error: we don't have select plugin type 102
[2016-10-12T08:44:08.133] error: select_g_select_jobinfo_unpack: unpack error
[2016-10-12T08:44:08.133] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[2016-10-12T08:44:08.133] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2016-10-12T08:44:08.143] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
-- 

I see the first two lines on all nodes.

The cluster has ~550 nodes, and about 5-10 of them have this problem,
with almost every job.

Any ideas?

slurm version: slurm-15.08.12-1.el7.centos.x86_64
kernel version : 3.10.0-327.13.1.el7.x86_64

Thanks.

Best regards
Benedikt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Benedikt Schaefer
benedikt.schae...@emea.nec.com <mailto:benedikt.schae...@emea.nec.com>
~
~ Senior System Analyst
~
~ NEC Deutschland GmbH
~
~ HPCE Division
~
~ Raiffeisenstr.14, 70771 Leinfelden-Echterdingen, Germany
~
~ Tel:+49  711 780 55 21  Mobile: +49 152 22851542  Fax:+49 711 780 55 25
~ 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ NEC Deutschland GmbH, Prinzenallee 11, D-40549 Duesseldorf
~
~ Geschaeftsfuehrer: Yuichi Kojima
~
~ Handelsregister Duesseldorf HRB 57941; VAT ID DE129424743
~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

