Good morning,

We have a cluster with two kinds of InfiniBand cards: ConnectX-4 and
ConnectX-6.
Open MPI 3.1.3 works fine, but when we began using the ConnectX-6 cards we
switched to Open MPI 4.0.3 (which supports ConnectX-6). With it, programs that
run in several parts, first a call to a sequential program and inside it a
call to a parallel program, … (in our case the program is WRF, but we have
others like it with the same problem), suddenly stop making progress:

…..
0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?       00:05:33 real.exe
0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?       00:05:28 real.exe
……
The processes show WCHAN=hrtime. They look as if they are running, but in
reality they are not doing any work.

We don't know whether this could be a problem with Slurm and this version of
Open MPI… Any ideas?


________________________________________________

Angelines Alberto Morillas

Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537

skype: angelines.alberto

CIEMAT
Avenida Complutense, 40
28040 MADRID
________________________________________________

