Hi all,

We are experiencing a problem with a few applications that use MPI.
When we run a job, the processes appear to run well without any errors, but when the job ends we always get errors like the following in the Slurm output.

Slurm output file:

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 25147 RUNNING AT bee002
= EXIT CODE: 2
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
srun: error: bee001: task 0: Exited with exit code 7
srun: error: _server_read: fd 19 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 2
[mpiexec@bee001] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@bee001] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@bee001] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

Title of Slurm's email upon failure:

SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED, ExitCode 255

The output of the job itself looks fine, which suggests that the run completes correctly and the problem occurs only while the job is terminating.
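To compare failures across runs, the exit codes reported in output like the above could be pulled out with a small script. This is only a minimal sketch: the function name `summarize_exit_codes` and the regular expressions are my own assumptions, written against the exact line formats shown in this post, and they are not part of Slurm or MPICH.

```python
import re

# Patterns written against the log lines quoted in this post (assumed formats).
HYDRA_PID = re.compile(r"PID (\d+) RUNNING AT (\S+)")
HYDRA_EXIT = re.compile(r"EXIT CODE: (\d+)")
SRUN_EXIT = re.compile(
    r"srun: error: (\S+): task (\d+): Exited with exit code (\d+)"
)


def summarize_exit_codes(text):
    """Return (source, node, id, exit_code) tuples found in the log text.

    For "hydra" entries the id is the PID from the banner; for "srun"
    entries it is the task number.
    """
    found = []

    # The Hydra banner reports one PID/node pair and one exit code.
    pid_match = HYDRA_PID.search(text)
    exit_match = HYDRA_EXIT.search(text)
    if pid_match and exit_match:
        found.append(
            ("hydra", pid_match.group(2),
             int(pid_match.group(1)), int(exit_match.group(1)))
        )

    # srun may report several tasks, so collect every matching line.
    for m in SRUN_EXIT.finditer(text):
        found.append(("srun", m.group(1), int(m.group(2)), int(m.group(3))))

    return found
```

Reading the Slurm output file into a string and passing it to `summarize_exit_codes` gives one tuple per reported failure, which makes it easier to see whether the same node or the same exit code shows up each time.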
Information about our system:

Applications that have the problem: FLASH, CLOUDY
MPI versions we ran those applications with: Open MPI 1.8.4, Open MPI 1.8.5, MPICH 3.1.3
Slurm version: we tried running the applications on both Slurm 14.11.8 and 14.11.3
OS: CentOS 6.5

Is anyone familiar with this kind of problem? How can we solve it? Any information regarding the issue would help us a lot.

Thanks!
