My group runs a fairly large CFD code compiled with Intel Fortran 16.0.0 and OpenMPI 1.8.4. Each night we run hundreds of simple test cases, using anywhere from 1 to 16 MPI processes. I have noticed that if we submit these jobs on our Linux cluster and give each job exclusive use of one or two entire nodes, the jobs run fine. By exclusive use, I mean that each job is launched via a PBS script that includes
```
#PBS -l nodes=X:ppn=8
```

because each node on our cluster has 8 cores. However, if we do not restrict each job to whole nodes, we occasionally get segmentation faults during `MPI_FINALIZE`. When a job fails, I see that every MPI process writes out the completion message below, and all processes arrive at and pass the barrier:

```fortran
WRITE(LU_ERR,'(A,I4,A)') 'MPI process ',MYID,' has completed'
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
CALL MPI_FINALIZE(IERR)
```

But at least one MPI process gets stuck in `MPI_FINALIZE`. I do not get back any error message other than that a segmentation fault occurred. I cannot pin this down any better because it happens only every other night or so, in roughly 1 out of a hundred jobs. Can anyone think of a reason why a job would seg fault in `MPI_FINALIZE`, but only under conditions where the jobs are tightly packed onto our cluster?
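For reference, here is a minimal sketch of the exclusive-node submission script. Only the `#PBS -l nodes=X:ppn=8` resource line is from our actual setup; the job name, walltime, executable name, and input file are placeholders:

```shell
#!/bin/bash
#PBS -N cfd_test_case         # hypothetical job name
#PBS -l nodes=2:ppn=8         # exclusive use of 2 whole nodes, 8 cores each
#PBS -l walltime=01:00:00     # assumed walltime; adjust per case

cd $PBS_O_WORKDIR

# 16 ranks = 2 nodes x 8 cores; PBS supplies the host list to mpirun
mpirun -np 16 ./cfd_solver test_case.in   # hypothetical executable/input names
```

When we drop the whole-node request (e.g. ask for fewer cores per node), the scheduler packs several of these jobs onto the same node, and that is when the `MPI_FINALIZE` failures appear.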