My group is running a fairly large CFD code compiled with Intel Fortran 16.0.0 
and OpenMPI 1.8.4. Each night we run hundreds of simple test cases, using 
anywhere from 1 to 16 MPI processes. I have noticed that if we submit these 
jobs on our Linux cluster and give each job exclusive rights to an entire node 
or two, the jobs run fine. By exclusive rights, I mean that each job is 
launched via a PBS script that includes the following directive (each node on 
our cluster has 8 cores):

#PBS -l nodes=X:ppn=8
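
To give an idea of how these jobs are launched, the submission script looks 
roughly like the sketch below; the walltime, job name, executable, and input 
file are placeholders, not our actual names:

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=04:00:00
#PBS -N test_case
# each node has 8 cores, so nodes=2:ppn=8 gives this job 16 cores to itself
cd $PBS_O_WORKDIR
# mpirun inherits the node list from PBS (assuming OpenMPI was built with tm support)
mpirun -np 16 ./cfd_code test_case.in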

However, if we do not give each job exclusive use of whole nodes, so that jobs 
can end up sharing nodes, we occasionally get seg faults during MPI_FINALIZE. 
When a job fails, I see that each MPI process writes out the message from the 
WRITE statement below, and all processes arrive at and pass the barrier:

WRITE(LU_ERR,'(A,I4,A)') 'MPI process ',MYID,' has completed'
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
CALL MPI_FINALIZE(IERR)

But at least one MPI process then gets stuck inside MPI_FINALIZE, and the only 
diagnostic I get back is that a seg fault occurred.
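
In case it helps to reproduce this outside of the full code, a stand-alone 
program that mirrors the same shutdown sequence would look something like the 
sketch below (the program name and LU_ERR value are just placeholders; in the 
real code LU_ERR is our error-log unit):

PROGRAM FINALIZE_TEST
USE MPI
IMPLICIT NONE
INTEGER :: MYID, NPROCS, IERR
INTEGER, PARAMETER :: LU_ERR = 0   ! stand-in for the real error-log unit
CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
! (the real code does all of its work here)
WRITE(LU_ERR,'(A,I4,A)') 'MPI process ',MYID,' has completed'
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
CALL MPI_FINALIZE(IERR)
END PROGRAM FINALIZE_TEST

Running many copies of something like this, packed onto shared nodes, might 
show whether the MPI_FINALIZE failure depends on the rest of the code at all.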

I cannot pin this down any better because the failure only shows up every 
other night or so, in roughly 1 out of 100 jobs. Can anyone think of a reason 
why a job would seg fault in MPI_FINALIZE, but only under conditions where the 
jobs are packed tightly onto our cluster?
