Hello everyone,

I am trying to debug an MPI problem on our local cluster. I use Open MPI 3.0,
and the executable was compiled with PGI 10.9. The executable is a regional
air quality model called "CAMx", which is widely used in our community. Our
local setup has one node (npsx2) with 24 CPUs and 24 GB of memory, and three
nodes (npsx4, npsx5, npsx6) each with 40 CPUs and 65 GB of memory. The OS on
all the nodes is CentOS 6.5. I used the command "lstopo" to dump the CPU
topology; the output is attached below.

I can run the CAMx benchmark case using all the available CPUs across the
nodes, and the outputs match the reference benchmark outputs. The command is:

mpirun -np 72 --hostfile [mynodes.txt] [myexe]
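
In case it matters, [mynodes.txt] simply lists the four nodes with their slot
counts, roughly like this (slot counts taken from the node specs above):

npsx2 slots=24
npsx4 slots=40
npsx5 slots=40
npsx6 slots=40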

Then I moved on to my own case. CAMx supports both MPI and OpenMP to speed up
the computation. Previously our group only used OpenMP, and that works
smoothly; now I am trying to run with MPI. The strange thing is: if I assign
4 CPUs, the run completes and the results are correct, but if I assign 5 CPUs,
it gets stuck at a certain time step and idles there seemingly forever, and if
I assign 6 or more CPUs, it crashes with a segmentation fault within the first
few time steps. My case has about 5 times more total grid cells than the
benchmark case, so my first guess was a memory issue. However, I see the same
error pattern on npsx2 (less total memory) and on npsx5 (more total memory):
4 CPUs works, 5 CPUs hangs, 6 CPUs crashes.
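
For concreteness, these test runs are launched on a single node roughly like
this (no hostfile, so all ranks stay local):

mpirun -np 4 [myexe]   (completes, results correct)
mpirun -np 5 [myexe]   (hangs at a certain time step)
mpirun -np 6 [myexe]   (segmentation fault in the first few time steps)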

I looked through previous posts for hints but did not find anything
particularly insightful. I tried to debug the executable with valgrind on
node npsx5 as:

valgrind mpirun -np 6 [myexe]
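
I am not sure this is the right way to combine valgrind with mpirun; perhaps
valgrind should instead be attached to each rank, something like:

mpirun -np 6 valgrind --log-file=vg.%p.log [myexe]

so that each rank writes its own valgrind log (vg.<pid>.log). Any advice on
the proper way to do this would also be appreciated.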

It crashed; the log file is attached below. I cannot find a clue how to solve
this, so please help me troubleshoot it if you have time. Thanks for your
attention, and I look forward to your suggestions.

Best regards,
zhangrui

Attachment: log.npsx5.segmentation_fault
