I wrote earlier about one of my users running a third-party Fortran code on 32-bit x86 machines under OMPI 1.2.7 that is exhibiting some odd crash behavior.
Our cluster's nodes each have 2 single-core processors. If this code is run on 2 processors on 1 node, it seems to run fine. However, if the job runs on 1 processor on each of 2 nodes (e.g., mpirun --bynode), then it crashes with messages like:

[node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
[node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104

Essentially, whenever any network communication is involved, the job crashes in this way. I do have another user who runs his own MPI code on 10+ of these processors for days at a time without issue, so I don't think it's a hardware problem. The original code also runs fine across many networked nodes if the architecture is x86-64 (also running OMPI 1.2.7). We have also tried different Fortran compilers (both PathScale and gfortran) and keep getting these crashes.

Are there any suggestions on how to figure out whether the problem is in the code or in the OMPI installation/software on the system? We have tried "--debug-daemons" with no new/interesting information being revealed. Is there a way to trap segfault messages, or to get more detailed MPI transaction information, or anything else that could help diagnose this?

Thanks.

-- 
V. Ram
v_r_...@fastmail.fm

-- 
http://www.fastmail.fm - Same, same, but different...
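For reference, here is the sort of thing I had in mind for trapping the segfault: a minimal sketch, assuming a Linux/glibc system and that we can link a small C file into the Fortran build (the file name segv_trace.c, the routine name install_segv_handler, and the trailing-underscore Fortran calling convention are all just illustrative assumptions, not part of the actual code):

    /* segv_trace.c -- hypothetical helper, a sketch only.
     * Install a SIGSEGV handler that writes a backtrace to stderr and
     * then re-raises the signal, so the process still aborts (and can
     * dump core if "ulimit -c" allows it).
     * Build, e.g.:  mpicc -c segv_trace.c && mpif90 prog.f90 segv_trace.o
     */
    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void segv_handler(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);

        fprintf(stderr, "*** caught signal %d, backtrace follows ***\n", sig);
        backtrace_symbols_fd(frames, n, STDERR_FILENO);

        /* Restore the default action and re-raise so the crash is not hidden. */
        signal(sig, SIG_DFL);
        raise(sig);
    }

    /* Call once near the start of the program.  The trailing underscore
     * lets most Fortran compilers (gfortran, PathScale) call it directly
     * as:  call install_segv_handler()  -- the name-mangling convention
     * is an assumption here. */
    void install_segv_handler_(void)
    {
        struct sigaction sa;
        sa.sa_handler = segv_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGSEGV, &sa, NULL);
    }

With something like this linked in, each rank that segfaults would at least print which frame it died in before the TCP peers report their readv failures.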