[OMPI users] Non-root install; jobs hang when running on multiple nodes
Hi, I installed Open MPI 1.4.1 as a non-root user on a cluster. Everything is fine when I run with mpirun or mpiexec on a single node, even with many processes. However, when I launch many processes across multiple nodes, I can see the jobs being distributed to those nodes (using "top"), but they all just hang there and never finish. I think the nodes use TCP to communicate with each other. This cluster also provides MPICH2, which was configured by the sys admin and has no problem with node-to-node communication. I have also read in some posts that this may be caused by a TCP firewall. Since I don't have root rights, I don't know what I should ask the admin to do to fix this problem. So, can you tell me how to fix it, either as the admin (root) or as a non-root user (if possible)? Thank you very much. Hao
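From those posts, my (possibly wrong) understanding is that one workaround on the user side is to pin Open MPI's TCP traffic to a fixed port range and a known interface through MCA parameters, so the admin only has to open that range between the compute nodes. A rough sketch of what I mean is below; the port numbers, the interface name eth0, and the hostfile/program names are only placeholders, and the parameter names should be double-checked against the 1.4.1 install with ompi_info:

# check which TCP port parameters this build actually exposes
$ ompi_info --param btl tcp | grep -i port

# keep the TCP BTL inside a known port range (here 10000-10099) that the
# admin could then open in the firewall between the compute nodes
$ mpirun --hostfile myhosts -np 16 \
    -mca btl_tcp_port_min_v4 10000 \
    -mca btl_tcp_port_range_v4 100 \
    ./my_mpi_app

# if the nodes have several interfaces, restricting Open MPI to the one that
# is actually reachable between nodes can also avoid hangs (eth0 is only an example)
$ mpirun --hostfile myhosts -np 16 -mca btl_tcp_if_include eth0 ./my_mpi_app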
[OMPI users] Meaning and the significance of MCA parameter "opal_cr_use_thread"
The description for the MCA parameter "opal_cr_use_thread" is very short at this URL: http://osl.iu.edu/research/ft/ompi-cr/api.php . Can someone explain the usefulness of enabling this parameter versus disabling it? In other words, what are the pros and cons of disabling it? I found that it gets enabled automatically when the Open MPI library is configured with the -ft-enable-threads option. Thanks, Ananda
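For context, this is how I have been inspecting and toggling the parameter myself; the application name is just a placeholder, and I am not sure these are the recommended settings:

# show the registered description and current value of the parameter
$ ompi_info --all | grep -i cr_use_thread

# disable (or enable) it for a single run without reconfiguring the library
$ mpirun -np 4 -mca opal_cr_use_thread 0 ./my_checkpointable_app

# or set it persistently for this user
$ echo "opal_cr_use_thread = 0" >> ~/.openmpi/mca-params.conf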
[OMPI users] strange problem with OpenMPI + rankfile + Intel compiler 11.0.074 + centos/fedora-12
Intel compiler 11.0.074, OpenMPI 1.4.1.
Two different OSes: CentOS 5.4 (2.6.18 kernel) and Fedora 12 (2.6.32 kernel).
Two different CPUs: Opteron 248 and Opteron 8356.
Same binary for OpenMPI. Same binary for the user code (vasp compiled for the older arch).

When I supply a rankfile, the result depends on the combination of OS and CPU:

centos + Opt8356 : works
centos + Opt248  : works
fedora + Opt8356 : works
fedora + Opt248  : fails

The rankfile (in the Opt248 case) is:

rank 0=node014 slot=1
rank 1=node014 slot=0

I tried playing with the format and leaving only one slot (and starting one process); it doesn't change the result. Without a rankfile it works on all combos.

Just in case: all of this happens inside a cpuset which always wraps all slots given in the rankfile (I use torque with cpusets and my custom patch for torque which also creates a rankfile for openmpi; this way MPI tasks are bound to particular cores and multithreaded codes are limited by the given cpuset). AFAIR, it also works without problems on both hardware setups with 1.3.x/1.4.0 and the 2.6.30 kernel from OpenSuSE 11.1. Strangely, when I run the OSU benchmarks (osu_bw etc.), they work without any problems.

And finally, two error logs (starting 1 and 2 processes):

$ mpirun -mca paffinity_base_verbose 8 -np 1 vasp
[node014:26373] mca:base:select:(paffinity) Querying component [linux]
[node014:26373] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node014:26373] mca:base:select:(paffinity) Selected component [linux]
[node014:26373] paffinity slot assignment: slot_list == 1
[node014:26373] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26374] mca:base:select:(paffinity) Querying component [linux]
[node014:26374] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node014:26374] mca:base:select:(paffinity) Selected component [linux]
[node014:26374] paffinity slot assignment: slot_list == 1
[node014:26374] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26374] *** An error occurred in MPI_Comm_rank
[node014:26374] *** on a NULL communicator
[node014:26374] *** Unknown error
[node014:26374] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine   Line      Source
libmpi.so.0        2ACC26BB36C3      Unknown   Unknown   Unknown
libmpi.so.0        2ACC26BA0EB8      Unknown   Unknown   Unknown
libmpi.so.0        2ACC26BA0B4B      Unknown   Unknown   Unknown
libmpi.so.0        2ACC26BCF77E      Unknown   Unknown   Unknown
libmpi_f77.so.0    2ACC269528FB      Unknown   Unknown   Unknown
vasp               0046FE66          Unknown   Unknown   Unknown
vasp               00486102          Unknown   Unknown   Unknown
vasp               0042A1AB          Unknown   Unknown   Unknown
vasp               0042A02C          Unknown   Unknown   Unknown
libc.so.6          00364DE1EB1D      Unknown   Unknown   Unknown
vasp               00429F29          Unknown   Unknown   Unknown
--
mpirun has exited due to process rank 0 with PID 26374 on node node014 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--

$ mpirun -mca paffinity_base_verbose 8 -np 2 vasp
[node014:26402] mca:base:select:(paffinity) Querying component [linux]
[node014:26402] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node014:26402] mca:base:select:(paffinity) Selected component [linux]
[node014:26402] paffinity slot assignment: slot_list == 1
[node014:26402] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26402] paffinity slot assignment: slot_list == 0
[node014:26402] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node014:26403] mca:base:select:(paffinity) Querying component [linux]
[node014:26403] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node014:26403] mca:base:select:(paffinity) Selected component [linux]
[node014:26404] mca:base:select:(paffinity) Querying component [linux]
[node014:26404] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node014:26404] mca:base:select:(paffinity) Selected component [linux]
[node014:26403] paffinity slot assignment: slot_list == 1
[node014:26403] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26403] *** An error occurred in MPI_Comm_rank
[node014:26403] *** on a NULL communicator
[node014:26403] *** Unknown error
[node014:26403] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node014:26404] paffinity slot assignment: slot_list == 0
[node014:26404] paffinity slot assignment: rank 1 runs on
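In case someone wants to try reproducing this without vasp, the same failure mode might show up with any trivial MPI program run under the same rankfile and verbose paffinity settings, which would show whether the problem follows the rankfile/paffinity machinery or the application binary. This is only a sketch; "myrankfile" and "hello_c" (the hello-world example from Open MPI's examples/ directory) stand in for the real files:

$ cat myrankfile
rank 0=node014 slot=1
rank 1=node014 slot=0
$ mpirun -np 2 -rf myrankfile -mca paffinity_base_verbose 8 ./hello_c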