I have OpenMPI running fine for a small to medium number of tasks (a simple hello or cpi program), but when I try 700 or 800 tasks it hangs, apparently on startup. I think this might be related to LDAP, since if I try to log into my account while the job is hung, I am told my username doesn't exist. However, I tried adding -debug to the mpirun command and got the same sequence of output as for successful smaller runs, until it hung again. So I added --debug-daemons and got this (with an exit, i.e. no hang):

...
[blade1:31733] [0,0,0] wrote setup file
--------------------------------------------------------------------------
The rsh launcher has been given a number of 128 concurrent daemons to
launch and is in a debug-daemons option. However, the total number of
daemons to launch (200) is greater than this value. This is a scenario that
will cause the system to deadlock.

To avoid deadlock, either increase the number of concurrent daemons, or
remove the debug-daemons flag.
--------------------------------------------------------------------------
[blade1:31733] [0,0,0] ORTE_ERROR_LOG: Fatal in file ../../../../../orte/mca/rmgr/urm/rmgr_urm.c at line 455
[blade1:31733] mpirun: spawn failed with errno=-6
[blade1:31733] sess_dir_finalize: proc session dir not empty - leaving

Any ideas or suggestions appreciated.

Todd Heywood