Hi,

I need to launch my Open MPI application on the grid.
My application is designed to run N processes, each with M threads. Without the grid, I run it as follows (say N = 7, M = 2):

% mpirun -np 7 <application name with arguments>

This works well and runs N processes. I am also able to submit it on the grid using the command below, and it works:

% qsub -pe orte 7 -l os-redhat6.7* -V -j y -b y -shell no mpirun -np 7 <application name with arguments>

However, this job allocates only N slots on the grid, when it is really consuming N*M slots. How do I submit the qsub command so that it reserves N*M slots while starting only N processes?

I tried the command below, but I get a strange error from ORTE, pasted after my signature:

% qsub -pe orte 14 -l os-redhat6.7* -V -j y -b y -shell no mpirun -np 7 <application name with arguments>

Any ideas?

Thanks,
Vipul

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:     mach12
  target node: mach24

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
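
P.S. In case it helps, here is a minimal sketch of the slot arithmetic behind the failing submission. It only prints the qsub line rather than running it. The variable name SLOTS is made up for this sketch, and N and M are the example values from above; everything else is copied from the command I tried.

```shell
#!/bin/sh
# Example values from the post: N processes, M threads per process.
N=7
M=2

# SLOTS is a hypothetical name for this sketch: the slots the job really uses.
SLOTS=$((N * M))

# Print (do not run) the submission: reserve SLOTS slots, start only N processes.
# The application placeholder is left exactly as in the post.
printf '%s\n' "qsub -pe orte $SLOTS -l os-redhat6.7* -V -j y -b y -shell no mpirun -np $N <application name with arguments>"
```

With N = 7 and M = 2 this requests 14 slots for 7 processes, which is the combination that produces the ORTE errors above.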