Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Hmmm...yeah, there's a bug in there. I'm afraid you just need to give it a number instead of the asterisk for now - something large enough to meet your needs.

You could do it pretty much any way you like - what you have is fine (minus the host key problem) since you only specify one node. Since you are already telling us to spawn only one process in the MPI_Comm_spawn call itself, you don't need the "map-by" key at all - just tell us the host you want it on and we are good. In the end, it doesn't really matter - it will do the same thing.

On Nov 5, 2021, at 8:45 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

> Ralph,
>
> I changed the host name to n022:* and the problem persisted. Here is my C++ code (modified slightly; the host name is not really hard-coded as it is below). I thought I needed "ppr:1:node" to spawn a single process, but maybe that is wrong.
>
>     char info_str[64];
>     sprintf(info_str, "ppr:%d:node", 1);
>     MPI_Info_create();
>     MPI_Info_set(info, "host", "n022:*");
>     MPI_Info_set(info, "map-by", info_str);
>     MPI_Comm_spawn(manager_cmd_.c_str(), argv_, 1, info, rank_, MPI_COMM_SELF, , error_codes);
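A minimal sketch of the workaround suggested in the reply above: build the value for the "host" info key with an explicit slot count rather than the asterisk. The helper name `make_host_value` is hypothetical and not part of Open MPI; the MPI calls themselves are left out so the snippet compiles without an MPI installation. The resulting string would be what gets passed as the value in MPI_Info_set(info, "host", ...).

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not an Open MPI API): builds a "host" info value of
// the form "<node>:<slots>", for example "n022:9".
std::string make_host_value(const std::string& node, int slots) {
    if (node.empty()) {
        throw std::invalid_argument("node name must not be empty");
    }
    if (slots < 1) {
        throw std::invalid_argument("slot count must be at least 1");
    }
    return node + ":" + std::to_string(slots);
}
```

With the allocation used in this thread (ppn=9), make_host_value("n022", 9) produces "n022:9". Any count large enough for the processes to be spawned on that node should serve the same purpose.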
Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Ralph,

I changed the host name to n022:* and the problem persisted. Here is my C++ code (modified slightly; the host name is not really hard-coded as it is below). I thought I needed "ppr:1:node" to spawn a single process, but maybe that is wrong.

    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create();
    MPI_Info_set(info, "host", "n022:*");
    MPI_Info_set(info, "map-by", info_str);
    MPI_Comm_spawn(manager_cmd_.c_str(), argv_, 1, info, rank_, MPI_COMM_SELF, , error_codes);

From: users On Behalf Of Ralph Castain via users
Sent: Friday, November 5, 2021 9:50 AM
To: Open MPI Users
Cc: Ralph Castain
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

Here is the problem:

    [n022.cluster.com:30045] [[36230,0],0] using dash_host n022
    [n022.cluster.com:30045] [[36230,0],0] Removing node n022 slots 1 inuse 1
    --
    All nodes which are allocated for this job are already filled.
    --

Looks like your program is passing a "dash-host" MPI info key to the Comm_spawn request and listing host "n022". This translates into assigning only one slot to that host, which indeed has already been filled. If you want to tell OMPI to use that host with _all_ slots available, then you need to change that "dash-host" info to be "n022:*", or replace the asterisk with the number of procs you want to allow on that node.
Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Here is the problem:

    [n022.cluster.com:30045] [[36230,0],0] using dash_host n022
    [n022.cluster.com:30045] [[36230,0],0] Removing node n022 slots 1 inuse 1
    --
    All nodes which are allocated for this job are already filled.
    --

Looks like your program is passing a "dash-host" MPI info key to the Comm_spawn request and listing host "n022". This translates into assigning only one slot to that host, which indeed has already been filled. If you want to tell OMPI to use that host with _all_ slots available, then you need to change that "dash-host" info to be "n022:*", or replace the asterisk with the number of procs you want to allow on that node.

On Nov 5, 2021, at 7:37 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

> Ralph,
>
> I configured my build with --enable-debug and added "--mca rmaps_base_verbose 5" to the mpiexec command line. I have attached the job output. Thanks for being willing to look at this problem.
>
> My complete configure command is as follows:
>
>     $ ./configure --enable-shared --enable-static --with-tm=/opt/torque --enable-mpi-cxx --enable-cxx-exceptions --disable-wrapper-runpath --prefix=/opt/openmpi_pgc_tm CC=nvc CXX=nvc++ FC=pgfortran CPP=cpp CFLAGS="-O0 -tp p7-64 -c99" CXXFLAGS="-O0 -tp p7-64" FCFLAGS="-O0 -tp p7-64" --enable-debug --enable-memchecker --with-valgrind=/home/kmccall/valgrind_install
>
> The nvc++ version is "nvc++ 20.9-0 LLVM 64-bit target on x86-64 Linux -tp haswell". Our OS is CentOS 7.
>
> Here is my mpiexec command, minus all of the trailing arguments that don't affect mpiexec.
>
>     mpiexec --enable-recovery \
>         --mca rmaps_base_verbose 5 \
>         --display-allocation \
>         --merge-stderr-to-stdout \
>         --mca mpi_param_check 1 \
>         --v \
>         --x DISPLAY \
>         --map-by node \
>         -np 21 \
>         -wdir ${work_dir} …
>
> Here is my qsub command for the program "Needles".
>
>     qsub -V -j oe -e $tmpdir_stdio -o $tmpdir_stdio -f -X -N Needles -l nodes=21:ppn=9 RunNeedles.bash;
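The slot semantics described in the message above can be modelled in a few lines. This is only an illustration of the stated behaviour (a bare host name means one slot, ":*" means all slots on the node, ":N" means N slots); it is not Open MPI's actual parser, and the function name `slots_for_entry` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative model only (not Open MPI code): how many slots a host entry
// makes available, following the explanation in this thread.
//   "n022"    -> 1 slot
//   "n022:*"  -> all slots allocated on the node
//   "n022:N"  -> N slots
int slots_for_entry(const std::string& entry, int slots_on_node) {
    const std::string::size_type colon = entry.find(':');
    if (colon == std::string::npos) {
        return 1;  // bare host name: a single slot
    }
    const std::string spec = entry.substr(colon + 1);
    if (spec == "*") {
        return slots_on_node;
    }
    std::size_t consumed = 0;
    const int n = std::stoi(spec, &consumed);  // throws on non-numeric input
    if (consumed != spec.size() || n < 1) {
        throw std::invalid_argument("invalid slot count in host entry: " + entry);
    }
    return n;
}
```

Under this model a bare "n022" yields one slot, and since the manager process already occupies one slot on that node, nothing is left for the spawned process. That matches the "slots 1 inuse 1" line in the verbose output.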
Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Could you please ensure it was configured with --enable-debug and then add "--mca rmaps_base_verbose 5" to the mpirun cmd line?

On Nov 3, 2021, at 9:10 AM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

> Gilles and Ralph,
>
> I did build with --with-tm. I tried Gilles' workaround but the failure still occurred. What do I need to provide you so that you can investigate this possible bug?
>
> Thanks,
> Kurt
Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Gilles and Ralph,

I did build with --with-tm. I tried Gilles' workaround but the failure still occurred. What do I need to provide you so that you can investigate this possible bug?

Thanks,
Kurt

From: users On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 8:45 AM
To: Open MPI Users
Cc: Ralph Castain
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

Sounds like a bug to me - regardless of configuration, if the hostfile contains an entry for each slot on a node, OMPI should have added those up.
Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Sounds like a bug to me - regardless of configuration, if the hostfile contains an entry for each slot on a node, OMPI should have added those up.

On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:

> Kurt,
>
> Assuming you built Open MPI with tm support (default if tm is detected at configure time, but you can configure --with-tm to have it abort if tm support is not found), you should not need to use a hostfile.
>
> As a workaround, I would suggest you try to
>
>     mpirun --map-by node -np 21 ...
>
> Cheers,
>
> Gilles
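To make "added those up" concrete, here is a small sketch that counts how often each node name occurs in a hostfile of the kind described in this thread. It assumes one node name per line, repeated once per reserved slot; the function name `count_slots` is hypothetical, and this is not how Open MPI itself reads hostfiles.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Illustrative only: counts occurrences of each node name, assuming one
// node name per line and one line per reserved slot.
std::map<std::string, int> count_slots(std::istream& hostfile) {
    std::map<std::string, int> slots;
    std::string line;
    while (std::getline(hostfile, line)) {
        std::istringstream fields(line);
        std::string node;
        if (fields >> node) {  // blank lines produce no token and are skipped
            ++slots[node];
        }
    }
    return slots;
}
```

For a hostfile in which every node name is listed nine times (ppn=9), each entry in the returned map would have the value 9.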
Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
Kurt,

Assuming you built Open MPI with tm support (default if tm is detected at configure time, but you can configure --with-tm to have it abort if tm support is not found), you should not need to use a hostfile.

As a workaround, I would suggest you try to

    mpirun --map-by node -np 21 ...

Cheers,

Gilles

On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

> I'm using OpenMPI 4.1.1 compiled with Nvidia's nvc++ 20.9, and compiled with Torque support.
>
> I want to reserve multiple slots on each node, and then launch a single manager process on each node. The remaining slots would be filled up as the manager spawns new processes with MPI_Comm_spawn on its local node.
>
> Here is the abbreviated mpiexec command, which I assume is the source of the problem described below (?). The hostfile was created by Torque and it contains many repeated node names, one for each slot that it reserved.
>
>     $ mpiexec --hostfile MyHostFile -np 21 -npernode 1 (etc.)
>
> When MPI_Comm_spawn is called, MPI is reporting that "All nodes which are allocated for this job are already filled." They don't appear to be filled as it also reports that only one slot is in use for each node:
>
> == ALLOCATED NODES ==
> n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
> n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>
> Do you have any idea what I am doing wrong? My Torque qsub arguments are unchanged from when I successfully launched this kind of job structure under MPICH. The relevant argument to qsub is the resource list, which is "-l nodes=21:ppn=9".
[OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn
I'm using OpenMPI 4.1.1 compiled with Nvidia's nvc++ 20.9, and compiled with Torque support.

I want to reserve multiple slots on each node, and then launch a single manager process on each node. The remaining slots would be filled up as the manager spawns new processes with MPI_Comm_spawn on its local node.

Here is the abbreviated mpiexec command, which I assume is the source of the problem described below (?). The hostfile was created by Torque and it contains many repeated node names, one for each slot that it reserved.

    $ mpiexec --hostfile MyHostFile -np 21 -npernode 1 (etc.)

When MPI_Comm_spawn is called, MPI is reporting that "All nodes which are allocated for this job are already filled." They don't appear to be filled as it also reports that only one slot is in use for each node:

== ALLOCATED NODES ==
n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

Do you have any idea what I am doing wrong? My Torque qsub arguments are unchanged from when I successfully launched this kind of job structure under MPICH. The relevant argument to qsub is the resource list, which is "-l nodes=21:ppn=9".
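As a quick check of the claim that the nodes are not full, the free slots on a node can be computed from one line of the allocation listing above. This is a sketch that assumes the whitespace-separated key=value layout shown in that listing; the function names `field_value` and `free_slots` are hypothetical.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Returns the integer value of "key=value" in a whitespace-separated line.
// Matching is done on the start of each token, so "slots" does not match
// "max_slots=..." or "slots_inuse=...".
int field_value(const std::string& line, const std::string& key) {
    std::istringstream tokens(line);
    std::string token;
    const std::string prefix = key + "=";
    while (tokens >> token) {
        if (token.compare(0, prefix.size(), prefix) == 0) {
            return std::stoi(token.substr(prefix.size()));
        }
    }
    throw std::invalid_argument("field not found: " + key);
}

// Free slots on a node = slots - slots_inuse.
int free_slots(const std::string& line) {
    return field_value(line, "slots") - field_value(line, "slots_inuse");
}
```

For the line "n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP" this gives 9 - 1 = 8 free slots, and the same holds for every node in the listing.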