Re: [OMPI users] MPI_Init
On Aug 28, 2012, at 6:47 PM, Tony Raymond wrote:

> Hi Ralph,
>
> Thanks for taking care of this so quickly!
>
> Does this mean that MPI_Init will leave the SIGCHLD handler alone?

Yes

> Should it be fine to set the handler as I did in the current version of MPI?

Yes - no harm done either way, but we shouldn't be messing with the handler (and didn't realize we were).

> Thanks,
> Tony
>
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
> Sent: Tuesday, August 28, 2012 2:40 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Init
>
> Okay, I fixed this on our trunk - I'll post it for transfer to the 1.7 and 1.6 series in their next releases.
>
> Thanks!
>
> On Aug 28, 2012, at 2:27 PM, Ralph Castain wrote:
>
>> Oh crud - yes we do. Checking on it...
>>
>> On Aug 28, 2012, at 2:23 PM, Ralph Castain wrote:
>>
>>> Glancing at the code, I don't see anywhere that we trap SIGCHLD outside of mpirun and the orte daemons - certainly not inside an MPI app. What version of OMPI are you using?
>>>
>>> On Aug 28, 2012, at 2:06 PM, Tony Raymond wrote:
>>>
>>>> Hi,
>>>>
>>>> I have an application that uses openMPI and creates some child processes using fork(). I've been trying to catch SIGCHLD in order to check the exit status of these processes so that the program will exit if a child errors out.
>>>>
>>>> I've found out that if I set the SIGCHLD handler before calling MPI_Init, MPI_Init sets the SIGCHLD handler so that my application appears to ignore SIGCHLD, but if I set my handler after MPI_Init, the application handles SIGCHLD appropriately.
>>>>
>>>> I'm wondering if there are any problems that could come up by changing the SIGCHLD handler, and why MPI_Init modifies the SIGCHLD handler in the first place.
>>>> Thanks,
>>>> Tony
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Thanks!

On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain wrote:
> Yeah, I'm seeing the hang as well when running across multiple machines. Let me dig a little and get this fixed.
>
> Thanks
> Ralph
>
> On Aug 28, 2012, at 4:51 PM, Brian Budge wrote:
>
>> Hmmm, I went to the build directories of openmpi for my two machines, went into the orte/test/mpi directory and made the executables on both machines. I set the hostsfile in the env variable on the "master" machine.
>>
>> Here's the output:
>>
>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile ./simple_spawn
>> Parent [pid 97504] starting up!
>> 0 completed MPI_Init
>> Parent [pid 97504] about to spawn!
>> Parent [pid 97507] starting up!
>> Parent [pid 97508] starting up!
>> Parent [pid 30626] starting up!
>> ^C
>> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
>>
>> I had to ^C to kill the hung process.
>>
>> When I run using mpirun:
>>
>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile mpirun -np 1 ./simple_spawn
>> Parent [pid 97511] starting up!
>> 0 completed MPI_Init
>> Parent [pid 97511] about to spawn!
>> Parent [pid 97513] starting up!
>> Parent [pid 30762] starting up!
>> Parent [pid 30764] starting up!
>> Parent done with spawn
>> Parent sending message to child
>> 1 completed MPI_Init
>> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
>> 0 completed MPI_Init
>> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
>> 2 completed MPI_Init
>> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
>> Child 1 disconnected
>> Child 0 received msg: 38
>> Child 0 disconnected
>> Parent disconnected
>> Child 2 disconnected
>> 97511: exiting
>> 97513: exiting
>> 30762: exiting
>> 30764: exiting
>>
>> As you can see, I'm using openmpi v 1.6.1. I just barely freshly installed on both machines using the default configure options.
>>
>> Thanks for all your help.
>>
>> Brian
>>
>> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain wrote:
>>> Looks to me like it didn't find your executable - could be a question of where it exists relative to where you are running. If you look in your OMPI source tree at the orte/test/mpi directory, you'll see an example program "simple_spawn.c" there. Just "make simple_spawn" and execute that with your default hostfile set - does it work okay?
>>>
>>> It works fine for me, hence the question.
>>>
>>> Also, what OMPI version are you using?
>>>
>>> On Aug 28, 2012, at 4:25 PM, Brian Budge wrote:
>>>
>>>> I see. Okay. So, I just tried removing the check for universe size, and set the universe size to 2. Here's my output:
>>>>
>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file base/plm_base_receive.c at line 253
>>>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified application failed to start in file dpm_orte.c at line 785
>>>>
>>>> The corresponding run with mpirun still works.
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain wrote:
>>>>> I see the issue - it's here:
>>>>>
>>>>>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>
>>>>>> if(!flag) {
>>>>>>   std::cerr << "no universe size" << std::endl;
>>>>>>   return -1;
>>>>>> }
>>>>>> universeSize = *puniverseSize;
>>>>>> if(universeSize == 1) {
>>>>>>   std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>> }
>>>>>
>>>>> The universe size is set to 1 on a singleton because the attribute gets set at the beginning of time - we haven't any way to go back and change it. The sequence of events explains why. The singleton starts up and sets its attributes, including universe_size. It also spins off an orte daemon to act as its own private "mpirun" in case you call comm_spawn. At this point, however, no hostfile has been read - the singleton is just an MPI proc doing its own thing, and the orte daemon is just sitting there on "stand-by".
>>>>>
>>>>> When your app calls comm_spawn, then the orte daemon gets called to launch the new procs. At that time, it (not the original singleton!) reads the hostfile to find out how many nodes are around, and then does the launch.
>>>>>
>>>>> You are trying to check the number of nodes from within the singleton, which won't work - it has no way of discovering that info.
>>>>>
>>>>> On Aug 28, 2012, at 2:38 PM, Brian Budge wrote:
>>>>>
>>>>>> > echo hostsfile
>>>>>> localhost
>>>>>> budgeb-sandybridge
>>>>>>
>>>>>> Thanks,
>>>>>> Brian
>>>>>>
>>>>>> On
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Yeah, I'm seeing the hang as well when running across multiple machines. Let me dig a little and get this fixed. Thanks Ralph On Aug 28, 2012, at 4:51 PM, Brian Budgewrote: > Hmmm, I went to the build directories of openmpi for my two machines, > went into the orte/test/mpi directory and made the executables on both > machines. I set the hostsfile in the env variable on the "master" > machine. > > Here's the output: > > OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile > ./simple_spawn > Parent [pid 97504] starting up! > 0 completed MPI_Init > Parent [pid 97504] about to spawn! > Parent [pid 97507] starting up! > Parent [pid 97508] starting up! > Parent [pid 30626] starting up! > ^C > zsh: interrupt OMPI_MCA_orte_default_hostfile= ./simple_spawn > > I had to ^C to kill the hung process. > > When I run using mpirun: > > OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile > mpirun -np 1 ./simple_spawn > Parent [pid 97511] starting up! > 0 completed MPI_Init > Parent [pid 97511] about to spawn! > Parent [pid 97513] starting up! > Parent [pid 30762] starting up! > Parent [pid 30764] starting up! > Parent done with spawn > Parent sending message to child > 1 completed MPI_Init > Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513 > 0 completed MPI_Init > Hello from the child 0 of 3 on host budgeb-interlagos pid 30762 > 2 completed MPI_Init > Hello from the child 2 of 3 on host budgeb-interlagos pid 30764 > Child 1 disconnected > Child 0 received msg: 38 > Child 0 disconnected > Parent disconnected > Child 2 disconnected > 97511: exiting > 97513: exiting > 30762: exiting > 30764: exiting > > As you can see, I'm using openmpi v 1.6.1. I just barely freshly > installed on both machines using the default configure options. > > Thanks for all your help. 
> > Brian > > On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain wrote: >> Looks to me like it didn't find your executable - could be a question of >> where it exists relative to where you are running. If you look in your OMPI >> source tree at the orte/test/mpi directory, you'll see an example program >> "simple_spawn.c" there. Just "make simple_spawn" and execute that with your >> default hostfile set - does it work okay? >> >> It works fine for me, hence the question. >> >> Also, what OMPI version are you using? >> >> On Aug 28, 2012, at 4:25 PM, Brian Budge wrote: >> >>> I see. Okay. So, I just tried removing the check for universe size, >>> and set the universe size to 2. Here's my output: >>> >>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe >>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file >>> base/plm_base_receive.c at line 253 >>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified >>> application failed to start in file dpm_orte.c at line 785 >>> >>> The corresponding run with mpirun still works. >>> >>> Thanks, >>> Brian >>> >>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain wrote: I see the issue - it's here: > MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); > > if(!flag) { > std::cerr << "no universe size" << std::endl; > return -1; > } > universeSize = *puniverseSize; > if(universeSize == 1) { > std::cerr << "cannot start slaves... not enough nodes" << std::endl; > } The universe size is set to 1 on a singleton because the attribute gets set at the beginning of time - we haven't any way to go back and change it. The sequence of events explains why. The singleton starts up and sets its attributes, including universe_size. It also spins off an orte daemon to act as its own private "mpirun" in case you call comm_spawn. 
At this point, however, no hostfile has been read - the singleton is just an MPI proc doing its own thing, and the orte daemon is just sitting there on "stand-by". When your app calls comm_spawn, then the orte daemon gets called to launch the new procs. At that time, it (not the original singleton!) reads the hostfile to find out how many nodes are around, and then does the launch. You are trying to check the number of nodes from within the singleton, which won't work - it has no way of discovering that info. On Aug 28, 2012, at 2:38 PM, Brian Budge wrote: >> echo hostsfile > localhost > budgeb-sandybridge > > Thanks, > Brian > > On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain wrote: >> Hmmm...what is in your "hostsfile"? >> >> On Aug 28, 2012, at 2:33 PM, Brian Budge
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Hmmm, I went to the build directories of openmpi for my two machines, went into the orte/test/mpi directory and made the executables on both machines. I set the hostsfile in the env variable on the "master" machine. Here's the output: OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile ./simple_spawn Parent [pid 97504] starting up! 0 completed MPI_Init Parent [pid 97504] about to spawn! Parent [pid 97507] starting up! Parent [pid 97508] starting up! Parent [pid 30626] starting up! ^C zsh: interrupt OMPI_MCA_orte_default_hostfile= ./simple_spawn I had to ^C to kill the hung process. When I run using mpirun: OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile mpirun -np 1 ./simple_spawn Parent [pid 97511] starting up! 0 completed MPI_Init Parent [pid 97511] about to spawn! Parent [pid 97513] starting up! Parent [pid 30762] starting up! Parent [pid 30764] starting up! Parent done with spawn Parent sending message to child 1 completed MPI_Init Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513 0 completed MPI_Init Hello from the child 0 of 3 on host budgeb-interlagos pid 30762 2 completed MPI_Init Hello from the child 2 of 3 on host budgeb-interlagos pid 30764 Child 1 disconnected Child 0 received msg: 38 Child 0 disconnected Parent disconnected Child 2 disconnected 97511: exiting 97513: exiting 30762: exiting 30764: exiting As you can see, I'm using openmpi v 1.6.1. I just barely freshly installed on both machines using the default configure options. Thanks for all your help. Brian On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castainwrote: > Looks to me like it didn't find your executable - could be a question of > where it exists relative to where you are running. If you look in your OMPI > source tree at the orte/test/mpi directory, you'll see an example program > "simple_spawn.c" there. 
Just "make simple_spawn" and execute that with your > default hostfile set - does it work okay? > > It works fine for me, hence the question. > > Also, what OMPI version are you using? > > On Aug 28, 2012, at 4:25 PM, Brian Budge wrote: > >> I see. Okay. So, I just tried removing the check for universe size, >> and set the universe size to 2. Here's my output: >> >> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe >> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file >> base/plm_base_receive.c at line 253 >> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified >> application failed to start in file dpm_orte.c at line 785 >> >> The corresponding run with mpirun still works. >> >> Thanks, >> Brian >> >> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain wrote: >>> I see the issue - it's here: >>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); if(!flag) { std::cerr << "no universe size" << std::endl; return -1; } universeSize = *puniverseSize; if(universeSize == 1) { std::cerr << "cannot start slaves... not enough nodes" << std::endl; } >>> >>> The universe size is set to 1 on a singleton because the attribute gets set >>> at the beginning of time - we haven't any way to go back and change it. The >>> sequence of events explains why. The singleton starts up and sets its >>> attributes, including universe_size. It also spins off an orte daemon to >>> act as its own private "mpirun" in case you call comm_spawn. At this point, >>> however, no hostfile has been read - the singleton is just an MPI proc >>> doing its own thing, and the orte daemon is just sitting there on >>> "stand-by". >>> >>> When your app calls comm_spawn, then the orte daemon gets called to launch >>> the new procs. At that time, it (not the original singleton!) reads the >>> hostfile to find out how many nodes are around, and then does the launch. 
>>> >>> You are trying to check the number of nodes from within the singleton, >>> which won't work - it has no way of discovering that info. >>> >>> >>> >>> >>> On Aug 28, 2012, at 2:38 PM, Brian Budge wrote: >>> > echo hostsfile localhost budgeb-sandybridge Thanks, Brian On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain wrote: > Hmmm...what is in your "hostsfile"? > > On Aug 28, 2012, at 2:33 PM, Brian Budge wrote: > >> Hi Ralph - >> >> Thanks for confirming this is possible. I'm trying this and currently >> failing. Perhaps there's something I'm missing in the code to make >> this work. Here are the two instantiations and their outputs: >> >>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Looks to me like it didn't find your executable - could be a question of where it exists relative to where you are running. If you look in your OMPI source tree at the orte/test/mpi directory, you'll see an example program "simple_spawn.c" there. Just "make simple_spawn" and execute that with your default hostfile set - does it work okay? It works fine for me, hence the question. Also, what OMPI version are you using? On Aug 28, 2012, at 4:25 PM, Brian Budgewrote: > I see. Okay. So, I just tried removing the check for universe size, > and set the universe size to 2. Here's my output: > > LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib > OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe > [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file > base/plm_base_receive.c at line 253 > [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified > application failed to start in file dpm_orte.c at line 785 > > The corresponding run with mpirun still works. > > Thanks, > Brian > > On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain wrote: >> I see the issue - it's here: >> >>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); >>> >>> if(!flag) { >>> std::cerr << "no universe size" << std::endl; >>> return -1; >>> } >>> universeSize = *puniverseSize; >>> if(universeSize == 1) { >>> std::cerr << "cannot start slaves... not enough nodes" << std::endl; >>> } >> >> The universe size is set to 1 on a singleton because the attribute gets set >> at the beginning of time - we haven't any way to go back and change it. The >> sequence of events explains why. The singleton starts up and sets its >> attributes, including universe_size. It also spins off an orte daemon to act >> as its own private "mpirun" in case you call comm_spawn. At this point, >> however, no hostfile has been read - the singleton is just an MPI proc doing >> its own thing, and the orte daemon is just sitting there on "stand-by". 
>> >> When your app calls comm_spawn, then the orte daemon gets called to launch >> the new procs. At that time, it (not the original singleton!) reads the >> hostfile to find out how many nodes are around, and then does the launch. >> >> You are trying to check the number of nodes from within the singleton, which >> won't work - it has no way of discovering that info. >> >> >> >> >> On Aug 28, 2012, at 2:38 PM, Brian Budge wrote: >> echo hostsfile >>> localhost >>> budgeb-sandybridge >>> >>> Thanks, >>> Brian >>> >>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain wrote: Hmmm...what is in your "hostsfile"? On Aug 28, 2012, at 2:33 PM, Brian Budge wrote: > Hi Ralph - > > Thanks for confirming this is possible. I'm trying this and currently > failing. Perhaps there's something I'm missing in the code to make > this work. Here are the two instantiations and their outputs: > >> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe > cannot start slaves... not enough nodes > >> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe > master spawned 1 slaves... > slave responding... 
> > > The code: > > //master.cpp > #include > #include > #include > > int main(int argc, char **args) { > int worldSize, universeSize, *puniverseSize, flag; > > MPI_Comm everyone; //intercomm > boost::filesystem::path curPath = > boost::filesystem::absolute(boost::filesystem::current_path()); > > std::string toRun = (curPath / "slave_exe").string(); > > int ret = MPI_Init(, ); > > if(ret != MPI_SUCCESS) { > std::cerr << "failed init" << std::endl; > return -1; > } > > MPI_Comm_size(MPI_COMM_WORLD, ); > > if(worldSize != 1) { > std::cerr << "too many masters" << std::endl; > } > > MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); > > if(!flag) { > std::cerr << "no universe size" << std::endl; > return -1; > } > universeSize = *puniverseSize; > if(universeSize == 1) { > std::cerr << "cannot start slaves... not enough nodes" << std::endl; > } > > > char *buf = (char*)alloca(toRun.size() + 1); > memcpy(buf, toRun.c_str(), toRun.size()); > buf[toRun.size()] = '\0'; > > MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL, > 0, MPI_COMM_SELF, , > MPI_ERRCODES_IGNORE); > > std::cerr << "master spawned " << universeSize-1 << " slaves..." > << std::endl; > > MPI_Finalize();
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
I see. Okay. So, I just tried removing the check for universe size, and set the universe size to 2. Here's my output: LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file base/plm_base_receive.c at line 253 [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified application failed to start in file dpm_orte.c at line 785 The corresponding run with mpirun still works. Thanks, Brian On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castainwrote: > I see the issue - it's here: > >> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); >> >> if(!flag) { >> std::cerr << "no universe size" << std::endl; >> return -1; >> } >> universeSize = *puniverseSize; >> if(universeSize == 1) { >> std::cerr << "cannot start slaves... not enough nodes" << std::endl; >> } > > The universe size is set to 1 on a singleton because the attribute gets set > at the beginning of time - we haven't any way to go back and change it. The > sequence of events explains why. The singleton starts up and sets its > attributes, including universe_size. It also spins off an orte daemon to act > as its own private "mpirun" in case you call comm_spawn. At this point, > however, no hostfile has been read - the singleton is just an MPI proc doing > its own thing, and the orte daemon is just sitting there on "stand-by". > > When your app calls comm_spawn, then the orte daemon gets called to launch > the new procs. At that time, it (not the original singleton!) reads the > hostfile to find out how many nodes are around, and then does the launch. > > You are trying to check the number of nodes from within the singleton, which > won't work - it has no way of discovering that info. 
> On Aug 28, 2012, at 2:38 PM, Brian Budge wrote:
>
>> > echo hostsfile
>> localhost
>> budgeb-sandybridge
>>
>> Thanks,
>> Brian
>>
>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain wrote:
>>> Hmmm...what is in your "hostsfile"?
>>>
>>> On Aug 28, 2012, at 2:33 PM, Brian Budge wrote:
>>>
>>>> Hi Ralph -
>>>>
>>>> Thanks for confirming this is possible. I'm trying this and currently failing. Perhaps there's something I'm missing in the code to make this work. Here are the two instantiations and their outputs:
>>>>
>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>> cannot start slaves... not enough nodes
>>>>
>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe
>>>> master spawned 1 slaves...
>>>> slave responding...
>>>>
>>>> The code:
>>>>
>>>> //master.cpp
>>>> #include <mpi.h>
>>>> #include <boost/filesystem.hpp>
>>>> #include <iostream>
>>>>
>>>> int main(int argc, char **args) {
>>>>   int worldSize, universeSize, *puniverseSize, flag;
>>>>
>>>>   MPI_Comm everyone; //intercomm
>>>>   boost::filesystem::path curPath =
>>>>     boost::filesystem::absolute(boost::filesystem::current_path());
>>>>
>>>>   std::string toRun = (curPath / "slave_exe").string();
>>>>
>>>>   int ret = MPI_Init(&argc, &args);
>>>>
>>>>   if(ret != MPI_SUCCESS) {
>>>>     std::cerr << "failed init" << std::endl;
>>>>     return -1;
>>>>   }
>>>>
>>>>   MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
>>>>
>>>>   if(worldSize != 1) {
>>>>     std::cerr << "too many masters" << std::endl;
>>>>   }
>>>>
>>>>   MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>
>>>>   if(!flag) {
>>>>     std::cerr << "no universe size" << std::endl;
>>>>     return -1;
>>>>   }
>>>>   universeSize = *puniverseSize;
>>>>   if(universeSize == 1) {
>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>   }
>>>>
>>>>   char *buf = (char*)alloca(toRun.size() + 1);
>>>>   memcpy(buf, toRun.c_str(), toRun.size());
>>>>   buf[toRun.size()] = '\0';
>>>>
>>>>   MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL,
>>>>                  0, MPI_COMM_SELF, &everyone,
>>>>                  MPI_ERRCODES_IGNORE);
>>>>
>>>>   std::cerr << "master spawned " << universeSize-1 << " slaves..."
>>>>             << std::endl;
>>>>
>>>>   MPI_Finalize();
>>>>
>>>>   return 0;
>>>> }
>>>>
>>>> //slave.cpp
>>>> #include <mpi.h>
>>>> #include <iostream>
>>>>
>>>> int main(int argc, char **args) {
>>>>   int size;
>>>>   MPI_Comm parent;
>>>>   MPI_Init(&argc, &args);
>>>>
>>>>   MPI_Comm_get_parent(&parent);
>>>>
>>>>   if(parent == MPI_COMM_NULL) {
>>>>     std::cerr << "slave has no parent" << std::endl;
>>>>   }
>>>>   MPI_Comm_remote_size(parent, &size);
>>>>   if(size != 1) {
>>>>     std::cerr << "parent size is " << size << std::endl;
>>>>   }
>>>>
>>>>   std::cerr << "slave responding..." << std::endl;
>>>>
>>>>   MPI_Finalize();
>>>>
>>>>   return 0;
>>>> }
>>>>
>>>> Any ideas? Thanks for any help.
>>>>
>>>> Brian
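For anyone reproducing this, the hostfile-plus-environment setup discussed in the thread can be sketched as a shell session. The /tmp/hostsfile path is an example; the hostnames are the ones from the thread.

```shell
# Write a default hostfile listing the machines comm_spawn may use
cat > /tmp/hostsfile <<'EOF'
localhost
budgeb-sandybridge
EOF

# Point Open MPI's default-hostfile MCA parameter at it
export OMPI_MCA_orte_default_hostfile=/tmp/hostsfile

# A singleton only reads it when it actually calls comm_spawn:
#   ./master_exe
# Equivalent explicit launch:
#   mpirun -n 1 ./master_exe
```

Note that, per Ralph's explanation, the singleton itself never learns the node count from this file; only the orte daemon consults it at spawn time.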
Re: [OMPI users] Fwd: lwkmpi
There is only one file where "return { ... };" is used. --disable-vt seems to fix it.

-- Reuti

On 28.08.2012, at 14:56, Tim Prince wrote:

> On 8/28/2012 5:11 AM, 清风 wrote:
>>
>> ------ Original message ------
>> From: "295187383" <295187...@qq.com>
>> Sent: Tuesday, August 28, 2012, 4:13 PM
>> To: "users"
>> Subject: lwkmpi
>>
>> Hi everybody,
>> I'm trying to compile openmpi with the Intel compiler 11.1.07 on Ubuntu. I compiled openmpi many times and I could always find a problem. But the error that I'm getting now gives me no clues where to even search for the problem. It seems I have succeeded to configure. While I try "make all", it always shows the problems below:
>>
>> make[7]: Entering directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
>> /opt/intel/Compiler/11.1/072/bin/ia32/icpc -DHAVE_CONFIG_H -I. -I../../.. -DINSIDE_OPENMPI -I/home/lwk/桌面/mnt/Software/openmpi-1.6.1/opal/mca/hwloc/hwloc132/hwloc/include -I/usr/include/infiniband -I/usr/include/infiniband -DOPARI_VT -O3 -DNDEBUG -finline-functions -pthread -MT opari-ompragma_c.o -MD -MP -MF .deps/opari-ompragma_c.Tpo -c -o opari-ompragma_c.o `test -f 'ompragma_c.cc' || echo './'`ompragma_c.cc
>> /usr/include/c++/4.5/iomanip(64): error: expected an expression
>>   { return { __mask }; }
>>   ^
>
> Looks like your icpc is too old to work with your g++. If you want to build with C++ support, you'll need better matching versions of icpc and g++. icpc support for g++ 4.7 is expected to release within the next month; icpc 12.1 should be fine with g++ 4.5 and 4.6.
>
> --
> Tim Prince
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
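Reuti's workaround is applied at configure time. A sketch, assuming a typical Open MPI 1.6.1 source-tree build (the install prefix is hypothetical; --disable-vt skips the VampirTrace contrib package containing opari, which is where icpc trips over the g++ 4.5 headers):

```shell
# Configure Open MPI with the Intel compilers but without VampirTrace,
# then build and install as usual.
./configure CC=icc CXX=icpc --disable-vt --prefix=$HOME/opt/openmpi-1.6.1
make all
make install
```

The cleaner long-term fix, per Tim Prince, is matching icpc and g++ versions (e.g. icpc 12.1 with g++ 4.5/4.6), at which point --disable-vt is unnecessary.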
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
I see the issue - it's here: > MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); > > if(!flag) { > std::cerr << "no universe size" << std::endl; > return -1; > } > universeSize = *puniverseSize; > if(universeSize == 1) { > std::cerr << "cannot start slaves... not enough nodes" << std::endl; > } The universe size is set to 1 on a singleton because the attribute gets set at the beginning of time - we haven't any way to go back and change it. The sequence of events explains why. The singleton starts up and sets its attributes, including universe_size. It also spins off an orte daemon to act as its own private "mpirun" in case you call comm_spawn. At this point, however, no hostfile has been read - the singleton is just an MPI proc doing its own thing, and the orte daemon is just sitting there on "stand-by". When your app calls comm_spawn, then the orte daemon gets called to launch the new procs. At that time, it (not the original singleton!) reads the hostfile to find out how many nodes are around, and then does the launch. You are trying to check the number of nodes from within the singleton, which won't work - it has no way of discovering that info. On Aug 28, 2012, at 2:38 PM, Brian Budgewrote: >> echo hostsfile > localhost > budgeb-sandybridge > > Thanks, > Brian > > On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain wrote: >> Hmmm...what is in your "hostsfile"? >> >> On Aug 28, 2012, at 2:33 PM, Brian Budge wrote: >> >>> Hi Ralph - >>> >>> Thanks for confirming this is possible. I'm trying this and currently >>> failing. Perhaps there's something I'm missing in the code to make >>> this work. Here are the two instantiations and their outputs: >>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe >>> cannot start slaves... 
not enough nodes >>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe >>> master spawned 1 slaves... >>> slave responding... >>> >>> >>> The code: >>> >>> //master.cpp >>> #include >>> #include >>> #include >>> >>> int main(int argc, char **args) { >>> int worldSize, universeSize, *puniverseSize, flag; >>> >>> MPI_Comm everyone; //intercomm >>> boost::filesystem::path curPath = >>> boost::filesystem::absolute(boost::filesystem::current_path()); >>> >>> std::string toRun = (curPath / "slave_exe").string(); >>> >>> int ret = MPI_Init(, ); >>> >>> if(ret != MPI_SUCCESS) { >>> std::cerr << "failed init" << std::endl; >>> return -1; >>> } >>> >>> MPI_Comm_size(MPI_COMM_WORLD, ); >>> >>> if(worldSize != 1) { >>> std::cerr << "too many masters" << std::endl; >>> } >>> >>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, , ); >>> >>> if(!flag) { >>> std::cerr << "no universe size" << std::endl; >>> return -1; >>> } >>> universeSize = *puniverseSize; >>> if(universeSize == 1) { >>> std::cerr << "cannot start slaves... not enough nodes" << std::endl; >>> } >>> >>> >>> char *buf = (char*)alloca(toRun.size() + 1); >>> memcpy(buf, toRun.c_str(), toRun.size()); >>> buf[toRun.size()] = '\0'; >>> >>> MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL, >>> 0, MPI_COMM_SELF, , >>> MPI_ERRCODES_IGNORE); >>> >>> std::cerr << "master spawned " << universeSize-1 << " slaves..." >>> << std::endl; >>> >>> MPI_Finalize(); >>> >>> return 0; >>> } >>> >>> >>> //slave.cpp >>> #include >>> >>> int main(int argc, char **args) { >>> int size; >>> MPI_Comm parent; >>> MPI_Init(, ); >>> >>> MPI_Comm_get_parent(); >>> >>> if(parent == MPI_COMM_NULL) { >>> std::cerr << "slave has no parent" << std::endl; >>> } >>> MPI_Comm_remote_size(parent, ); >>> if(size != 1) { >>> std::cerr << "parent size is " << size << std::endl; >>> } >>> >>> std::cerr << "slave responding..." 
<< std::endl; >>> >>> MPI_Finalize(); >>> >>> return 0; >>> } >>> >>> >>> Any ideas? Thanks for any help. >>> >>> Brian >>> >>> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain wrote: It really is just that simple :-) On Aug 22, 2012, at 8:56 AM, Brian Budge wrote: > Okay. Is there a tutorial or FAQ for setting everything up? Or is it > really just that simple? I don't need to run a copy of the orte > server somewhere? > > if my current ip is 192.168.0.1, > > 0 > echo 192.168.0.11 > /tmp/hostfile > 1 > echo 192.168.0.12 >> /tmp/hostfile > 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile > 3 > ./mySpawningExe > > At this point, mySpawningExe will be the master, running on > 192.168.0.1, and I can have spawned, for example, childExe on
Re: [OMPI users] MPI_Init
Okay, I fixed this on our trunk - I'll post it for transfer to the 1.7 and 1.6 series in their next releases. Thanks! On Aug 28, 2012, at 2:27 PM, Ralph Castain wrote: > Oh crud - yes we do. Checking on it... > > On Aug 28, 2012, at 2:23 PM, Ralph Castain wrote: > >> Glancing at the code, I don't see anywhere that we trap SIGCHLD outside of >> mpirun and the orte daemons - certainly not inside an MPI app. What version >> of OMPI are you using? >> >> On Aug 28, 2012, at 2:06 PM, Tony Raymond wrote: >> >>> Hi, >>> >>> I have an application that uses openMPI and creates some child processes >>> using fork(). I've been trying to catch SIGCHLD in order to check the exit >>> status of these processes so that the program will exit if a child errors >>> out. >>> >>> I've found out that if I set the SIGCHLD handler before calling MPI_Init, >>> MPI_Init sets the SIGCHLD handler so that my application appears to ignore >>> SIGCHLD, but if I set my handler after MPI_Init, the application handles >>> SIGCHLD appropriately. >>> >>> I'm wondering if there are any problems that could come up by changing the >>> SIGCHLD handler, and why MPI_Init modifies the SIGCHLD handler in the first >>> place. >>> >>> Thanks, >>> Tony >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>echo hostsfile localhost budgeb-sandybridge Thanks, Brian On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain wrote: > Hmmm...what is in your "hostsfile"? > > On Aug 28, 2012, at 2:33 PM, Brian Budge wrote: > >> Hi Ralph - >> >> Thanks for confirming this is possible. I'm trying this and currently >> failing. Perhaps there's something I'm missing in the code to make >> this work. Here are the two instantiations and their outputs: >> >>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe >> cannot start slaves... not enough nodes >> >>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe >> master spawned 1 slaves... >> slave responding... >> >> >> The code: >> >> //master.cpp >> #include <boost/filesystem.hpp> >> #include <iostream> >> #include <mpi.h> >> >> int main(int argc, char **args) { >>    int worldSize, universeSize, *puniverseSize, flag; >> >>    MPI_Comm everyone; //intercomm >>    boost::filesystem::path curPath = >> boost::filesystem::absolute(boost::filesystem::current_path()); >> >>    std::string toRun = (curPath / "slave_exe").string(); >> >>    int ret = MPI_Init(&argc, &args); >> >>    if(ret != MPI_SUCCESS) { >>    std::cerr << "failed init" << std::endl; >>    return -1; >>    } >> >>    MPI_Comm_size(MPI_COMM_WORLD, &worldSize); >> >>    if(worldSize != 1) { >>    std::cerr << "too many masters" << std::endl; >>    } >> >>    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag); >> >>    if(!flag) { >>    std::cerr << "no universe size" << std::endl; >>    return -1; >>    } >>    universeSize = *puniverseSize; >>    if(universeSize == 1) { >>    std::cerr << "cannot start slaves... not enough nodes" << std::endl; >>    } >> >> >>    char *buf = (char*)alloca(toRun.size() + 1); >>    memcpy(buf, toRun.c_str(), toRun.size()); >>    buf[toRun.size()] = '\0'; >> >>    MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL, >> 0, MPI_COMM_SELF, &everyone, >> MPI_ERRCODES_IGNORE); >> >>    std::cerr << "master spawned " << universeSize-1 << " slaves..." >> << std::endl; >> >>    MPI_Finalize(); >> >> return 0; >> } >> >> >> //slave.cpp >> #include <iostream> >> #include <mpi.h> >> >> int main(int argc, char **args) { >>    int size; >>    MPI_Comm parent; >>    MPI_Init(&argc, &args); >> >>    MPI_Comm_get_parent(&parent); >> >>    if(parent == MPI_COMM_NULL) { >>    std::cerr << "slave has no parent" << std::endl; >>    } >>    MPI_Comm_remote_size(parent, &size); >>    if(size != 1) { >>    std::cerr << "parent size is " << size << std::endl; >>    } >> >>    std::cerr << "slave responding..." << std::endl; >> >>    MPI_Finalize(); >> >>    return 0; >> } >> >> >> Any ideas? Thanks for any help. >> >> Brian >> >> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain wrote: >>> It really is just that simple :-) >>> >>> On Aug 22, 2012, at 8:56 AM, Brian Budge wrote: Okay. Is there a tutorial or FAQ for setting everything up? Or is it really just that simple? I don't need to run a copy of the orte server somewhere? if my current ip is 192.168.0.1, 0 > echo 192.168.0.11 > /tmp/hostfile 1 > echo 192.168.0.12 >> /tmp/hostfile 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile 3 > ./mySpawningExe At this point, mySpawningExe will be the master, running on 192.168.0.1, and I can have spawned, for example, childExe on 192.168.0.11 and 192.168.0.12? Or childExe1 on 192.168.0.11 and childExe2 on 192.168.0.12? Thanks for the help. Brian On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain wrote: > Sure, that's still true on all 1.3 or above releases. All you need to do > is set the hostfile envar so we pick it up: > > OMPI_MCA_orte_default_hostfile= > > > On Aug 21, 2012, at 7:23 PM, Brian Budge wrote: >> Hi.
I know this is an old thread, but I'm curious if there are any >> tutorials describing how to set this up? Is this still available on >> newer open mpi versions? >> >> Thanks, >> Brian >> >> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain wrote: >>> Hi Elena >>> >>> I'm copying this to the user list just to correct a mis-statement on my >>> part >>> in an earlier message that went there. I had stated that a singleton >>> could >>> comm_spawn onto other nodes listed in a hostfile by setting an >>> environmental >>> variable that pointed us to the hostfile. >>> >>> This is incorrect in the 1.2 code series. That series does not allow >>> singletons to read a hostfile at all. Hence, any comm_spawn done by a >>> singleton can only launch child processes on the singleton's local host. >>> >>> This
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Hmmm...what is in your "hostsfile"? On Aug 28, 2012, at 2:33 PM, Brian Budge wrote: > Hi Ralph - > > Thanks for confirming this is possible. I'm trying this and currently > failing. Perhaps there's something I'm missing in the code to make > this work. Here are the two instantiations and their outputs: > >> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe > cannot start slaves... not enough nodes > >> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib >> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe > master spawned 1 slaves... > slave responding... > > > The code: > > //master.cpp > #include <boost/filesystem.hpp> > #include <iostream> > #include <mpi.h> > > int main(int argc, char **args) { >    int worldSize, universeSize, *puniverseSize, flag; > >    MPI_Comm everyone; //intercomm >    boost::filesystem::path curPath = > boost::filesystem::absolute(boost::filesystem::current_path()); > >    std::string toRun = (curPath / "slave_exe").string(); > >    int ret = MPI_Init(&argc, &args); > >    if(ret != MPI_SUCCESS) { >    std::cerr << "failed init" << std::endl; >    return -1; >    } > >    MPI_Comm_size(MPI_COMM_WORLD, &worldSize); > >    if(worldSize != 1) { >    std::cerr << "too many masters" << std::endl; >    } > >    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag); > >    if(!flag) { >    std::cerr << "no universe size" << std::endl; >    return -1; >    } >    universeSize = *puniverseSize; >    if(universeSize == 1) { >    std::cerr << "cannot start slaves... not enough nodes" << std::endl; >    } > > >    char *buf = (char*)alloca(toRun.size() + 1); >    memcpy(buf, toRun.c_str(), toRun.size()); >    buf[toRun.size()] = '\0'; > >    MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL, > 0, MPI_COMM_SELF, &everyone, > MPI_ERRCODES_IGNORE); > >    std::cerr << "master spawned " << universeSize-1 << " slaves..." > << std::endl; > >    MPI_Finalize(); > > return 0; > } > > > //slave.cpp > #include <iostream> > #include <mpi.h> > > int main(int argc, char **args) { >    int size; >    MPI_Comm parent; >    MPI_Init(&argc, &args); > >    MPI_Comm_get_parent(&parent); > >    if(parent == MPI_COMM_NULL) { >    std::cerr << "slave has no parent" << std::endl; >    } >    MPI_Comm_remote_size(parent, &size); >    if(size != 1) { >    std::cerr << "parent size is " << size << std::endl; >    } > >    std::cerr << "slave responding..." << std::endl; > >    MPI_Finalize(); > >    return 0; > } > > > Any ideas? Thanks for any help. > > Brian > > On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain wrote: >> It really is just that simple :-) >> >> On Aug 22, 2012, at 8:56 AM, Brian Budge wrote: >> >>> Okay. Is there a tutorial or FAQ for setting everything up? Or is it >>> really just that simple? I don't need to run a copy of the orte >>> server somewhere? >>> >>> if my current ip is 192.168.0.1, >>> >>> 0 > echo 192.168.0.11 > /tmp/hostfile >>> 1 > echo 192.168.0.12 >> /tmp/hostfile >>> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile >>> 3 > ./mySpawningExe >>> >>> At this point, mySpawningExe will be the master, running on >>> 192.168.0.1, and I can have spawned, for example, childExe on >>> 192.168.0.11 and 192.168.0.12? Or childExe1 on 192.168.0.11 and >>> childExe2 on 192.168.0.12? >>> >>> Thanks for the help. >>> >>> Brian >>> >>> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain wrote: Sure, that's still true on all 1.3 or above releases. All you need to do is set the hostfile envar so we pick it up: OMPI_MCA_orte_default_hostfile= On Aug 21, 2012, at 7:23 PM, Brian Budge wrote: > Hi. I know this is an old thread, but I'm curious if there are any > tutorials describing how to set this up? Is this still available on > newer open mpi versions? > > Thanks, > Brian > > On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain wrote: >> Hi Elena >> >> I'm copying this to the user list just to correct a mis-statement on my >> part >> in an earlier message that went there.
I had stated that a singleton >> could >> comm_spawn onto other nodes listed in a hostfile by setting an >> environmental >> variable that pointed us to the hostfile. >> >> This is incorrect in the 1.2 code series. That series does not allow >> singletons to read a hostfile at all. Hence, any comm_spawn done by a >> singleton can only launch child processes on the singleton's local host. >> >> This situation has been corrected for the upcoming 1.3 code series. For >> the >> 1.2 series, though, you will have to do it via an mpirun command line. >> >> Sorry for the confusion - I sometimes have too many code families to keep >>
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Hi Ralph -

Thanks for confirming this is possible. I'm trying this and currently failing. Perhaps there's something I'm missing in the code to make this work. Here are the two instantiations and their outputs:

> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
cannot start slaves... not enough nodes

> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe
master spawned 1 slaves...
slave responding...

The code:

//master.cpp
#include <boost/filesystem.hpp>
#include <iostream>
#include <mpi.h>

int main(int argc, char **args) {
    int worldSize, universeSize, *puniverseSize, flag;

    MPI_Comm everyone; //intercomm
    boost::filesystem::path curPath =
        boost::filesystem::absolute(boost::filesystem::current_path());

    std::string toRun = (curPath / "slave_exe").string();

    int ret = MPI_Init(&argc, &args);

    if(ret != MPI_SUCCESS) {
        std::cerr << "failed init" << std::endl;
        return -1;
    }

    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    if(worldSize != 1) {
        std::cerr << "too many masters" << std::endl;
    }

    MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);

    if(!flag) {
        std::cerr << "no universe size" << std::endl;
        return -1;
    }
    universeSize = *puniverseSize;
    if(universeSize == 1) {
        std::cerr << "cannot start slaves... not enough nodes" << std::endl;
    }

    char *buf = (char*)alloca(toRun.size() + 1);
    memcpy(buf, toRun.c_str(), toRun.size());
    buf[toRun.size()] = '\0';

    MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &everyone,
                   MPI_ERRCODES_IGNORE);

    std::cerr << "master spawned " << universeSize-1 << " slaves..."
              << std::endl;

    MPI_Finalize();

    return 0;
}

//slave.cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **args) {
    int size;
    MPI_Comm parent;
    MPI_Init(&argc, &args);

    MPI_Comm_get_parent(&parent);

    if(parent == MPI_COMM_NULL) {
        std::cerr << "slave has no parent" << std::endl;
    }
    MPI_Comm_remote_size(parent, &size);
    if(size != 1) {
        std::cerr << "parent size is " << size << std::endl;
    }

    std::cerr << "slave responding..." << std::endl;

    MPI_Finalize();

    return 0;
}

Any ideas? Thanks for any help. Brian On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain wrote: > It really is just that simple :-) > > On Aug 22, 2012, at 8:56 AM, Brian Budge wrote: >> Okay. Is there a tutorial or FAQ for setting everything up? Or is it >> really just that simple? I don't need to run a copy of the orte >> server somewhere? >> >> if my current ip is 192.168.0.1, >> >> 0 > echo 192.168.0.11 > /tmp/hostfile >> 1 > echo 192.168.0.12 >> /tmp/hostfile >> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile >> 3 > ./mySpawningExe >> >> At this point, mySpawningExe will be the master, running on >> 192.168.0.1, and I can have spawned, for example, childExe on >> 192.168.0.11 and 192.168.0.12? Or childExe1 on 192.168.0.11 and >> childExe2 on 192.168.0.12? >> >> Thanks for the help. >> >> Brian >> >> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain wrote: >>> Sure, that's still true on all 1.3 or above releases. All you need to do is >>> set the hostfile envar so we pick it up: >>> >>> OMPI_MCA_orte_default_hostfile= >>> >>> >>> On Aug 21, 2012, at 7:23 PM, Brian Budge wrote: Hi. I know this is an old thread, but I'm curious if there are any tutorials describing how to set this up? Is this still available on newer open mpi versions? Thanks, Brian On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain wrote: > Hi Elena > > I'm copying this to the user list just to correct a mis-statement on my > part > in an earlier message that went there.
I had stated that a singleton could > comm_spawn onto other nodes listed in a hostfile by setting an > environmental > variable that pointed us to the hostfile. > > This is incorrect in the 1.2 code series. That series does not allow > singletons to read a hostfile at all. Hence, any comm_spawn done by a > singleton can only launch child processes on the singleton's local host. > > This situation has been corrected for the upcoming 1.3 code series. For > the > 1.2 series, though, you will have to do it via an mpirun command line. > > Sorry for the confusion - I sometimes have too many code families to keep > straight in this old mind! > > Ralph > > > On 1/4/08 5:10 AM, "Elena Zhebel" wrote: > >> Hello Ralph, >> >> Thank you very much for the explanations. >> But I still do not get it running... >> >> For the case >> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
Re: [OMPI users] MPI_Init
Oh crud - yes we do. Checking on it... On Aug 28, 2012, at 2:23 PM, Ralph Castain wrote: > Glancing at the code, I don't see anywhere that we trap SIGCHLD outside of > mpirun and the orte daemons - certainly not inside an MPI app. What version > of OMPI are you using? > > On Aug 28, 2012, at 2:06 PM, Tony Raymond wrote: > >> Hi, >> >> I have an application that uses openMPI and creates some child processes >> using fork(). I've been trying to catch SIGCHLD in order to check the exit >> status of these processes so that the program will exit if a child errors >> out. >> >> I've found out that if I set the SIGCHLD handler before calling MPI_Init, >> MPI_Init sets the SIGCHLD handler so that my application appears to ignore >> SIGCHLD, but if I set my handler after MPI_Init, the application handles >> SIGCHLD appropriately. >> >> I'm wondering if there are any problems that could come up by changing the >> SIGCHLD handler, and why MPI_Init modifies the SIGCHLD handler in the first >> place. >> >> Thanks, >> Tony >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] MPI_Init
Glancing at the code, I don't see anywhere that we trap SIGCHLD outside of mpirun and the orte daemons - certainly not inside an MPI app. What version of OMPI are you using? On Aug 28, 2012, at 2:06 PM, Tony Raymond wrote: > Hi, > > I have an application that uses openMPI and creates some child processes > using fork(). I've been trying to catch SIGCHLD in order to check the exit > status of these processes so that the program will exit if a child errors > out. > > I've found out that if I set the SIGCHLD handler before calling MPI_Init, > MPI_Init sets the SIGCHLD handler so that my application appears to ignore > SIGCHLD, but if I set my handler after MPI_Init, the application handles > SIGCHLD appropriately. > > I'm wondering if there are any problems that could come up by changing the > SIGCHLD handler, and why MPI_Init modifies the SIGCHLD handler in the first > place. > > Thanks, > Tony > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] MPI_Init
Hi, I have an application that uses openMPI and creates some child processes using fork(). I've been trying to catch SIGCHLD in order to check the exit status of these processes so that the program will exit if a child errors out. I've found out that if I set the SIGCHLD handler before calling MPI_Init, MPI_Init sets the SIGCHLD handler so that my application appears to ignore SIGCHLD, but if I set my handler after MPI_Init, the application handles SIGCHLD appropriately. I'm wondering if there are any problems that could come up by changing the SIGCHLD handler, and why MPI_Init modifies the SIGCHLD handler in the first place. Thanks, Tony
Re: [hwloc-users] lstopo and GPus
Hi, thanks for the reply. How can the cuda branch help me? The lstopo output of that branch is the same as the trunk's. Another question: the GPU IDs are the same (10de:06d2). How is that possible? Thanks. 2012/8/28 Samuel Thibault: > Brice Goglin, on Tue 28 Aug 2012 14:43:53 +0200, wrote: > > > $ lstopo > > > Socket #0 > > > Socket #1 > > > PCI... > > > (connected to socket #1) > > > > > > vs > > > > > > $ lstopo > > > Socket #0 > > > Socket #1 > > > PCI... > > > (connected to both sockets) > > > > Fortunately, this won't occur in most cases (including Gabriele's > > machines) because there's a NUMAnode object above each socket. > > Oops, I actually meant NUMAnode above > > > Both the socket and the PCI bus are drawn inside the NUMA box, so > > things appear OK in graphics too. > > Indeed, if the PCI bus was connected to one NUMAnode/socket only, it > would be drawn inside, which is not the case. > > > Gabriele, assuming you have a dual Xeon X56xx Westmere machine, there > > are plenty of such platforms where the GPU is indeed connected to both > > sockets. Or it could be a buggy BIOS. > > Agreed. > > Samuel > ___ > hwloc-users mailing list > hwloc-us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users > -- Ing. Gabriele Fatigati HPC specialist SuperComputing Applications and Innovation Department Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.it Tel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] deprecated MCA parameter
Ralph and I talked about this -- it seems like we should extend the help message. If there is no replacement for the param, it should say that. If there is a replacement, it should be listed. We'll take this as a feature enhancement. On Aug 28, 2012, at 9:23 AM, jody wrote: > Thanks Ralph > > I renamed the parameter in my script, > and now there are no more ugly messages :) > > Jody > > On Tue, Aug 28, 2012 at 3:17 PM, Ralph Castainwrote: >> Ah, I see - yeah, the parameter technically is being renamed to >> "orte_rsh_agent" to avoid having users need to know the internal topology of >> the code base (i.e., that it is in the plm framework and the rsh component). >> It will always be there, though - only the name is changing to protect the >> innocent. :-) >> >> >> On Aug 28, 2012, at 6:07 AM, jody wrote: >> >>> Hi Rallph >>> >>> I get one of these messages >>> -- >>> A deprecated MCA parameter value was specified in the environment or >>> on the command line. Deprecated MCA parameters should be avoided; >>> they may disappear in future releases. >>> >>> Deprecated parameter: plm_rsh_agent >>> -- >>> for every process that starts... >>> >>> My openmpi version is 1.6 (gentoo package sys-cluster/openmpi-1.6-r1) >>> >>> jody >>> >>> On Tue, Aug 28, 2012 at 2:38 PM, Ralph Castain wrote: Guess I'm confused - what is the issue here? The param still exists: MCA plm: parameter "plm_rsh_agent" (current value: >>> rsh>, data source: default value, synonyms: pls_rsh_agent, orte_rsh_agent) The command used to launch executables on remote nodes (typically either "ssh" or "rsh") I am unaware of any plans to deprecate it. Is there a problem with it? 
On Aug 28, 2012, at 2:24 AM, jody wrote: > Hi > > In order to open a xterm for each of my processes i use the MCA > parameter 'plm_rsh_agent' > like this: > mpirun -np 5 -hostfile allhosts-mca plm_base_verbose 1 -mca > plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e > ./MPIProg > > Without the option ' -mca plm_rsh_agent "ssh -Y"' i can't open windows > from the remote: > > jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts > -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e > ./MPIStruct > xterm: Xt error: Can't open display: > xterm: DISPLAY is not set > xterm: Xt error: Can't open display: > xterm: DISPLAY is not set > xterm: Xt error: Can't open display: > xterm: DISPLAY is not set > xterm: Xt error: Can't open display: > xterm: DISPLAY is not set > xterm: Xt error: Can't open display: > xterm: DISPLAY is not set > -- > mpirun noticed that the job aborted, but has no info as to the process > that caused that situation. > -- > > Is there some replacement for this parameter, > or how else can i get mpi to use" ssh -Y for" its connections? > > Thank You > jody > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] deprecated MCA parameter
Thanks Ralph I renamed the parameter in my script, and now there are no more ugly messages :) Jody On Tue, Aug 28, 2012 at 3:17 PM, Ralph Castainwrote: > Ah, I see - yeah, the parameter technically is being renamed to > "orte_rsh_agent" to avoid having users need to know the internal topology of > the code base (i.e., that it is in the plm framework and the rsh component). > It will always be there, though - only the name is changing to protect the > innocent. :-) > > > On Aug 28, 2012, at 6:07 AM, jody wrote: > >> Hi Rallph >> >> I get one of these messages >> -- >> A deprecated MCA parameter value was specified in the environment or >> on the command line. Deprecated MCA parameters should be avoided; >> they may disappear in future releases. >> >> Deprecated parameter: plm_rsh_agent >> -- >> for every process that starts... >> >> My openmpi version is 1.6 (gentoo package sys-cluster/openmpi-1.6-r1) >> >> jody >> >> On Tue, Aug 28, 2012 at 2:38 PM, Ralph Castain wrote: >>> Guess I'm confused - what is the issue here? The param still exists: >>> >>> MCA plm: parameter "plm_rsh_agent" (current value: >> rsh>, data source: default value, synonyms: >>> pls_rsh_agent, orte_rsh_agent) >>> The command used to launch executables on remote >>> nodes (typically either "ssh" or "rsh") >>> >>> I am unaware of any plans to deprecate it. Is there a problem with it? 
>>> >>> On Aug 28, 2012, at 2:24 AM, jody wrote: >>> Hi In order to open a xterm for each of my processes i use the MCA parameter 'plm_rsh_agent' like this: mpirun -np 5 -hostfile allhosts-mca plm_base_verbose 1 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg Without the option ' -mca plm_rsh_agent "ssh -Y"' i can't open windows from the remote: jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct xterm: Xt error: Can't open display: xterm: DISPLAY is not set xterm: Xt error: Can't open display: xterm: DISPLAY is not set xterm: Xt error: Can't open display: xterm: DISPLAY is not set xterm: Xt error: Can't open display: xterm: DISPLAY is not set xterm: Xt error: Can't open display: xterm: DISPLAY is not set -- mpirun noticed that the job aborted, but has no info as to the process that caused that situation. -- Is there some replacement for this parameter, or how else can i get mpi to use" ssh -Y for" its connections? Thank You jody ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] deprecated MCA parameter
Ah, I see - yeah, the parameter technically is being renamed to "orte_rsh_agent" to avoid having users need to know the internal topology of the code base (i.e., that it is in the plm framework and the rsh component). It will always be there, though - only the name is changing to protect the innocent. :-) On Aug 28, 2012, at 6:07 AM, jodywrote: > Hi Rallph > > I get one of these messages > -- > A deprecated MCA parameter value was specified in the environment or > on the command line. Deprecated MCA parameters should be avoided; > they may disappear in future releases. > > Deprecated parameter: plm_rsh_agent > -- > for every process that starts... > > My openmpi version is 1.6 (gentoo package sys-cluster/openmpi-1.6-r1) > > jody > > On Tue, Aug 28, 2012 at 2:38 PM, Ralph Castain wrote: >> Guess I'm confused - what is the issue here? The param still exists: >> >> MCA plm: parameter "plm_rsh_agent" (current value: > rsh>, data source: default value, synonyms: >> pls_rsh_agent, orte_rsh_agent) >> The command used to launch executables on remote >> nodes (typically either "ssh" or "rsh") >> >> I am unaware of any plans to deprecate it. Is there a problem with it? 
>> >> On Aug 28, 2012, at 2:24 AM, jody wrote: >> >>> Hi >>> >>> In order to open a xterm for each of my processes i use the MCA >>> parameter 'plm_rsh_agent' >>> like this: >>> mpirun -np 5 -hostfile allhosts-mca plm_base_verbose 1 -mca >>> plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e >>> ./MPIProg >>> >>> Without the option ' -mca plm_rsh_agent "ssh -Y"' i can't open windows >>> from the remote: >>> >>> jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts >>> -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e >>> ./MPIStruct >>> xterm: Xt error: Can't open display: >>> xterm: DISPLAY is not set >>> xterm: Xt error: Can't open display: >>> xterm: DISPLAY is not set >>> xterm: Xt error: Can't open display: >>> xterm: DISPLAY is not set >>> xterm: Xt error: Can't open display: >>> xterm: DISPLAY is not set >>> xterm: Xt error: Can't open display: >>> xterm: DISPLAY is not set >>> -- >>> mpirun noticed that the job aborted, but has no info as to the process >>> that caused that situation. >>> -- >>> >>> Is there some replacement for this parameter, >>> or how else can i get mpi to use" ssh -Y for" its connections? >>> >>> Thank You >>> jody >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
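[Editor's sketch] Based on the synonym list in Ralph's `ompi_info` output earlier in this thread (`plm_rsh_agent`, `pls_rsh_agent`, `orte_rsh_agent` all name the same parameter), jody's command line can avoid the deprecation warning by using the new spelling. The hostfile name and program below are jody's; this is an illustrative command fragment, not verified output:

```shell
# Old (deprecated) spelling triggered the warning:
#   mpirun ... -mca plm_rsh_agent "ssh -Y" ...
# New spelling, per the synonym list:
mpirun -np 5 -hostfile allhosts -mca orte_rsh_agent "ssh -Y" \
    --leave-session-attached xterm -hold -e ./MPIProg

# Equivalent environment-variable form (standard OMPI_MCA_ prefix,
# as used elsewhere in these threads for orte_default_hostfile):
export OMPI_MCA_orte_rsh_agent="ssh -Y"
```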
Re: [OMPI users] Fwd: lwkmpi
On 8/28/2012 5:11 AM, 清风 wrote: -- Original Message -- *From:* "295187383"<295187...@qq.com>; *Sent:* Tuesday, August 28, 2012, 4:13 PM *To:* "users"; *Subject:* lwkmpi Hi everybody, I'm trying to compile Open MPI with the Intel compiler 11.1.07 on Ubuntu. I have compiled openmpi many times and I could always find a problem. But the error that I'm getting now gives me no clues where to even search for the problem. It seems I have succeeded in configuring. But when I try "make all", it always shows the problems below: make[7]: Entering directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool' /opt/intel/Compiler/11.1/072/bin/ia32/icpc -DHAVE_CONFIG_H -I. -I../../.. -DINSIDE_OPENMPI -I/home/lwk/桌面/mnt/Software/openmpi-1.6.1/opal/mca/hwloc/hwloc132/hwloc/include -I/usr/include/infiniband -I/usr/include/infiniband -DOPARI_VT -O3 -DNDEBUG -finline-functions -pthread -MT opari-ompragma_c.o -MD -MP -MF .deps/opari-ompragma_c.Tpo -c -o opari-ompragma_c.o `test -f 'ompragma_c.cc' || echo './'`ompragma_c.cc /usr/include/c++/4.5/iomanip(64): error: expected an expression { return { __mask }; } ^ Looks like your icpc is too old to work with your g++. If you want to build with C++ support, you'll need better matching versions of icpc and g++. icpc support for g++ 4.7 is expected to release within the next month; icpc 12.1 should be fine with g++ 4.5 and 4.6. -- Tim Prince
Re: [hwloc-users] lstopo and GPus
Brice Goglin, on Tue 28 Aug 2012 14:43:53 +0200, wrote: > > $ lstopo > > Socket #0 > > Socket #1 > > PCI... > > (connected to socket #1) > > > > vs > > > > $ lstopo > > Socket #0 > > Socket #1 > > PCI... > > (connected to both sockets) > > Fortunately, this won't occur in most cases (including Gabriele's > machines) because there's a NUMAnode object above each socket. Oops, I actually meant NUMAnode above > Both the socket and the PCI bus are drawn inside the NUMA box, so > things appear OK in graphics too. Indeed, if the PCI bus was connected to one NUMAnode/socket only, it would be drawn inside, which is not the case. > Gabriele, assuming you have a dual Xeon X56xx Westmere machine, there > are plenty of such platforms where the GPU is indeed connected to both > sockets. Or it could be a buggy BIOS. Agreed. Samuel
Re: [hwloc-users] lstopo and GPus
On 28/08/2012 14:23, Samuel Thibault wrote: > Gabriele Fatigati, on Tue 28 Aug 2012 14:19:44 +0200, wrote: >> I'm using hwloc 1.5. I would like to see how GPUs are connected with the processor >> socket using the lstopo command. > About connection with the socket, there is indeed no real graphical > difference between "connected to socket #1" and "connected to all > sockets". You can use the text output for that: > > $ lstopo > Socket #0 > Socket #1 > PCI... > (connected to socket #1) > > vs > > $ lstopo > Socket #0 > Socket #1 > PCI... > (connected to both sockets) Fortunately, this won't occur in most cases (including Gabriele's machines) because there's a NUMAnode object above each socket. Both the socket and the PCI bus are drawn inside the NUMA box, so things appear OK in graphics too. I've never seen the problem on a real machine, but a fake topology with a PCI bus attached to a socket that is not strictly equal to the above NUMA node is indeed wrongly displayed. Gabriele, assuming you have a dual Xeon X56xx Westmere machine, there are plenty of such platforms where the GPU is indeed connected to both sockets. Or it could be a buggy BIOS. Brice
Re: [OMPI users] deprecated MCA parameter
Guess I'm confused - what is the issue here? The param still exists:

MCA plm: parameter "plm_rsh_agent" (current value: , data source: default value, synonyms: pls_rsh_agent, orte_rsh_agent)
         The command used to launch executables on remote nodes (typically either "ssh" or "rsh")

I am unaware of any plans to deprecate it. Is there a problem with it?

On Aug 28, 2012, at 2:24 AM, jody wrote:

> Hi
>
> In order to open an xterm for each of my processes I use the MCA
> parameter 'plm_rsh_agent' like this:
>
> mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 -mca
> plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg
>
> Without the option '-mca plm_rsh_agent "ssh -Y"' I can't open windows
> from the remote:
>
> jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts
> -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct
> xterm: Xt error: Can't open display:
> xterm: DISPLAY is not set
> xterm: Xt error: Can't open display:
> xterm: DISPLAY is not set
> xterm: Xt error: Can't open display:
> xterm: DISPLAY is not set
> xterm: Xt error: Can't open display:
> xterm: DISPLAY is not set
> xterm: Xt error: Can't open display:
> xterm: DISPLAY is not set
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> Is there some replacement for this parameter, or how else can I get MPI
> to use "ssh -Y" for its connections?
>
> Thank You
> jody
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [hwloc-users] lstopo and GPus
Gabriele Fatigati wrote on Tue 28 Aug 2012 14:19:44 +0200:
> I'm using hwloc 1.5. I would like to see how the GPUs are connected to the
> processor sockets using the lstopo command.

About the connection to the socket, there is indeed no real graphical difference between "connected to socket #1" and "connected to all sockets". You can use the text output for that:

$ lstopo
Socket #0
Socket #1
PCI... (connected to socket #1)

vs

$ lstopo
Socket #0
Socket #1
PCI... (connected to both sockets)

Samuel
[hwloc-users] lstopo and GPus
Dear hwloc users,

I'm using hwloc 1.5. I would like to see how the GPUs are connected to the processor sockets using the lstopo command. I attach the figure. The system has two GPUs, but I don't understand how to find that information from the PCI boxes.

Thanks in advance.

--
Ing. Gabriele Fatigati
HPC specialist
SuperComputing Applications and Innovation Department
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it  Tel: +39 051 6171722
g.fatigati [AT] cineca.it
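In lstopo's text output, each PCI box is labeled with its vendor:device ID pair (e.g. "PCI 10de:06d2"), so GPUs can usually be spotted by vendor ID: 10de is NVIDIA, 1002 is AMD/ATI. A small sketch of that matching, assuming the "PCI vvvv:dddd" text-output format; the helper name and the sample topology excerpt are invented for illustration:

```python
import re

# Known GPU vendors by PCI vendor ID (assumption: this short list suffices here).
GPU_VENDORS = {"10de": "NVIDIA", "1002": "AMD/ATI"}

def find_gpus(lstopo_text):
    """Return (vendor_name, vendor:device) pairs for GPU-looking PCI boxes."""
    gpus = []
    for match in re.finditer(r"PCI ([0-9a-f]{4}):([0-9a-f]{4})", lstopo_text):
        vendor, device = match.groups()
        if vendor in GPU_VENDORS:
            gpus.append((GPU_VENDORS[vendor], f"{vendor}:{device}"))
    return gpus

# Hypothetical excerpt of lstopo text output with two GPUs and one NIC:
sample = """NUMANode #0 (24GB)
  Socket #0
  PCI 10de:06d2
NUMANode #1 (24GB)
  Socket #1
  PCI 10de:06d2
  PCI 8086:10c9
"""
print(find_gpus(sample))
```

Cross-checking the IDs against `lspci -nn` output on the same machine confirms which PCI box is which device.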
Re: [OMPI users] problem with installing open mpi with intel compiler 11.1.07 on ubuntu
Try using the 1.6.2 nightly snapshot tarball and see if that fixes your problem. I'm not near a computer to give you the specific link - go to open-mpi.org, then Downloads, then Nightly snapshots, then the v1.6 series.

Sent from my phone. No type good.

On Aug 28, 2012, at 6:59 AM, "清风" <295187...@qq.com> wrote:

> Hi everybody,
>
> I'm trying to compile openmpi with the intel compiler 11.1.07 on Ubuntu.
> I have compiled openmpi many times and could always find the problem. But
> the error that I'm getting now gives me no clue where to even search for it.
> It seems I have succeeded in configuring. But when I try "make all", it
> always shows the problems below:
>
> make[7]: Entering directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
> /opt/intel/Compiler/11.1/072/bin/ia32/icpc -DHAVE_CONFIG_H -I. -I../../..
> -DINSIDE_OPENMPI
> -I/home/lwk/desktop/mnt/Software/openmpi-1.6.1/opal/mca/hwloc/hwloc132/hwloc/include
> -I/usr/include/infiniband -I/usr/include/infiniband -DOPARI_VT
> -O3 -DNDEBUG -finline-functions -pthread -MT opari-ompragma_c.o -MD -MP -MF
> .deps/opari-ompragma_c.Tpo -c -o opari-ompragma_c.o `test -f 'ompragma_c.cc'
> || echo './'`ompragma_c.cc
> /usr/include/c++/4.5/iomanip(64): error: expected an expression
> { return { __mask }; }
>          ^
> /usr/include/c++/4.5/iomanip(94): error: expected an expression
> { return { __mask }; }
>          ^
> /usr/include/c++/4.5/iomanip(125): error: expected an expression
> { return { __base }; }
>          ^
> /usr/include/c++/4.5/iomanip(193): error: expected an expression
> { return { __n }; }
>          ^
> /usr/include/c++/4.5/iomanip(223): error: expected an expression
> { return { __n }; }
>          ^
> /usr/include/c++/4.5/iomanip(163): error: expected an expression
> { return { __c }; }
>          ^
> compilation aborted for ompragma_c.cc (code 2)
> make[7]: *** [opari-ompragma_c.o] Error 2
> make[7]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
> make[6]: *** [all-recursive] Error 1
> make[6]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari'
> make[5]: *** [all-recursive] Error 1
> make[5]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools'
> make[4]: *** [all-recursive] Error 1
> make[4]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
> make[3]: *** [all] Error 2
> make[3]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi'
> make: *** [all-recursive] Error 1
>
> My files "configure.log" and "make.out" are in the attachment lwkmpi.zip.
>
> with best regards
> Liang Wenke
>
> *****************************************************************************
> ** WARNING: This email contains an attachment of a very suspicious type.   **
> ** You are urged NOT to open this attachment unless you are absolutely     **
> ** sure it is legitimate. Opening this attachment may cause irreparable    **
> ** damage to your computer and your files. If you have any questions       **
> ** about the validity of this message, PLEASE SEEK HELP BEFORE OPENING IT. **
> ** This warning was added by the IU Computer Science Dept. mail scanner.   **
> *****************************************************************************
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] problem with installing open mpi with intel compiler 11.1.07 on ubuntu
Hi everybody,

I'm trying to compile openmpi with the intel compiler 11.1.07 on Ubuntu. I have compiled openmpi many times and could always find the problem. But the error that I'm getting now gives me no clue where to even search for it. It seems I have succeeded in configuring. But when I try "make all", it always shows the problems below:

make[7]: Entering directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
/opt/intel/Compiler/11.1/072/bin/ia32/icpc -DHAVE_CONFIG_H -I. -I../../..
-DINSIDE_OPENMPI
-I/home/lwk/desktop/mnt/Software/openmpi-1.6.1/opal/mca/hwloc/hwloc132/hwloc/include
-I/usr/include/infiniband -I/usr/include/infiniband -DOPARI_VT
-O3 -DNDEBUG -finline-functions -pthread -MT opari-ompragma_c.o -MD -MP -MF
.deps/opari-ompragma_c.Tpo -c -o opari-ompragma_c.o `test -f 'ompragma_c.cc'
|| echo './'`ompragma_c.cc
/usr/include/c++/4.5/iomanip(64): error: expected an expression
{ return { __mask }; }
         ^
/usr/include/c++/4.5/iomanip(94): error: expected an expression
{ return { __mask }; }
         ^
/usr/include/c++/4.5/iomanip(125): error: expected an expression
{ return { __base }; }
         ^
/usr/include/c++/4.5/iomanip(193): error: expected an expression
{ return { __n }; }
         ^
/usr/include/c++/4.5/iomanip(223): error: expected an expression
{ return { __n }; }
         ^
/usr/include/c++/4.5/iomanip(163): error: expected an expression
{ return { __c }; }
         ^
compilation aborted for ompragma_c.cc (code 2)
make[7]: *** [opari-ompragma_c.o] Error 2
make[7]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
make[6]: *** [all-recursive] Error 1
make[6]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari'
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi'
make: *** [all-recursive] Error 1

My files "configure.log" and "make.out" are in the attachment lwkmpi.zip.

with best regards
Liang Wenke
[OMPI users] deprecated MCA parameter
Hi

In order to open an xterm for each of my processes I use the MCA parameter 'plm_rsh_agent' like this:

mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg

Without the option '-mca plm_rsh_agent "ssh -Y"' I can't open windows from the remote:

jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------

Is there some replacement for this parameter, or how else can I get MPI to use "ssh -Y" for its connections?

Thank You
jody
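For reference, here is the working invocation laid out on multiple lines (a sketch; the host file name and program are jody's own, and the parameter is confirmed above as still supported, with pls_rsh_agent and orte_rsh_agent as synonyms):

```shell
# plm_rsh_agent sets the command used to launch on remote nodes;
# "ssh -Y" enables trusted X11 forwarding so the remote xterm can find a display.
mpirun -np 5 -hostfile allhosts \
       -mca plm_base_verbose 1 \
       -mca plm_rsh_agent "ssh -Y" \
       --leave-session-attached \
       xterm -hold -e ./MPIProg
```

The failing run quoted above differs only in that it omits the plm_rsh_agent setting, so the remote shells never get a forwarded DISPLAY.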
[OMPI users] Fwd: lwkmpi
------------------ Original ------------------
From: "295187383" <295187...@qq.com>
Date: Tue, Aug 28, 2012, 4:13 PM
To: "users"
Subject: lwkmpi

Hi everybody,

I'm trying to compile openmpi with the intel compiler 11.1.07 on Ubuntu. I have compiled openmpi many times and could always find the problem. But the error that I'm getting now gives me no clue where to even search for it. It seems I have succeeded in configuring. But when I try "make all", it always shows the problems below:

make[7]: Entering directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
/opt/intel/Compiler/11.1/072/bin/ia32/icpc -DHAVE_CONFIG_H -I. -I../../..
-DINSIDE_OPENMPI
-I/home/lwk//mnt/Software/openmpi-1.6.1/opal/mca/hwloc/hwloc132/hwloc/include
-I/usr/include/infiniband -I/usr/include/infiniband -DOPARI_VT
-O3 -DNDEBUG -finline-functions -pthread -MT opari-ompragma_c.o -MD -MP -MF
.deps/opari-ompragma_c.Tpo -c -o opari-ompragma_c.o `test -f 'ompragma_c.cc'
|| echo './'`ompragma_c.cc
/usr/include/c++/4.5/iomanip(64): error: expected an expression
{ return { __mask }; }
         ^
/usr/include/c++/4.5/iomanip(94): error: expected an expression
{ return { __mask }; }
         ^
/usr/include/c++/4.5/iomanip(125): error: expected an expression
{ return { __base }; }
         ^
/usr/include/c++/4.5/iomanip(193): error: expected an expression
{ return { __n }; }
         ^
/usr/include/c++/4.5/iomanip(223): error: expected an expression
{ return { __n }; }
         ^
/usr/include/c++/4.5/iomanip(163): error: expected an expression
{ return { __c }; }
         ^
compilation aborted for ompragma_c.cc (code 2)
make[7]: *** [opari-ompragma_c.o] Error 2
make[7]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
make[6]: *** [all-recursive] Error 1
make[6]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari'
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi'
make: *** [all-recursive] Error 1

My files "configure.log" and "make.out" are in the attachment lwkmpi.zip.

with best regards
Liang Wenke
Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
Hi Siegmar,

It seems that the runtime environment is messed up with the different versions of Open MPI. I suggest you completely remove all the installations and install 1.6.1 again (just build the installation project again). It should work without any problem under Cygwin too.

Shiqing

On 2012-08-27 4:02 PM, Siegmar Gross wrote:

Hi,

thank you very much for your reply. I compiled and installed openmpi-1.6.1. Unfortunately I cannot compile programs because "mpicc" uses wrong path names. I have set an environment for openmpi-1.6.1 as you can see from the following output.

D:\...prog\mpi\small_prog>set | c:\cygwin\bin\grep openmpi
LIB=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64;
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\atlmfc\lib\amd64;
C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Lib\x64;
C:\Program Files\openmpi-1.6.1\lib
LIBPATH=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64;
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\atlmfc\lib\amd64;
C:\Program Files\openmpi-1.6.1\lib
Path=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64;
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcpackages;
C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\;
C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Tools\;
C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\NETFX 4.0 Tools\x64;
C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\x64;
C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\;
C:\Windows\System32;
C:\Windows;
C:\Windows\System32\Wbem;
C:\Program Files\openmpi-1.6.1\bin;
C:\cmd;.

I get the following error when I try to compile my program because of "/LIBPATH:C:\Program Files (x86)\openmpi-1.6/lib".

D:\...\prog\mpi\small_prog>mpicc init_finalize.c
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
init_finalize.c
Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
/out:init_finalize.exe
"/LIBPATH:C:\Program Files (x86)\openmpi-1.6/lib" libmpi.lib libopen-pal.lib libopen-rte.lib advapi32.lib Ws2_32.lib shlwapi.lib init_finalize.obj
init_finalize.obj : error LNK2019: unresolved external symbol "__imp_MPI_Finalize" referenced in function "main"
init_finalize.obj : error LNK2019: unresolved external symbol "__imp_MPI_Init" referenced in function "main"
init_finalize.exe : fatal error LNK1120: 2 unresolved externals

When I start in a new command shell without my MPI environment, I get the following outputs for "mpicc -show". The first one is OK, but both others are wrong because they point to 32-bit libraries instead of 64-bit ones. Why do both versions point to openmpi-1.6? I downloaded and installed the precompiled 32- and 64-bit version 1.6 from open-mpi.org.

C:\Program Files>openmpi-1.5.1\bin\mpicc -show
cl.exe /I"C:/Program Files/openmpi-1.5.1/include" /TC /D "OMPI_IMPORTS" /D "OPAL_IMPORTS" /D "ORTE_IMPORTS" /link /LIBPATH:"C:/Program Files/openmpi-1.5.1/lib" libmpi.lib libopen-pal.lib libopen-rte.lib advapi32.lib Ws2_32.lib shlwapi.lib

C:\Program Files>openmpi-1.6\bin\mpicc -show
cl.exe /I"C:\Program Files (x86)\openmpi-1.6/include" /TC /DOMPI_IMPORTS /DOPAL_IMPORTS /DORTE_IMPORTS /link /LIBPATH:"C:\Program Files (x86)\openmpi-1.6/lib" libmpi.lib libopen-pal.lib libopen-rte.lib advapi32.lib Ws2_32.lib shlwapi.lib

C:\Program Files>openmpi-1.6.1\bin\mpicc -show
cl.exe /I"C:\Program Files (x86)\openmpi-1.6/include" /TC /DOMPI_IMPORTS /DOPAL_IMPORTS /DORTE_IMPORTS /link /LIBPATH:"C:\Program Files (x86)\openmpi-1.6/lib" libmpi.lib libopen-pal.lib libopen-rte.lib advapi32.lib Ws2_32.lib shlwapi.lib

Do you have any idea what I have done wrong? Thank you very much for any help in advance.
Kind regards
Siegmar

I didn't have this problem when building the binary release. But solving the problem is very easy: just open \openmpi-1.6.1\ompi\mca\osc\rdma\osc_rdma_data_move.c, go to line 1099, and change "void*" to "void**". This will get rid of the error. As for the warnings, they are just some redefinitions that cannot be avoided; they are totally harmless.

Regards,
Shiqing

On 2012-08-27 1:02 PM, Siegmar Gross wrote:

Hi,

I tried to compile openmpi-1.6.1 with CMake-2.8.3 and Visual Studio 2010 on Windows 7. All service packs and patches from Microsoft are installed. I changed the following options:

CMAKE_BUILD_TYPE: "Debug" modified to "Release"
CMAKE_INSTALL_PREFIX: modified to "c:/Program Files (x86)/openmpi-1.6.1"
OMPI_ENABLE_THREAD_MULTIPLE: "no" changed to "yes"
OMPI_RELEASE_BUILD: "no" changed to "yes"
OPAL_ENABLE_HETEROGENEOUS_SUPPORT: "no" changed to "yes"
OPAL_ENABLE_IPV6: "yes" changed to "no"
OPAL_ENABLE_MULTI_THREADS: "no" changed to "yes"

I also selected "Release" in "Visual Studio".
Re: [OMPI users] Application with mxm hangs on startup
Dear prof. Aleksey,

My system is a 32-bit Ubuntu system. Can the MXM version you gave me be used on it?

Best regards,
Liang Wenke

------------------ Original ------------------
From: "Aleksey Senin"
Date: Tue, Aug 28, 2012 04:19 PM
To: "pavel.mezentsev"
Cc: "users"
Subject: [OMPI users] Application with mxm hangs on startup

Please download MXM version http://mellanox.com/downloads/hpc/mxm/v1.1/mxm_1.1.1328.tar
This version was checked against OMPI-1.6.2 (http://svn.open-mpi.org/svn/ompi/branches/v1.6). In case of any failure, could you enclose the output?

Regards,
Aleksey.
Re: [OMPI users] Application with mxm hangs on startup
Dear prof. Aleksey,

Thank you very much. Failure output files such as "config.log" and "make.out" are in the attachment 'lwkmpi.zip'.

------------------ Original ------------------
From: "Aleksey Senin"
Date: Tue, Aug 28, 2012 04:19 PM
To: "pavel.mezentsev"
Cc: "users"
Subject: [OMPI users] Application with mxm hangs on startup

Please download MXM version http://mellanox.com/downloads/hpc/mxm/v1.1/mxm_1.1.1328.tar
This version was checked against OMPI-1.6.2 (http://svn.open-mpi.org/svn/ompi/branches/v1.6). In case of any failure, could you enclose the output?

Regards,
Aleksey.
[OMPI users] Application with mxm hangs on startup
Please download MXM version http://mellanox.com/downloads/hpc/mxm/v1.1/mxm_1.1.1328.tar

This version was checked against OMPI-1.6.2 (http://svn.open-mpi.org/svn/ompi/branches/v1.6). In case of any failure, could you enclose the output?

Regards,
Aleksey.
[OMPI users] lwkmpi
Hi everybody,

I'm trying to compile openmpi with the intel compiler 11.1.07 on Ubuntu. I have compiled openmpi many times and could always find the problem. But the error that I'm getting now gives me no clue where to even search for it. It seems I have succeeded in configuring. But when I try "make all", it always shows the problems below:

make[7]: Entering directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
/opt/intel/Compiler/11.1/072/bin/ia32/icpc -DHAVE_CONFIG_H -I. -I../../..
-DINSIDE_OPENMPI
-I/home/lwk//mnt/Software/openmpi-1.6.1/opal/mca/hwloc/hwloc132/hwloc/include
-I/usr/include/infiniband -I/usr/include/infiniband -DOPARI_VT
-O3 -DNDEBUG -finline-functions -pthread -MT opari-ompragma_c.o -MD -MP -MF
.deps/opari-ompragma_c.Tpo -c -o opari-ompragma_c.o `test -f 'ompragma_c.cc'
|| echo './'`ompragma_c.cc
/usr/include/c++/4.5/iomanip(64): error: expected an expression
{ return { __mask }; }
         ^
/usr/include/c++/4.5/iomanip(94): error: expected an expression
{ return { __mask }; }
         ^
/usr/include/c++/4.5/iomanip(125): error: expected an expression
{ return { __base }; }
         ^
/usr/include/c++/4.5/iomanip(193): error: expected an expression
{ return { __n }; }
         ^
/usr/include/c++/4.5/iomanip(223): error: expected an expression
{ return { __n }; }
         ^
/usr/include/c++/4.5/iomanip(163): error: expected an expression
{ return { __c }; }
         ^
compilation aborted for ompragma_c.cc (code 2)
make[7]: *** [opari-ompragma_c.o] Error 2
make[7]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari/tool'
make[6]: *** [all-recursive] Error 1
make[6]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools/opari'
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt/tools'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt/vt'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi/contrib/vt'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/mnt/Software/openmpi-1.6.1/ompi'
make: *** [all-recursive] Error 1

My files "configure.log" and "make.out" are in the attachment lwkmpi.zip.

with best regards
Liang Wenke
Re: [OMPI users] Infiniband performance Problem and stalling
Randolph, after reading this:

On 08/28/12 04:26, Randolph Pullen wrote:
> - On occasions it seems to stall indefinitely, waiting on a single receive.
> ...

I would make a blind guess: are you aware of the IB card parameters for registered memory?
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

"Waiting forever" for a single operation is one of the symptoms of this problem, especially in 1.5.3.

best,
Paul

P.S. The lower performance with 'big' chunks is a known phenomenon, cf. http://www.scl.ameslab.gov/netpipe/ (image at the bottom of the page). But a chunk size of 64k is fairly small.

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915
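The FAQ entry above gives the formula for how much memory the mlx4 driver can register: max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size. A quick sanity check of that arithmetic; the parameter values below are illustrative, not read from any real card:

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on registerable memory for mlx4 HCAs, per the Open MPI
    openfabrics FAQ: (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size

# Illustrative values: log_num_mtt=20 and log_mtts_per_seg=3 on 4 KiB pages
# allow registering 2^(20+3+12) bytes = 32 GiB.
gib = max_registerable_bytes(20, 3) / 2**30
print(f"{gib:.0f} GiB")
```

If the result is well below twice the node's physical RAM, the FAQ recommends raising the mlx4_core module parameters, which matches the stall symptom described above.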