Re: [OMPI users] running external program on same processor (Fortran)
Surely this is the problem of the scheduler that your system uses, rather than MPI?

On Wed, 2010-03-03 at 00:48 +, abc def wrote:
> Hello,
>
> I wonder if someone can help.
>
> The situation is that I have an MPI-parallel fortran program. I run it
> and it's distributed on N cores, and each of these processes must call
> an external program.
>
> This external program is also an MPI program, however I want to run it
> in serial, on the core that is calling it, as if it were part of the
> fortran program. The fortran program waits until the external program
> has completed, and then continues.
>
> The problem is that this external program seems to run on any core,
> and not necessarily the (now idle) core that called it. This slows
> things down a lot as you get one core doing multiple tasks.
>
> Can anyone tell me how I can call the program and ensure it runs only
> on the core that's calling it? Note that there are several cores per
> node. I can ID the node by running the hostname command (I don't know
> a way to do this for individual cores).
>
> Thanks!
>
> Extra information that might be helpful:
>
> If I simply run the external program from the command line (ie, type
> "/path/myprogram.ex"), it runs fine. If I run it within the
> fortran program by calling it via
>
> CALL SYSTEM("/path/myprogram.ex")
>
> it doesn't run at all (doesn't even start) and everything crashes. I
> don't know why this is.
>
> If I call it using mpiexec:
>
> CALL SYSTEM("mpiexec -n 1 /path/myprogram.ex")
>
> then it does work, but I get the problem that it can go on any core.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
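One way to keep the spawned program on the caller's core, not discussed in the thread, is to pin it explicitly. A hedged sketch using taskset (Linux, util-linux); the wrapper name is made up, and a real code would pass in the core id the calling MPI rank is actually bound to rather than the placeholder 0:

```shell
# Hypothetical wrapper the Fortran code could invoke via CALL SYSTEM.
# run_on_core pins whatever command follows to the given core.
run_on_core() {
    core="$1"; shift
    taskset -c "$core" "$@"
}

# Placeholder invocation; the real call would be something like:
#   run_on_core "$mycore" mpiexec -n 1 /path/myprogram.ex
run_on_core 0 true
```

Because the affinity mask is inherited by child processes, pinning the wrapper pins the external program too.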
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
It works after creating a new PE, and even from the command prompt without using SGE.
Thanks
Rangam

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Reuti [re...@staff.uni-marburg.de]
Sent: Tuesday, March 02, 2010 12:35 PM
To: Open MPI Users
Subject: Re: [OMPI users] Option to use only 7 cores out of 8 on each node

On 02.03.2010 at 19:26, Eugene Loh wrote:

> Eugene Loh wrote:
>
>> Addepalli, Srirangam V wrote:
>>
>>> i tried using the following syntax with machinefile
>>> mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e
>>
>> It "works" for me. I'm not using SGE, though.

When it's tightly integrated with SGE, maybe you need a PE with a fixed allocation rule of 7. Then all should work automatically and without any need of a machinefile for mpiexec. If you want to use the node exclusively for your job although you want only 7 out of 8 available slots, you also need to request an exclusive resource (e.g. named "exclusive") which is attached to each exechost.

-- Reuti

>> % cat machinefile
>> % mpirun -tag-output -np 14 -npernode 7 -machinefile machinefile hostname
>
> Incidentally, the key ingredient here is the "-npernode 7" part.
> The machine file only needs enough slots. E.g., you could have had:
>
> % cat machinefile
> node0 slots=20
> node1 slots=20
>
> mpirun will see that there are enough slots on each node, but load
> only 7 up per node due to the -npernode switch.
>
> That said, I don't know what's going wrong in your case -- only
> that things work as advertised for me.
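For reference, a tightly integrated PE with a fixed allocation rule of 7, along the lines Reuti suggests, might be defined roughly like this (a sketch: the PE name is made up and the slot count is site-specific; the attribute names are standard SGE parallel-environment fields):

```
pe_name            mpi7
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    7
control_slaves     TRUE
job_is_first_task  FALSE
```

With a fixed allocation rule, SGE hands each node exactly 7 slots, so mpiexec needs no machinefile at all.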
[OMPI users] running external program on same processor (Fortran)
Hello,

I wonder if someone can help.

The situation is that I have an MPI-parallel fortran program. I run it and it's distributed on N cores, and each of these processes must call an external program.

This external program is also an MPI program, however I want to run it in serial, on the core that is calling it, as if it were part of the fortran program. The fortran program waits until the external program has completed, and then continues.

The problem is that this external program seems to run on any core, and not necessarily the (now idle) core that called it. This slows things down a lot as you get one core doing multiple tasks.

Can anyone tell me how I can call the program and ensure it runs only on the core that's calling it? Note that there are several cores per node. I can ID the node by running the hostname command (I don't know a way to do this for individual cores).

Thanks!

Extra information that might be helpful:

If I simply run the external program from the command line (ie, type "/path/myprogram.ex"), it runs fine. If I run it within the fortran program by calling it via

CALL SYSTEM("/path/myprogram.ex")

it doesn't run at all (doesn't even start) and everything crashes. I don't know why this is.

If I call it using mpiexec:

CALL SYSTEM("mpiexec -n 1 /path/myprogram.ex")

then it does work, but I get the problem that it can go on any core.
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
On 02.03.2010 at 19:26, Eugene Loh wrote:

> Eugene Loh wrote:
>
>> Addepalli, Srirangam V wrote:
>>
>>> i tried using the following syntax with machinefile
>>> mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e
>>
>> It "works" for me. I'm not using SGE, though.

When it's tightly integrated with SGE, maybe you need a PE with a fixed allocation rule of 7. Then all should work automatically and without any need of a machinefile for mpiexec. If you want to use the node exclusively for your job although you want only 7 out of 8 available slots, you also need to request an exclusive resource (e.g. named "exclusive") which is attached to each exechost.

-- Reuti

>> % cat machinefile
>> % mpirun -tag-output -np 14 -npernode 7 -machinefile machinefile hostname
>
> Incidentally, the key ingredient here is the "-npernode 7" part.
> The machine file only needs enough slots. E.g., you could have had:
>
> % cat machinefile
> node0 slots=20
> node1 slots=20
>
> mpirun will see that there are enough slots on each node, but load
> only 7 up per node due to the -npernode switch.
>
> That said, I don't know what's going wrong in your case -- only
> that things work as advertised for me.
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
Eugene Loh wrote:

> Addepalli, Srirangam V wrote:
>
>> i tried using the following syntax with machinefile
>> mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e
>
> It "works" for me. I'm not using SGE, though.
>
> % cat machinefile
> % mpirun -tag-output -np 14 -npernode 7 -machinefile machinefile hostname

Incidentally, the key ingredient here is the "-npernode 7" part. The machine file only needs enough slots. E.g., you could have had:

% cat machinefile
node0 slots=20
node1 slots=20

mpirun will see that there are enough slots on each node, but load only 7 up per node due to the -npernode switch.

That said, I don't know what's going wrong in your case -- only that things work as advertised for me.
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
Correct, I was not clear. It spawns more than 7 processes per node (it spawns 8 of them).
Rangam

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
Sent: Tuesday, March 02, 2010 11:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Option to use only 7 cores out of 8 on each node

When you say "it fails", what do you mean? That it doesn't run at all, or that it still fills each node, or...?

On Tue, Mar 2, 2010 at 9:49 AM, Addepalli, Srirangam V wrote:

> Hello All.
> I am trying to run a parallel application that should use one core less
> than the no of cores that are available on the system. Are there any flags
> that i can use to specify this.
>
> i tried using the following syntax with machinefile
>
> openmpi-1.4-BM/bin/mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
Addepalli, Srirangam V wrote:

> i tried using the following syntax with machinefile
> mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e

It "works" for me. I'm not using SGE, though.

% cat machinefile
node0
node0
node0
node0
node0
node0
node0
node1
node1
node1
node1
node1
node1
node1
% mpirun -tag-output -np 14 -npernode 7 -machinefile machinefile hostname
[1,0]:node0
[1,1]:node0
[1,2]:node0
[1,3]:node0
[1,4]:node0
[1,5]:node0
[1,6]:node0
[1,7]:node1
[1,8]:node1
[1,9]:node1
[1,10]:node1
[1,11]:node1
[1,12]:node1
[1,13]:node1
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
When you say "it fails", what do you mean? That it doesn't run at all, or that it still fills each node, or...?

On Tue, Mar 2, 2010 at 9:49 AM, Addepalli, Srirangam V <srirangam.v.addepa...@ttu.edu> wrote:

> Hello All.
> I am trying to run a parallel application that should use one core less
> than the no of cores that are available on the system. Are there any flags
> that i can use to specify this.
>
> i tried using the following syntax with machinefile
>
> openmpi-1.4-BM/bin/mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e
>
> We get two nodes (with 16 cores) allocated from SGE
> and we want to use only 14 cores out of the 16 allocated.
>
> Rangam
>
> My machine file has
>
> compute-9-8.local
> compute-9-8.local
> compute-9-8.local
> compute-9-8.local
> compute-9-8.local
> compute-9-8.local
> compute-9-8.local
> compute-9-6.local
> compute-9-6.local
> compute-9-6.local
> compute-9-6.local
> compute-9-6.local
> compute-9-6.local
> compute-9-6.local
[OMPI users] Option to use only 7 cores out of 8 on each node
Hello All.

I am trying to run a parallel application that should use one core less than the number of cores that are available on the system. Are there any flags that I can use to specify this?

I tried using the following syntax with machinefile:

openmpi-1.4-BM/bin/mpirun -np 14 -npernode 7 -machinefile machinefile ven_nw.e
Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)
On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos wrote:
> Hello,
>
> I'm trying to come up with a fault tolerant OpenMPI setup for research
> purposes. I'm doing some tests now, but I'm stuck with a segfault when
> I try to restart my test program from a checkpoint.
>
> My test program is the "ring" program, where messages are sent to the
> next node in the ring N times. It's pretty simple, I can supply the
> source code if needed. I'm running it like this:
>
> # mpirun -np 4 -am ft-enable-cr ring
> ...
> Process 1 sending 703 to 2
> Process 3 received 704
> Process 3 sending 704 to 0
> Process 3 received 703
> Process 3 sending 703 to 0
> --
> mpirun noticed that process rank 0 with PID 18358 on node debian1
> exited on signal 0 (Unknown signal 0).
> --
> 4 total processes killed (some possibly by mpirun during cleanup)
>
> That's the output when I ompi-checkpoint the mpirun PID from another terminal.
>
> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
> checkpoint directory has been created in $HOME.
>
> This is what I get when I try to run ompi-restart:
>
> root@debian1:~# ps ax | grep mpirun
> 18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
> 18378 pts/5 S+ 0:00 grep mpirun
> root@debian1:~# ompi-checkpoint 18357
> Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-checkpoint --term 18357
> Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
> --
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_2.ckpt). Returned -1.
> --
> [debian1:18384] *** Process received signal ***
> [debian1:18384] Signal: Segmentation fault (11)
> [debian1:18384] Signal code: Address not mapped (1)
> [debian1:18384] Failing at address: 0x725f725f
> [debian1:18384] [ 0] [0xb775f40c]
> [debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
> [debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
> [debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
> [debian1:18384] [ 4] opal-restart [0x804908e]
> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
> [debian1:18384] [ 6] opal-restart [0x8048fc1]
> [debian1:18384] *** End of error message ***
> --
> mpirun noticed that process rank 2 with PID 18384 on node debian1
> exited on signal 11 (Segmentation fault).
> --
>
> I used a clean install of Debian Squeeze (testing) to make sure my
> environment was ok. Those are the steps I took:
>
> - Installed Debian Squeeze, only base packages
> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build tools,
>   BLCR dev and run-time environment)
> - Compiled openmpi-1.4.1
>
> Note that I did compile openmpi-1.4.1 because the Debian package
> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
> are no leftovers from any previous install of Debian packages
> supplying OpenMPI because this is a fresh install; no openmpi package
> had been installed before.
>
> I used the following configure options:
>
> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>
> I also tried to add the option --with-memory-manager=none because I
> saw an e-mail on the mailing list that described this as a possible
> solution to an (apparently) unrelated problem, but the problem
> remains the same.
>
> I don't have config.log (I rm'ed the build dir), but if you think it's
> necessary I can recompile OpenMPI and provide it.
>
> Some information about the system (a VirtualBox virtual machine,
> single processor, btw):
>
> Kernel version 2.6.32-trunk-686
>
> root@debian1:~# lsmod | grep blcr
> blcr 79084 0
> blcr_imports 2077 1 blcr
>
> libcr (BLCR) is version 0.8.2-9.
>
> gcc is version 4.4.3.
>
> Please let me know of any other information you might need.
>
> Thanks in advance,

Hello,

I figured it out. The problem is that the Debian package blcr-util, which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.), wasn't installed. I believe OpenMPI could perhaps show a more descriptive message instead of segfaulting, though? Also, you might want to add that information to the FAQ. Anyway, I'm filing another Debian bug report.

For the sake of completeness, here's some more information:

- I forgot to mention that I've installed OpenMPI to /usr/local, so I'm setting LD_LIBRARY_PATH to
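Given Fernando's diagnosis, a quick sanity check before attempting ompi-restart is to confirm the BLCR user-level tools are actually installed. A hedged sketch (the tool names are the standard BLCR binaries; that ompi-restart needs both on the PATH is an assumption based on this thread):

```shell
# Report whether the BLCR command-line tools are on the PATH;
# ompi-restart shells out to cr_restart to resume each process image.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_tool cr_restart
check_tool cr_checkpoint
```

On Debian, a "missing" result would point at installing the blcr-util package, per Fernando's fix.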
Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneouscluster (32/64 bit machines)
Did you configure Open MPI with --enable-heterogeneous?

On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:

> Hello,
>
> I have some problems running MPI on my heterogeneous cluster. More
> precisely, I got a segmentation fault when sending a large array (about
> 1) of double from a i686 machine to a x86_64 machine. It does not
> happen with small arrays. Here is the send/recv code (complete
> source is in the attached file; the loop bound and status argument,
> lost in the archive, are reconstructed):
>
> if (me == 0) {
>     for (int pe = 1; pe < nproc; pe++) {
>         printf("Receiving from proc %d : ", pe); fflush(stdout);
>         d = (double *)malloc(sizeof(double)*n);
>         MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>         printf("OK\n"); fflush(stdout);
>     }
>     printf("All done.\n");
> } else {
>     d = (double *)malloc(sizeof(double)*n);
>     MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
> }
>
> I got a segmentation fault with n=1 but no error with n=1000.
> I have 2 machines:
> sbtn155 : Intel Xeon, x86_64
> sbtn211 : Intel Pentium 4, i686
>
> The code is compiled on the x86_64 and i686 machines, using OpenMPI
> 1.4.1, installed in /tmp/openmpi:
>
> [mhtrinh@sbtn211 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
>
> [mhtrinh@sbtn155 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
>
> I run the code using an appfile and got these errors:
>
> $ cat appfile
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn211 -np 1 hetero.i686
>
> $ mpirun -hetero --app appfile
> Input array length :
> 1
> Receiving from proc 1 : OK
> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> [sbtn155:26386] Signal: Segmentation fault (11)
> [sbtn155:26386] Signal code: Address not mapped (1)
> [sbtn155:26386] Failing at address: 0x200627bd8
> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d7908]
> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2e2fc6e3]
> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2afe39db]
> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2afd8b9e]
> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d4b25]
> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2ab30f9b]
> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> [sbtn155:26386] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> exited on signal 11 (Segmentation fault).
> --
>
> Am I missing an option needed to run on a heterogeneous cluster?
> Do MPI_Send/Recv have an array size limit when using a heterogeneous cluster?
> Thanks for your help. Regards
>
> --
> M. TRINH Minh Hieu
> CEA, IBEB, SBTN/LIRM,
> F-30207 Bagnols-sur-Cèze, FRANCE

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
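For anyone hitting the same crash: as Jeff's question implies, heterogeneous (mixed 32-/64-bit) support in Open MPI is a build-time option, not just the runtime -hetero flag. A build sketch, assuming the same /tmp/openmpi prefix the poster uses:

```
./configure --prefix=/tmp/openmpi --enable-heterogeneous
make all install
```

Both the i686 and x86_64 installations would need to be built this way for data conversion between the architectures to work.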
[OMPI users] MPI_Comm_accept() busy waiting?
Hi,

I've recently been trying to develop a client-server distributed file system (for my thesis) using MPI. The communication between the machines is working great; however, whenever the MPI_Comm_accept() function is called, the server starts consuming 100% of the CPU. One interesting thing is that I tried to compile the same code using the LAM/MPI library, and the mentioned behaviour could not be observed. Is this a bug?

On a side note, I'm using Ubuntu 9.10's default OpenMPI deb package. Its version is 1.3.2.

Regards,
Ramon.
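Open MPI's progress engine polls aggressively while waiting for events, which is consistent with MPI_Comm_accept() showing 100% CPU, whereas LAM/MPI blocked. A possible mitigation (not a fix) in the 1.3 series is to ask Open MPI to yield the processor when idle; this trades message latency for idle CPU, and whether it helps here is an assumption:

```
mpirun --mca mpi_yield_when_idle 1 -np 1 ./server
```

The server binary name is a placeholder for the poster's program.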
Re: [OMPI users] Leftover session directories [was sm btl choices]
I found the problem - the orted wasn't whacking any lingering session directories when it exited. Missing one line... sigh.

Rolf: I have submitted a patch for the 1.4 branch. Can you please review? It is a trivial fix.

David: Thanks for bringing it to my attention. Sorry for the problem.

Ralph

On Mar 1, 2010, at 2:34 PM, Rolf Vandevaart wrote:

> On 03/01/10 11:51, Ralph Castain wrote:
>> On Mar 1, 2010, at 8:41 AM, David Turner wrote:
>>> On 3/1/10 1:51 AM, Ralph Castain wrote:
>>>> Which version of OMPI are you using? We know that the 1.2 series
>>>> was unreliable about removing the session directories, but 1.3 and
>>>> above appear to be quite good about it. If you are having problems
>>>> with the 1.3 or 1.4 series, I would definitely like to know about it.
>>>
>>> Oops; sorry! OMPI 1.4.1, compiled with PGI 10.0 compilers,
>>> running on Scientific Linux 5.4, ofed 1.4.2.
>>>
>>> The session directories are *frequently* left behind. I have
>>> not really tried to characterize under what circumstances they
>>> are removed. But please confirm: they *should* be removed by
>>> OMPI.
>>
>> Most definitely - they should always be removed by OMPI. This is the first
>> report we have had of them -not- being removed in the 1.4 series, so it is
>> disturbing.
>> What environment are you running under? Does this happen under normal
>> termination, or under abnormal failures (the more you can tell us, the
>> better)?
>
> Hi Ralph:
>
> It turns out that I am seeing session directories left behind as well with
> v1.4 (r22713). I have not tested any other versions. I believe there are two
> elements that make this reproducible:
> 1. Run across 2 or more nodes.
> 2. CTRL-C out of the MPI job.
>
> Then take a look at the remote nodes and you may see a leftover session
> directory. The mpirun node seems to be clean.
>
> Here is an example using two nodes. I also added some sleeps to the ring_c
> program to slow things down so I could hit CTRL-C.
>
> First, the tmp directories are empty:
>
> [rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> [rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
>
> Now run the test:
>
> [rolfv@burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c
> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> mpirun: killing job...
>
> --
> mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6
> exited on signal 0 (Unknown signal 0).
> --
> 4 total processes killed (some possibly by mpirun during cleanup)
> mpirun: clean termination accomplished
>
> [burl-ct-x2200-6:02990] 2 more processes have sent help message
> help-mpi-btl-openib.txt / default subnet prefix
>
> Now check the tmp directories:
>
> [rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> [rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
> total 8
> drwx-- 3 rolfv hpcgroup 4096 Mar 1 17:27 20007/
>
> Rolf
>
> --
> =
> rolf.vandeva...@sun.com
> 781-442-3043
> =
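Until a patched release is available, leftover session directories on remote nodes can be cleared by hand. A hedged sketch that only prints the commands (dry run); the host names are taken from Rolf's example and would differ per site:

```shell
# Print (dry-run) the cleanup command for each compute node's /tmp.
# Remove the echo to actually run the commands over ssh.
HOSTS="burl-ct-x2200-6 burl-ct-x2200-7"
for h in $HOSTS; do
    echo ssh "$h" 'rm -rf /tmp/openmpi-sessions-$USER*'
done
```

Run this only when no MPI jobs are active, since live jobs use those directories.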