Re: [OMPI users] MPI_COMPLEX16
On 05/23/2012 07:30 PM, Patrick Le Dot wrote:
> David Singleton <David.Singleton anu.edu.au> writes:
>> I should have checked earlier - same for MPI_COMPLEX and MPI_COMPLEX8.
>>
>> David
>>
>> On 04/27/2012 08:43 AM, David Singleton wrote:
>>> Apologies if this has already been covered somewhere. One of our users has noticed that MPI_COMPLEX16 is flagged as an invalid type in 1.5.4 but not in 1.4.3, while MPI_DOUBLE_COMPLEX is accepted by both. This is with either gfortran or intel-fc. ...
>
> Hi,
> I hit the same problem: MPI_COMPLEX8 and MPI_COMPLEX16 were available in v1.4 but were removed in v1.5, and I don't understand why, except that these types are not in the standard... I have a patch to reintroduce them, so let me know what you think.

I would very much appreciate seeing that patch.

Thanks,
David
Re: [OMPI users] MPI_COMPLEX16
I should have checked earlier - same for MPI_COMPLEX and MPI_COMPLEX8.

David

On 04/27/2012 08:43 AM, David Singleton wrote:
> Apologies if this has already been covered somewhere. One of our users has noticed that MPI_COMPLEX16 is flagged as an invalid type in 1.5.4 but not in 1.4.3, while MPI_DOUBLE_COMPLEX is accepted by both. This is with either gfortran or intel-fc.
> [... quotes the configure output, test transcript and test program in full; see the original message below ...]
[OMPI users] MPI_COMPLEX16
Apologies if this has already been covered somewhere. One of our users has noticed that MPI_COMPLEX16 is flagged as an invalid type in 1.5.4 but not in 1.4.3, while MPI_DOUBLE_COMPLEX is accepted by both. This is with either gfortran or intel-fc. Superficially, the configure looks the same for 1.4.3 and 1.5.4, eg.

% grep COMPLEX16 opal/include/opal_config.h
#define OMPI_HAVE_F90_COMPLEX16 1
#define OMPI_HAVE_FORTRAN_COMPLEX16 1

Their test code (appended below) produces:

% module load openmpi/1.4.3
% mpif90 mpi_complex_test.f90
% mpirun -np 2 ./a.out
 SUM1 (3.00,-1.00)
 SUM2 (3.00,-1.00)
% module swap openmpi/1.5.4
% mpif90 mpi_complex_test.f90
% mpirun -np 2 ./a.out
[vayu1:1935] *** An error occurred in MPI_Reduce
[vayu1:1935] *** on communicator MPI_COMM_WORLD
[vayu1:1935] *** MPI_ERR_TYPE: invalid datatype
[vayu1:1935] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
 SUM1 (3.00,-1.00)

Thanks for any help,
David

program mpi_test
   implicit none
   include 'mpif.h'
   integer, parameter :: r8 = selected_real_kind(12)
   complex(kind=r8) :: local, global
   integer :: ierr, myid, nproc

   call MPI_INIT (ierr)
   call MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
   call MPI_COMM_SIZE (MPI_COMM_WORLD, nproc, ierr)

   local = cmplx(myid+1.0, myid-1.0, kind=r8)

   call MPI_REDUCE (local, global, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
   if ( myid == 0 ) then
      print*, 'SUM1', global
   end if

   call MPI_REDUCE (local, global, 1, MPI_COMPLEX16, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
   if ( myid == 0 ) then
      print*, 'SUM2', global
   end if

   call MPI_FINALIZE (ierr)

end program mpi_test
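A side note on the naming, which the thread takes for granted: MPI_COMPLEX16 and MPI_COMPLEX8 are *optional* datatypes in the MPI standard (hence their removal being defensible), and the "16" is the total byte size - two 8-byte reals, exactly the layout of the test program's complex(kind=r8) and of MPI_DOUBLE_COMPLEX. A quick sanity check of that arithmetic (Python standing in for the C/Fortran layout; not part of the original thread):

```python
import struct

# A Fortran complex(kind=r8), r8 = selected_real_kind(12), is two IEEE
# double-precision reals stored contiguously: 16 bytes in total.
double_complex_size = struct.calcsize("dd")  # real part + imaginary part
print(double_complex_size)  # 16 -> hence the name MPI_COMPLEX16

# MPI_COMPLEX8 is the analogous 8-byte complex: two 4-byte reals.
complex8_size = struct.calcsize("ff")
print(complex8_size)  # 8
```

So on these platforms MPI_COMPLEX16 and MPI_DOUBLE_COMPLEX describe the same 16-byte value, which is why swapping one for the other works in the test program.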
Re: [OMPI users] openmpi/pbsdsh/Torque problem
On 04/04/2011 12:56 AM, Ralph Castain wrote:
> What I still don't understand is why you are trying to do it this way. Why not just run
>
> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>
> where machineN contains the names of the nodes where you want the MPI apps to execute? mpirun will only execute apps on those nodes, so this accomplishes the same thing as your script - only with a lot less pain. Your script would just contain a sequence of these commands, each with its number of procs and machinefile as required.

Maybe I missed why this suggestion of forgetting about the ssh/pbsdsh altogether was not feasible? Just use mpirun (with its great tm support!) to distribute MPI jobs. A simple example:

vayu1:~/MPI > qsub -lncpus=24,vmem=24gb,walltime=10:00 -wd -I
qsub: waiting for job 574900.vu-pbs to start
qsub: job 574900.vu-pbs ready

[dbs900@v250 ~/MPI]$ wc -l $PBS_NODEFILE
24
[dbs900@v250 ~/MPI]$ head -12 $PBS_NODEFILE > m1
[dbs900@v250 ~/MPI]$ tail -12 $PBS_NODEFILE > m2
[dbs900@v250 ~/MPI]$ mpirun --machinefile m1 ./a2a143 12 30 & mpirun --machinefile m2 ./pp143

Check how the processes are distributed ...
vayu1:~ > qps 574900.vu-pbs
Node 0: v250:
  PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
11420 S   2104  10396  0.0 00:00:00  0.0 -tcsh
11421 S    620  10552  0.0 00:00:00  0.0 pbs_demux
12471 S   2208  49324  0.0 00:00:00  0.9 /apps/openmpi/1.4.3/bin/mpirun --machinefile m1 ./a2a143 12 30
12472 S   2116  49312  0.0 00:00:00  0.0 /apps/openmpi/1.4.3/bin/mpirun --machinefile m2 ./pp143
12535 R 270160 565668  1.0 00:00:02 82.4 ./a2a143 12 30
12536 R 270032 565536  1.0 00:00:02 81.4 ./a2a143 12 30
12537 R 270012 565528  1.0 00:00:02 87.3 ./a2a143 12 30
12538 R 269992 565532  1.0 00:00:02 93.3 ./a2a143 12 30
12539 R 269980 565516  1.0 00:00:02 81.4 ./a2a143 12 30
12540 R 270008 565516  1.0 00:00:02 86.3 ./a2a143 12 30
12541 R 270008 565516  1.0 00:00:02 96.3 ./a2a143 12 30
12542 R 272064 567568  1.0 00:00:02 91.3 ./a2a143 12 30
Node 1: v251:
  PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
10367 S   1872  40648  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 113440 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "113440.0;tcp://10.1.3.58:37339"
10368 S   1868  40648  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
10372 R 271112 567556  1.0 00:00:04 74.5 ./a2a143 12 30
10373 R 271036 567564  1.0 00:00:04 71.5 ./a2a143 12 30
10374 R 271032 567560  1.0 00:00:04 66.5 ./a2a143 12 30
10375 R 273112 569612  1.1 00:00:04 68.5 ./a2a143 12 30
10378 R 552280 840712  2.2 00:00:04  100 ./pp143
10379 R 552280 840708  2.2 00:00:04  100 ./pp143
10380 R 552328 841576  2.2 00:00:04  100 ./pp143
10381 R 552788 841216  2.2 00:00:04 99.3 ./pp143
Node 2: v252:
  PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
10152 S   1908  40780  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "1444347904.0;tcp://10.1.3.58:39610"
10156 R 552384 840200  2.2 00:00:07 99.3 ./pp143
10157 R 551868 839692  2.2 00:00:06 99.3 ./pp143
10158 R 551400 839184  2.2 00:00:07  100 ./pp143
10159 R 551436 839184  2.2 00:00:06 98.3 ./pp143
10160 R 551760 839692  2.2 00:00:07  100 ./pp143
10161 R 551788 839824  2.2 00:00:07 97.3 ./pp143
10162 R 552256 840332  2.2 00:00:07  100 ./pp143
10163 R 552216 840340  2.2 00:00:07 99.3 ./pp143

You would have to do something smarter to get correct process binding etc.
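The head/tail split of $PBS_NODEFILE above generalizes to any partitioning of a job's slots into per-mpirun machinefiles. A small sketch of that bookkeeping (Python; the node names and slot counts are illustrative, mirroring the 24-slot example above):

```python
def split_nodefile(nodes, sizes):
    """Split the ordered slot list read from $PBS_NODEFILE into
    consecutive chunks, one machinefile per mpirun invocation."""
    if sum(sizes) > len(nodes):
        raise ValueError("requested more slots than the job owns")
    chunks, start = [], 0
    for n in sizes:
        chunks.append(nodes[start:start + n])
        start += n
    return chunks

# e.g. a 24-slot job (8 slots each on 3 nodes) split 12/12,
# exactly what 'head -12' and 'tail -12' did in the transcript
nodes = ["v250"] * 8 + ["v251"] * 8 + ["v252"] * 8
m1, m2 = split_nodefile(nodes, [12, 12])
print(len(m1), len(m2))  # 12 12
```

Each chunk would then be written out, one hostname per line, as the --machinefile argument of its mpirun.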
Re: [OMPI users] openmpi/pbsdsh/Torque problem
> You can prove this to yourself rather easily. Just ssh to a remote node and execute any command that lingers for awhile - say something simple like "sleep". Then kill the ssh and do a "ps" on the remote node. I guarantee that the command will have died.

...

vayu1:~ > ssh v37 sleep 600 &
[1] 30145
vayu1:~ > kill -9 30145
[1]  + Suspended (tty input)   ssh v37 sleep 600
vayu1:~ >
[1]    Killed                  ssh v37 sleep 600
vayu1:~ > ssh v37 ps aux | grep dbs900 | grep sleep
dbs900   18774  0.0  0.0  9360 1348 ?  Ss  07:12  0:00 /bin/tcsh -c sleep 600
dbs900   18806  0.0  0.0  3800  480 ?  S   07:12  0:00 sleep 600
Re: [OMPI users] Sending large broadcasts
Hi Brock,

That message should only be 2MB. Are you sure it's not a mismatch of message lengths in the MPI_Bcast calls?

David

On 01/04/2011 03:47 AM, Brock Palen wrote:
> I have a user who reports that sending a broadcast of 540*1080 reals (just over 2GB) fails with this:
>
> *** An error occurred in MPI_Bcast
> *** on communicator MPI_COMM_WORLD
> *** MPI_ERR_TRUNCATE: message truncated
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> I was reading the archives and there appears to be an issue with large messages. I was a little confused - is there a way to send messages larger than 2GB? The user has access to some IB machines; per a note in the archives there was an issue with writev(). Would this issue only be related to messages over ethernet? Thanks, just trying to get some clarification.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
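For what it's worth, the arithmetic backs David's suspicion: 540*1080 single-precision reals is about 2.2 MB, not 2 GB. The genuine 2 GB limit people hit is different - MPI count arguments are C ints, so a single call cannot describe more than 2^31-1 elements; the usual workaround is to split a huge buffer across several calls (or build a contiguous derived datatype). A sketch of that chunking logic (pure Python, no MPI; the function name is illustrative):

```python
INT_MAX = 2**31 - 1

def bcast_chunk_counts(total_elems, max_count=INT_MAX):
    """Split an element count that may exceed the 32-bit 'int count'
    limit of an MPI call into a list of per-call counts."""
    counts = []
    remaining = total_elems
    while remaining > 0:
        n = min(remaining, max_count)
        counts.append(n)
        remaining -= n
    return counts

# the message from this thread: 540*1080 4-byte reals
print(540 * 1080 * 4)  # 2332800 bytes, about 2.2 MB - well under any limit

# a genuinely huge buffer (3 GiB of elements) needs several calls
print(len(bcast_chunk_counts(3 * 2**30, max_count=2**30)))  # 3
```

So if this job really failed at ~2 MB, the mismatch-of-lengths explanation (MPI_ERR_TRUNCATE) is the more likely culprit than any large-message limit.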
Re: [OMPI users] Open MPI vs IBM MPI performance help
http://www.open-mpi.org/faq/?category=running#oversubscribing

On 12/03/2010 06:25 AM, Price, Brian M (N-KCI) wrote:
> Additional testing seems to show that the problem is related to barriers and how often they poll to determine whether or not it's time to leave. Is there some MCA parameter or environment variable that allows me to control the frequency of polling while in barriers?
>
> Thanks,
> Brian Price
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Price, Brian M (N-KCI)
> Sent: Wednesday, December 01, 2010 11:29 AM
> To: Open MPI Users
> Cc: Stern, Craig J
> Subject: EXTERNAL: [OMPI users] Open MPI vs IBM MPI performance help
>
> OpenMPI version: 1.4.3
> Platform: IBM P5, 32 processors, 256 GB memory, Symmetric Multi-Threading (SMT) enabled
> Application: starts up 48 processes and does MPI using MPI_Barrier, MPI_Get, MPI_Put (lots of transfers, large amounts of data)
> Issue: When implemented using Open MPI vs. IBM's MPI ('poe' from HPC Toolkit), the application runs 3-5 times slower. I suspect that IBM's MPI implementation must take advantage of some knowledge that it has about data transfers that Open MPI is not taking advantage of. Any suggestions?
>
> Thanks,
> Brian Price
Re: [OMPI users] Memory affinity
On 09/28/2010 06:52 AM, Tim Prince wrote:
> On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:
>> Hi Tim, I have read that link, but I haven't understood whether enabling processor affinity also enables memory affinity, because it is written that: "Note that memory affinity support is enabled only when processor affinity is enabled". Can I set processor affinity without memory affinity? This is my question.
>>
>> 2010/9/27 Tim Prince:
>>> On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>>>> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.
>>> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a useful introduction to affinity. It's available in a default build, but not enabled by default.
>
> Memory affinity is implied by processor affinity. Your system libraries are set up so as to cause any memory allocated to be made local to the processor, if possible. That's one of the primary benefits of processor affinity. Not being an expert in openmpi, I assume, in the absence of further easily accessible documentation, there's no useful explicit way to disable maffinity while using paffinity on platforms other than the specified legacy platforms.

Memory allocation policy really needs to be independent of processor binding policy. The default memory policy (memory affinity) of "attempt to allocate on the NUMA node of the cpu that made the allocation request, but fall back as needed" is flawed in a number of situations. This is true even when MPI jobs are given dedicated access to processors. A common one is where the local NUMA node is full of pagecache pages (from the checkpoint of the last job to complete). For sites that support suspend/resume based scheduling, NUMA nodes will generally contain pages from suspended jobs. Ideally, the new (suspending) job should suffer a little paging overhead (pushing out the suspended job) to get ideal memory placement for the next 6 or however many hours of execution. An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not work when one process requires more memory than that local NUMA node holds. One scenario is a master-slave setup where you might want:

  master (rank 0) bound to processor 0 but not memory bound
  slave (rank i) bound to processor i and memory bound to the local memory of processor i

They really are independent requirements.

Cheers,
David
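The master/slave scenario above can be stated as a tiny per-rank policy function. This is a sketch of the *requirement* only (pure Python with made-up names), not of any existing Open MPI or kernel interface:

```python
def placement(rank):
    """Desired binding for the master/slave scenario: every rank is
    CPU-bound, but only slaves are memory-bound (MPOL_BIND style) to
    the NUMA node local to their CPU; the master (rank 0) keeps the
    default fallback policy so its large allocations may spill across
    NUMA nodes instead of failing or paging."""
    cpu = rank  # one process per processor
    membind = None if rank == 0 else f"local-numa-node-of-cpu-{rank}"
    return cpu, membind

print(placement(0))  # (0, None): CPU binding only
print(placement(3))
```

The point of the sketch is that the two fields vary independently, which is exactly what a combined "paffinity implies maffinity" setting cannot express.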
Re: [OMPI users] spin-wait backoff
On 09/03/2010 10:05 PM, Jeff Squyres wrote:
> On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:
>> Backing off the polling rate requires more application-specific logic like that offered below, so it is a little difficult for us to implement at the MPI library level. Not saying we eventually won't - just not sure anyone quite knows how to do so in a generalized form.
>
> FWIW, we've *talked* about this kind of stuff among the developers -- it's at least somewhat similar to the "backoff to blocking communications instead of polling communications" issues. That work in particular has been discussed for a long time but never implemented.
>
> Are your jobs hanging because of deadlock (i.e., application error), or infrastructure error? If they're hanging because of deadlock, there are some PMPI-based tools that might be able to help.

These are application deadlocks (like the well-known case of VASP calling MPI_Finalize when it should be calling MPI_Abort!). But I'm asking as a system manager with dozens of apps run by dozens of users hanging and not being noticed for a day or two, because users are not attentive and, from outside the job, everything looks OK. So the problem is detection. Are you suggesting there are PMPI approaches we could apply to every production job on the system?

I now have a hack to opal_progress that seems to do what we want, without any impact on performance in the "good" case. It basically involves keeping count of the number of consecutive calls to opal_progress with no events completed. When that hits a large number (e.g. 10^9), sleeping (maybe up to a second) on every, say, 10^3-10^4 passes through opal_progress seems to do "the right thing". (Obviously, any event completion resets everything to spinning.) There are a few magic numbers there that would need to be overridable by users. Please let me know if this idea is blatantly flawed.

Thanks,
David
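The opal_progress hack described above amounts to a small state machine. A sketch of just that logic (plain Python, no Open MPI internals; the thresholds mirror the "magic numbers" in the message and are illustrative only):

```python
class BackoffPoller:
    """Spin at full speed until 'idle_threshold' consecutive event-free
    progress calls have occurred, then sleep on every 'sleep_every'-th
    call. Any completed event resets the poller to pure spinning."""

    def __init__(self, idle_threshold=10**9, sleep_every=10**3,
                 sleep_fn=lambda: None):
        self.idle_threshold = idle_threshold
        self.sleep_every = sleep_every
        self.sleep_fn = sleep_fn  # e.g. lambda: time.sleep(1) in real use
        self.idle_calls = 0

    def progress(self, events_completed):
        if events_completed:
            self.idle_calls = 0  # back to spinning at full speed
            return "spin"
        self.idle_calls += 1
        if (self.idle_calls >= self.idle_threshold
                and self.idle_calls % self.sleep_every == 0):
            self.sleep_fn()
            return "sleep"
        return "spin"

# tiny thresholds so the transition is visible
poller = BackoffPoller(idle_threshold=3, sleep_every=2)
print([poller.progress(0) for _ in range(4)])
# ['spin', 'spin', 'spin', 'sleep']
```

A hung job running this logic would drop to near-zero CPU after the threshold, which is exactly the "self-evident hang" behaviour the original post asks for, while a healthy job (frequent event completions) never leaves the spin path.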
[OMPI users] spin-wait backoff
I'm sure this has been discussed before, but having watched hundreds of thousands of CPU-hours being wasted by difficult-to-detect hung jobs, I'd be keen to know why there isn't some sort of "spin-wait backoff" option - for example, a way to specify spin-waiting for x seconds/cycles/iterations, then backing off to lighter and lighter CPU usage. At least that way, hung jobs would become self-evident. Maybe there is already some way of doing this?

Thanks,
David
Re: [OMPI users] Open MPI 1.4.2 released
On 05/28/2010 08:20 AM, Jeff Squyres wrote:
> On May 16, 2010, at 5:21 AM, Aleksej Saushev wrote:
>> http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/parallel/openmpi/patches/
>
> Sorry for the high latency reply... We haven't added RPATH support yet. We've talked about it but never done it. There are some in OMPI who insist that rpath support needs to be optional. A full patch solution would be appreciated.

We have problems with rpath overriding LD_RUN_PATH. LD_RUN_PATH is an intrinsic part of the way we configure our users' environment. We effectively use (impose) rpath, but through the flexible, concatenatable LD_RUN_PATH.

David
Re: [OMPI users] Hide Abort output
Yes, Dick has isolated the issue - novice users often believe Open MPI (not their application) had a problem. Anything along the lines he suggests can only help.

David

On 04/01/2010 01:12 AM, Richard Treumann wrote:
> I do not know what the OpenMPI message looks like or why people want to hide it. It should be phrased to avoid any implication of a problem with OpenMPI itself. How about something like this:
>
> "The application has called MPI_Abort. The application is terminated by OpenMPI as the application demanded."
>
> Dick Treumann - MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363
>
> From: "Jeff Squyres (jsquyres)"
> Date: 03/31/2010 06:43 AM
> Subject: Re: [OMPI users] Hide Abort output
>
> At present there is no such feature, but it should not be hard to add. Can you guys be a little more specific about exactly what you are seeing and exactly what you want to see? (And what version you're working with - I'll caveat my discussion that this may be a 1.5-and-forward thing)
>
> -jms
> Sent from my PDA. No type good.
>
> ----- Original Message -----
> From: users-boun...@open-mpi.org
> To: Open MPI Users
> Sent: Wed Mar 31 05:38:48 2010
> Subject: Re: [OMPI users] Hide Abort output
>
> I have to say this is a very common issue for our users. They repeatedly report the long Open MPI MPI_Abort() message in help queries and fail to look for the application error message about the root cause. A short MPI_Abort() message that said "look elsewhere for the real error message" would be useful.
>
> Cheers,
> David
>
> On 03/31/2010 07:58 PM, Yves Caniou wrote:
>> Dear all,
>> I am using the MPI_Abort() command in an MPI program. I would like not to see the note explaining that the command caused Open MPI to kill all the jobs and so on. I thought that I could find a --mca parameter, but couldn't grep it. The only ones I found deal with the delay and with printing more information (the stack). Is there a means to avoid the printing of the note (except the 2>/dev/null trick)? Or to delay this printing?
>> Thank you.
>> .Yves.
Re: [OMPI users] Hide Abort output
I have to say this is a very common issue for our users. They repeatedly report the long Open MPI MPI_Abort() message in help queries and fail to look for the application error message about the root cause. A short MPI_Abort() message that said "look elsewhere for the real error message" would be useful.

Cheers,
David

On 03/31/2010 07:58 PM, Yves Caniou wrote:
> Dear all,
> I am using the MPI_Abort() command in an MPI program. I would like not to see the note explaining that the command caused Open MPI to kill all the jobs and so on. I thought that I could find a --mca parameter, but couldn't grep it. The only ones I found deal with the delay and with printing more information (the stack). Is there a means to avoid the printing of the note (except the 2>/dev/null trick)? Or to delay this printing?
> Thank you.
> .Yves.
Re: [OMPI users] Parallel file write in fortran (+mpi)
It's definitely not a bug in Lustre - it's an essential part of any "coherent" cluster filesystem (CXFS, QFS, GFS, GPFS, ...). The whole point is that some people actually want to have meaningful (non-garbage) files accessed using properly managed parallel IO techniques. The locking is necessary in this case. You would see the same issues on any of those filesystems.

David

On 02/03/2010 11:37 AM, Laurence Marks wrote:
> Agreed that it is not good (and I am recoding some programs to avoid this), but (and here life gets interesting) is this a "bug" in Lustre?
>
> On Tue, Feb 2, 2010 at 5:59 PM, David Singleton <david.single...@anu.edu.au> wrote:
>> But it's a very bad idea on a "coherent", "POSIX" filesystem like Lustre. Locks have to bounce around between the nodes for every write. This can be VERY slow (even for trivial amounts of "logging" IO) and thrash the filesystem for other users. So, yes, at our site, we include this sort of "parallel IO" on our list of disallowed behaviour. Not a good practice to adopt in general.
>>
>> David
>>
>> On 02/03/2010 10:40 AM, Laurence Marks wrote:
>>> I know it's wrong, but I don't think it is forbidden (which I guess is what you are saying).
>>>
>>> On Tue, Feb 2, 2010 at 5:31 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>> +1 on Nick's responses. AFAIK, if you don't mind getting garbage in the output file, it should be fine to do. Specifically: it should not cause OS issues (crash, reboot, corrupted filesystem, etc.) to do this -- but the file contents will likely be garbage. That being said, this situation likely falls into the "Doc, it hurts when I do this..." category. Meaning: you know it's wrong, so you probably shouldn't be doing it anyway. :-)
>>>>
>>>> On Feb 2, 2010, at 4:50 PM, Nicolas Bock wrote:
>>>>> Hi Laurence,
>>>>> I don't know whether it's as bad as a deadly sin, but for us parallel writes are a huge problem and we get complete garbage in the file. Take a look at: Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided Communication, Robert Latham, Robert Ross, Rajeev Thakur, International Journal of High Performance Computing Applications, 21, 132 (2007). They describe an implementation of a "mutex"-like object in MPI. If you protect writes to the file with an exclusive lock you can serialize the writes and make use of NFS's close-to-open cache coherence.
>>>>> nick
>>>>>
>>>>> On Tue, Feb 2, 2010 at 08:27, Laurence Marks <l-ma...@northwestern.edu> wrote:
>>>>>> I have a question concerning having many processors in an mpi job all write to the same file -- not using mpi calls but with standard fortran I/O. I know that this can lead to consistency issues, but it can also lead to OS issues with some flavors of nfs. At least in fortran, there is nothing "wrong" with doing this. My question is whether this is "One of the Seven Deadly Sins" of mpi programming, or just frowned upon. (That is, it should be OK even if it leads to nonsense files, and not lead to OS issues.) If it is a sin, I would appreciate a link to where this is spelt out in some "official" document or similar.
>>>>>>
>>>>>> --
>>>>>> Laurence Marks
>>>>>> Department of Materials Science and Engineering
>>>>>> MSE Rm 2036 Cook Hall, 2220 N Campus Drive
>>>>>> Northwestern University, Evanston, IL 60208, USA
>>>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
>>>>>> email: L-marks at northwestern dot edu
>>>>>> Web: www.numis.northwestern.edu
>>>>>> Chair, Commission on Electron Crystallography of IUCR
>>>>>> www.numis.northwestern.edu/
>>>>>> Electron crystallography is the branch of science that uses electron scattering and imaging to study the structure of matter.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Parallel file write in fortran (+mpi)
But it's a very bad idea on a "coherent", "POSIX" filesystem like Lustre. Locks have to bounce around between the nodes for every write. This can be VERY slow (even for trivial amounts of "logging" IO) and thrash the filesystem for other users. So, yes, at our site, we include this sort of "parallel IO" on our list of disallowed behaviour. Not a good practice to adopt in general.

David

On 02/03/2010 10:40 AM, Laurence Marks wrote:
> I know it's wrong, but I don't think it is forbidden (which I guess is what you are saying).
> [... same quoted exchange between Jeff Squyres, Nicolas Bock and Laurence Marks as in the previous message ...]
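The "complete garbage in the file" effect Nick describes, and its cure - serializing writers with a mutex, as in the Latham/Ross/Thakur paper - can be demonstrated with plain threads. In this sketch Python's threading.Lock stands in for the MPI one-sided mutex described in that paper (the record format is made up for illustration):

```python
import threading

def write_records(buf, rank, n, lock=None):
    """Append n records character-by-character, emulating a writer
    whose bytes may interleave with other writers' unless serialized."""
    for i in range(n):
        record = f"rank{rank}:{i};"
        if lock:
            with lock:  # exclusive access for the whole record
                for ch in record:
                    buf.append(ch)
        else:  # unprotected: records from different ranks can interleave
            for ch in record:
                buf.append(ch)

def run(lock):
    buf = []  # stands in for the shared file
    threads = [threading.Thread(target=write_records, args=(buf, r, 50, lock))
               for r in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return "".join(buf)

# with the lock, every one of the 4*50 records survives intact
out = run(threading.Lock())
print(len([r for r in out.split(";") if r]))  # 200
```

Without the lock the byte streams of the four writers may interleave arbitrarily, which is exactly the nonsense-file outcome the thread describes - and on a coherent filesystem like Lustre, the kernel-level locking that keeps each write() atomic is what bounces between nodes and kills performance.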
Re: [OMPI users] exceedingly virtual memory consumption of MPI, environment if higher-setting "ulimit -s"
I think the issue is that if you *don't* specifically use pthread_attr_setstacksize, the pthread library will (can?) give each thread a stack of size equal to the stacksize rlimit. You are correct - this is not specifically an Open MPI issue, although if it is Open MPI spawning the threads, maybe it should be actively setting the stack size with pthread_attr_setstacksize. But there is an unfortunate tendency for users to set a very large (or unlimited) stacksize without realising these sorts of repercussions. Try catching/looking at the stacktrace of an application that is in infinite recursion with a 1GB+ stacksize after it finally gets a SEGV ...

Jeff Squyres wrote:
> I can't think of what OMPI would be doing related to the predefined stack size -- I am not aware of anywhere in the code where we look up the predefined stack size and then do something with it. That being said, I don't know what the OS and resource consumption effects are of setting a 1GB+ stack size on *any* application... Have you tried non-MPI examples, potentially with applications as large as MPI applications but without the complexity of MPI?
>
> On Nov 19, 2009, at 3:13 PM, David Singleton wrote:
>> Depending on the setup, threads often get allocated a thread-local stack with size equal to the stacksize rlimit. Two threads maybe?
>>
>> David
>>
>> Terry Dontje wrote:
>>> A couple of things to note. First, Sun MPI 8.2.1 is effectively OMPI 1.3.4. I also reproduced the issue below using a C code, so I think this is a general issue with OMPI and not Fortran-based.
>>>
>>> I did a pmap of a process and there were two anon spaces equal to the stack space set by ulimit. In one case (setting 102400) the anon spaces were next to each other, prior to all the loadable libraries. In another case (setting 1024000) one anon space was located in the same area as in the first case but the second space was deep into some memory used by ompi.
>>>
>>> Is any of this possibly related to the predefined handles? Though I am not sure why it would expand based on stack size.
>>>
>>> --td
>>>
>>>> Date: Thu, 19 Nov 2009 19:21:46 +0100
>>>> From: Paul Kapinos <kapi...@rz.rwth-aachen.de>
>>>> Subject: [OMPI users] exceedingly virtual memory consumption of MPI environment if higher-setting "ulimit -s"
>>>> [... quotes Paul Kapinos's report in full; see the following message ...]
Re: [OMPI users] exceedingly virtual memory consumption of MPI, environment if higher-setting "ulimit -s"
Depending on the setup, threads often get allocated a thread local stack with size equal to the stacksize rlimit. Two threads maybe? David Terry Dontje wrote: A couple things to note. First Sun MPI 8.2.1 is effectively OMPI 1.3.4. I also reproduced the below issue using a C code so I think this is a general issue with OMPI and not Fortran based. I did a pmap of a process and there were two anon spaces equal to the stack space set by ulimit. In one case (setting 102400) the anon spaces were next to each other prior to all the loadable libraries. In another case (setting 1024000) one anon space was locate in the same area as the first case but the second space was deep into some memory used by ompi. Is any of this possibly related to the predefined handles? Though I am not sure why it would expand based on stack size?. --td Date: Thu, 19 Nov 2009 19:21:46 +0100 From: Paul KapinosSubject: [OMPI users] exceedingly virtual memory consumption of MPI environment if higher-setting "ulimit -s" To: Open MPI Users Message-ID: <4b058cba.3000...@rz.rwth-aachen.de> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed" Hi volks, we see an exeedingly *virtual* memory consumtion through MPI processes if "ulimit -s" (stack size)in profile configuration was setted higher. Furthermore we believe, every mpi process started, wastes about the double size of `ulimit -s` value which will be set in a fresh console (that is, the value is configurated in e.g. .zshenv, *not* the value actually setted in the console from which the mpiexec runs). Sun MPI 8.2.1, an empty mpi-HelloWorld program ! either if running both processes on the same host.. .zshenv: ulimit -s 10240 --> VmPeak:180072 kB .zshenv: ulimit -s 102400 --> VmPeak:364392 kB .zshenv: ulimit -s 1024000 --> VmPeak:2207592 kB .zshenv: ulimit -s 2024000 --> VmPeak:4207592 kB .zshenv: ulimit -s 2024 --> VmPeak: 39.7 GB (see the attached files; the a.out binary is a mpi helloworld program running an never ending loop). 
Normally, the stack size ulimit is set to some 10 MB by us, but we see a lot of codes which need *a lot* of stack space, e.g. Fortran codes, OpenMP codes (and especially Fortran OpenMP codes). Users tend to hard-code a higher stack size ulimit in their profiles. Normally, using a lot of virtual memory is no problem, because there is a lot of it :-) But... if more than one person is allowed to work on a computer, you have to divide the resources in such a way that nobody can crash the box. We do not know how to limit the real RAM used, so we have to divide the RAM by setting a virtual memory ulimit (e.g. in our batch system). That is, for us "virtual memory consumption" = "real memory consumption". And real memory is not as cheap as virtual memory. So, why consume *twice* the stack size for each process? And why consume the virtual memory at all? We guess this virtual memory is allocated for the stack (why else would it be related to the stack size ulimit). But is such allocation really needed? Is there a way to avoid the waste of virtual memory? best regards, Paul Kapinos
Re: [OMPI users] custom modules per job (PBS/OpenMPI/environment-modules)
Hi Ralph, Now I'm in a quandary - if I show you that it's actually Open MPI that is propagating the environment, then you are likely to "fix it" and tm users will lose a nice feature. :-) Can I suggest that "least surprise" would require that MPI tasks get exactly the same environment/limits/... as mpirun, so that "mpirun a.out" behaves just like "a.out". [Following this principle we modified tm_spawn to propagate the caller's rlimits to the spawned tasks.] A comment in orterun.c (see below) suggests that Open MPI is trying to distinguish between "local" and "remote" processes. I would have thought that distinction should be invisible to users as much as possible - a user asking for 4 cpus would like to see the same behaviour whether all 4 are local or "2 local, 2 remote". As to why tm does "The Right Thing": in the case of rsh/ssh the full mpirun environment is given to the rsh/ssh process locally, while in the tm case it is an argument to tm_spawn and so gets given to the process (in this case orted) being launched remotely. Relevant lines from 1.3.3 below. PBS just passes along the environment it is told to. We don't use Torque, but as of 2.3.3 it was still the same as OpenPBS in this respect. Michael just pointed out the slight flaw: the environment should be somewhat selectively propagated (exclude HOSTNAME etc). I guess if you were to "fix" plm_tm_module, I would put the propagation behaviour in tm_spawn and try to handle these exceptional cases. Cheers, David orterun.c: 510 /* save the environment for launch purposes. 
This MUST be 511 * done so that we can pass it to any local procs we 512 * spawn - otherwise, those local procs won't see any 513 * non-MCA envars that were set in the enviro prior to calling 514 * orterun 515 */ 516 orte_launch_environ = opal_argv_copy(environ); plm_rsh_module.c: 681 /* actually ssh the child */ 682 static void ssh_child(int argc, char **argv, 683 orte_vpid_t vpid, int proc_vpid_index) 684 { 694 /* setup environment */ 695 env = opal_argv_copy(orte_launch_environ); 766 execve(exec_path, exec_argv, env); plm_tm_module.c: 128 static int plm_tm_launch_job(orte_job_t *jdata) 129 { 228 /* setup environment */ 229 env = opal_argv_copy(orte_launch_environ); 311 rc = tm_spawn(argc, argv, env, node->launch_id, tm_task_ids + launched, tm_events + launched); Ralph Castain wrote: Not exactly. It completely depends on how Torque was set up - OMPI isn't forwarding the environment. Torque is. We made a design decision at the very beginning of the OMPI project not to forward non-OMPI envars unless directed to do so by the user. I'm afraid I disagree with Michael's claim that other MPIs do forward them - yes, MPICH does, but not all others do. The world is bigger than MPICH and OMPI :-) Since there is inconsistency in this regard between MPIs, we chose not to forward. The reason was simple: there is no way to know what is safe to forward vs what is not (e.g., what to do with DISPLAY), nor what the underlying environment is trying to forward vs what it isn't. It is very easy to get cross-wise and cause totally unexpected behavior, as users have complained about for years. First, if you are using a managed environment like Torque, we recommend that you work with your sys admin to decide how to configure it. This is the best way to resolve a problem. Second, if you are not using a managed environment and/or decide not to have that environment do the forwarding, you can tell OMPI to forward the envars you need by specifying them via the -x cmd line option. 
We already have a request to expand this capability, and I will be doing so as time permits. One option I'll be adding is the reverse of -x - i.e., "forward all envars -except- the specified one(s)". HTH ralph
Re: [OMPI users] custom modules per job (PBS/OpenMPI/environment-modules)
I can see the difference - we built Open MPI with tm support. For some reason, I thought mpirun fed its environment to orted (after orted is launched) so orted can pass it on to MPI tasks. That should be portable between different launch mechanisms. But it looks like tm launches orted with the full mpirun environment (at the request of mpirun). Cheers, David Michael Sternberg wrote: Hi David, Hmm, your demo is well-chosen and crystal-clear, yet the output is unexpected. I do not see environment vars passed by default here: login3$ qsub -l nodes=2:ppn=1 -I qsub: waiting for job 34683.mds01 to start qsub: job 34683.mds01 ready n102$ mpirun -n 2 -machinefile $PBS_NODEFILE hostname n102 n085 n102$ mpirun -n 2 -machinefile $PBS_NODEFILE env | grep FOO n102$ export FOO=BAR n102$ mpirun -n 2 -machinefile $PBS_NODEFILE env | grep FOO FOO=BAR n102$ type mpirun mpirun is hashed (/opt/soft/openmpi-1.3.2-intel10-1/bin/mpirun) Curious, what do you get upon: where mpirun I built OpenMPI-1.3.2 here from source with: CC=icc CXX=icpc FC=ifort F77=ifort \ LDFLAGS='-Wl,-z,noexecstack' \ CFLAGS='-O2 -g -fPIC' \ CXXFLAGS='-O2 -g -fPIC' \ FFLAGS='-O2 -g -fPIC' \ ./configure --prefix=$prefix \ --with-libnuma=/usr \ --with-openib=/usr \ --with-udapl \ --enable-mpirun-prefix-by-default \ --without-tm I didn't find the behavior I saw strange, given that orterun(1) talks only about $OMPI_* and inheritance from the remote shell. It also mentions a "boot MCA module", about which I couldn't find much on open-mpi.org - hmm. In the meantime, I did find a possible solution, namely, to tell ssh to pass a variable using SendEnv/AcceptEnv. That variable is then seen by and can be interpreted (cautiously) in /etc/profile.d/ scripts. A user could set it in the job file (or even qalter it post-submission): #PBS -v VARNAME=foo:bar:baz For VARNAME, I think simply "MODULES" or "EXTRAMODULES" could do. 
With best regards, Michael On Nov 17, 2009, at 4:29 , David Singleton wrote: I'm not sure why you dont see Open MPI behaving like other MPI's w.r.t. modules/environment on remote MPI tasks - we do. xe:~ > qsub -q express -lnodes=2:ppn=8,walltime=10:00,vmem=2gb -I qsub: waiting for job 376366.xepbs to start qsub: job 376366.xepbs ready [dbs900@x27 ~]$ module load openmpi [dbs900@x27 ~]$ mpirun -n 2 --bynode hostname x27 x28 [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO [dbs900@x27 ~]$ setenv FOO BAR [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO FOO=BAR FOO=BAR [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber [dbs900@x27 ~]$ module load amber [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber LOADEDMODULES=openmpi/1.3.3:amber/9 PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin::/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe _LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9 AMBERHOME=/apps/amber/9 LOADEDMODULES=openmpi/1.3.3:amber/9 PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin:/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe _LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9 AMBERHOME=/apps/amber/9 David Michael Sternberg wrote: Dear readers, With OpenMPI, how would one go about requesting to load environment modules (of the http://modules.sourceforge.net/ kind) on remote nodes, augmenting those normally loaded there by shell dotfiles? Background: I run a RHEL-5/CentOS-5 cluster. I load a bunch of default modules through /etc/profile.d/ and recommend to users to customize modules in ~/.bashrc. A problem arises for PBS jobs which might need job-specific modules, e.g., to pick a specific flavor of an application. With other MPI implementations (ahem) which export all (or judiciously nearly all) environment variables by default, you can say: #PBS ... module load foo # not for OpenMPI mpirun -np 42 ... 
\ bar-app Not so with OpenMPI - any such customization is only effective for processes on the master (=local) node of the job, and any variables changed by a given module would have to be specifically passed via mpirun -x VARNAME. On the remote nodes, those variables are not available in the dotfiles because they are passed only once orted is live (after dotfile processing by the shell), which then immediately spawns the application binaries (right?) I thought along the following lines: (1) I happen to run Lustre, which would allow writing a file coherently across nodes prior to mpirun, and thus hook into the shell dotfile processing, but that seems rather crude. (2) "mpirun -x PATH -x LD_LIBRARY_PATH …" would take care of a lot, but is not really general. Is there a recommended way? regards, Michael
Re: [OMPI users] custom modules per job (PBS/OpenMPI/environment-modules)
Hi Michael, I'm not sure why you dont see Open MPI behaving like other MPI's w.r.t. modules/environment on remote MPI tasks - we do. xe:~ > qsub -q express -lnodes=2:ppn=8,walltime=10:00,vmem=2gb -I qsub: waiting for job 376366.xepbs to start qsub: job 376366.xepbs ready [dbs900@x27 ~]$ module load openmpi [dbs900@x27 ~]$ mpirun -n 2 --bynode hostname x27 x28 [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO [dbs900@x27 ~]$ setenv FOO BAR [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO FOO=BAR FOO=BAR [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber [dbs900@x27 ~]$ module load amber [dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber LOADEDMODULES=openmpi/1.3.3:amber/9 PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin::/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe _LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9 AMBERHOME=/apps/amber/9 LOADEDMODULES=openmpi/1.3.3:amber/9 PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin:/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe _LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9 AMBERHOME=/apps/amber/9 David Michael Sternberg wrote: Dear readers, With OpenMPI, how would one go about requesting to load environment modules (of the http://modules.sourceforge.net/ kind) on remote nodes, augmenting those normally loaded there by shell dotfiles? Background: I run a RHEL-5/CentOS-5 cluster. I load a bunch of default modules through /etc/profile.d/ and recommend to users to customize modules in ~/.bashrc. A problem arises for PBS jobs which might need job-specific modules, e.g., to pick a specific flavor of an application. With other MPI implementations (ahem) which export all (or judiciously nearly all) environment variables by default, you can say: #PBS ... module load foo # not for OpenMPI mpirun -np 42 ... 
\ bar-app Not so with OpenMPI - any such customization is only effective for processes on the master (=local) node of the job, and any variables changed by a given module would have to be specifically passed via mpirun -x VARNAME. On the remote nodes, those variables are not available in the dotfiles because they are passed only once orted is live (after dotfile processing by the shell), which then immediately spawns the application binaries (right?) I thought along the following lines: (1) I happen to run Lustre, which would allow writing a file coherently across nodes prior to mpirun, and thus hook into the shell dotfile processing, but that seems rather crude. (2) "mpirun -x PATH -x LD_LIBRARY_PATH …" would take care of a lot, but is not really general. Is there a recommended way? regards, Michael
[OMPI users] bug in MPI_Cart_create?
Looking back through the archives, a lot of people have hit error messages like > [bl302:26556] *** An error occurred in MPI_Cart_create > [bl302:26556] *** on communicator MPI_COMM_WORLD > [bl302:26556] *** MPI_ERR_ARG: invalid argument of some other kind > [bl302:26556] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) One of the reasons people *may* be hitting this is what I believe to be an incorrect test in MPI_Cart_create(): if (0 > reorder || 1 < reorder) { return OMPI_ERRHANDLER_INVOKE (old_comm, MPI_ERR_ARG, FUNC_NAME); } reorder is a "logical" argument and "2.5.2 C bindings" in the MPI 1.3 standard says: Logical flags are integers with value 0 meaning “false” and a non-zero value meaning “true.” So I'm not sure there should be any argument test. We hit this because we (sorta erroneously) were trying to use a GNU build of Open MPI with Intel compilers. gfortran has true=1 while ifort has true=-1. It seems to all work (by luck, I know) except this test. Are there any other tests like this in Open MPI? David
[OMPI users] PBS tm error returns
Maybe this should go to the devel list but I'll start here. In tracking the way the PBS tm API propagates error information back to clients, I noticed that Open MPI is making an incorrect assumption. (I'm looking at 1.3.2.) The relevant code in orte/mca/plm/tm/plm_tm_module.c is: /* TM poll for all the spawns */ for (i = 0; i < launched; ++i) { rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err); if (TM_SUCCESS != rc) { errno = local_err; opal_output(0, "plm:tm: failed to poll for a spawned daemon," " return status = %d", rc); goto cleanup; } } My reading of the way the tm API works is that tm_poll() can (will) return TM_SUCCESS (0) even when the tm_spawn event being waited on failed, i.e. local_err needs to be checked even if rc == 0. It looks like TM_ errors (rc values) come from tm protocol failures or incorrect calls to tm. local_err has to do with why the actual requested action failed and is usually some sort of internal PBSE_ error code. In fact it's probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn(). Something like the following is probably closer to what is needed: /* TM poll for all the spawns */ for (i = 0; i < launched; ++i) { rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err); if (TM_SUCCESS != rc) { errno = local_err; opal_output(0, "plm:tm: failed to poll for a spawned daemon," " return status = %d", rc); goto cleanup; } if (local_err != 0) { errno = local_err; opal_output(0, "plm:tm: failed to spawn daemon," " error code = %d", errno); goto cleanup; } } I checked Torque 2.3.3 to confirm that its tm behaviour is the same as OpenPBS in this respect. No idea about PBSPro. David
Re: [OMPI users] pgi and gcc runtime compatability
I seem to remember Fortran logicals being represented differently in PGI to other Fortran compilers (1 vs -1 maybe - can't remember). Causes grief with things like MPI_Test. David Brock Palen wrote: I did something today that I was happy worked, but I want to know if anyone has had problems with it. At runtime (not compiling), would an OpenMPI built with pgi work to run a code that was compiled with the same version but gcc-built OpenMPI? I tested a few apps today after I accidentally did this and found it worked. They were all C/C++ apps (namd and gromacs), but what about Fortran apps? Should we expect problems if someone does this? I am not going to encourage this, but it is more in case it's needed. Brock Palen www.umich.edu/~brockp Center for Advanced Computing bro...@umich.edu (734)936-1985 ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] job abort on MPI task exit
Apologies if this has been covered in a previous thread - I went back through a lot of posts without seeing anything similar. In an attempt to protect some users from themselves, I was hoping that OpenMPI could be configured so that an MPI task calling exit before calling MPI_Finalize() would cause job cleanup, i.e. behave effectively as if MPI_Abort() was called. The reason is that many users don't realise they need to use MPI_Abort() instead of Fortran stop or C exit. If exit is called, all other processes get stuck in the next blocking call and, for a large walltime-limit batch job, that can be a real waste of resources. I think LAM terminated the job if a task exited with non-zero exit status or due to a signal. OpenMPI appears to clean up only in the case of a signalled task. Ideally, any exit before MPI_Finalize() should be terminal. Why is this not the case? Thanks, David