Re: [OMPI users] Limit to number of processes on one node?
Ralph Castain wrote:
> On Mar 4, 2010, at 7:51 AM, Prentice Bisbal wrote:
>> Ralph Castain wrote:
>>> On Mar 4, 2010, at 7:27 AM, Prentice Bisbal wrote:
>>>> Ralph Castain wrote:
>>>>> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:
>>>>>> Eugene Loh wrote:
>>>>>>> Prentice Bisbal wrote:
>>>>>>>> Eugene Loh wrote:
>>>>>>>>> Prentice Bisbal wrote:
>>>>>>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>>>>>> Depending on which OMPI release you're using, I think you need
>>>>>>>>> something like 4*np up to 7*np (plus a few) descriptors. So, with
>>>>>>>>> 256, you need 1000+ descriptors. You're quite possibly up against
>>>>>>>>> your limit, though I don't know for sure that that's the problem
>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> You say you're running 1.2.8. That's "a while ago", so would you
>>>>>>>>> consider updating as a first step? Among other things, newer OMPIs
>>>>>>>>> will generate a much clearer error message if the descriptor limit
>>>>>>>>> is the problem.
>>>>>>>> While 1.2.8 might be "a while ago", upgrading software just because
>>>>>>>> it's "old" is not a valid argument.
>>>>>>>>
>>>>>>>> I can install the latest version of OpenMPI, but it will take a
>>>>>>>> little while.
>>>>> Maybe not because it is "old", but Eugene is correct. The old versions
>>>>> of OMPI required more file descriptors than the newer versions.
>>>>>
>>>>> That said, you'll still need a minimum of 4x the number of procs on the
>>>>> node even with the latest release. I suggest talking to your sys admin
>>>>> about getting the limit increased. It sounds like it has been set
>>>>> unrealistically low.
>>>> I *am* the system admin! ;)
>>>>
>>>> The file descriptor limit is the default for RHEL, 1024, so I would not
>>>> characterize it as "unrealistically low". I assume someone with much
>>>> more knowledge of OS design and administration than me came up with this
>>>> default, so I'm hesitant to change it without good reason. If there was
>>>> good reason, I'd have no problem changing it. I have read that setting
>>>> it to more than 8192 can lead to system instability.
>>> Never heard that, and most HPC systems have it set a great deal higher
>>> without trouble.
>> I just read that the other day. Not sure where, though. Probably a forum
>> posting somewhere. I'll take your word for it that it's safe to increase
>> if necessary.
>>> However, the choice is yours. If you have a large SMP system, you'll
>>> eventually be forced to change it or severely limit its usefulness for
>>> MPI. RHEL sets it that low arbitrarily as a way of saving memory by
>>> keeping the fd table small, not because the OS can't handle it.
>>>
>>> Anyway, that is the problem. Nothing we (or any MPI) can do about it as
>>> the fd's are required for socket-based communications and to forward I/O.
>> Thanks, Ralph, that's exactly the answer I was looking for - where this
>> limit was coming from.
>>
>> I can see how on a large SMP system the fd limit would have to be
>> increased. In normal circumstances, my cluster nodes should never have
>> more than 8 MPI processes running at once (per node), so I shouldn't be
>> hitting that limit on my cluster.
> Ah, okay! That helps a great deal in figuring out what to advise you. In
> your earlier note, it sounded like you were running all 512 procs on one
> node, so I assumed you had a large single-node SMP.
>
> In this case, though, the problem is solely that you are using the 1.2
> series. In that series, mpirun and each process opened many more sockets to
> all processes in the job. That's why you are overrunning your limit.
>
> Starting with 1.3, the number of sockets being opened on each node is only
> 3 times the number of procs on the node, plus a couple for the daemon. If
> you are using TCP for MPI communications, then each MPI connection will
> open another socket as these messages are direct and not routed.
>
> Upgrading to the 1.4 series should resolve the problem you saw.

After upgrading to 1.4.1, I can start up to 253 processes on one host:

mpirun -np 253 mpihello

This is an increase of ~100 over 1.2.8. When it does fail, it gives a more
useful error message:

$ mpirun -np 254 mpihello
[juno.sns.ias.edu:22862] [[6399,0],0] ORTE_ERROR_LOG: The system limit on
number of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can be
open

This can be resolved by setting the mca parameter opal_set_max_sys_limits
to 1, increasing your limit descriptor setting (using limit or ulimit
commands), or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------

Case closed, court adjourned. Thanks for all the help and explanations.

Prentice

> HTH
> Ralph
Re: [OMPI users] Option to use only 7 cores out of 8 on each node
"Addepalli, Srirangam V" writes:
> It works after creating a new pe and even from the command prompt without
> using SGE.

You shouldn't need anything special -- I don't. (It's common to run, say,
one process per core for benchmarking.) Running

mpirun -tag-output -np 14 -npernode 7 hostname

in 16 slots in the following PE (which confines the run to our 8-core
nodes), it shows 7 processes on each of 2 nodes. The nodes have exclusive
access for MPI jobs, but I don't think that's relevant.

pe_name            openmpi-8
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

In this case, you'd get the same effect allocating processes -bynode as
using -npernode.
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
On Mar 4, 2010, at 10:52 AM, Anthony Chan wrote:
> ----- "Yuanyuan ZHANG" wrote:
>> For an OpenMP/MPI hybrid program, if I only want to make MPI calls using
>> the main thread, ie., only in between parallel sections, can I just use
>> SINGLE or MPI_Init?
> If your MPI calls are NOT within OpenMP directives, MPI does not even know
> you are using threads. So calling MPI_Init is good enough.

This is *not true*. Please read Dick's previous post for a good example of
why this is not the case.

In practice, on most platforms, implementation support for SINGLE and
FUNNELED is identical (true for stock MPICH2, for example). However, Dick's
example of thread-safe versus non-thread-safe malloc options clearly shows
why programs need to request (and check "provided" for) >=FUNNELED in this
scenario if they wish to be truly portable.

-Dave
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
----- "Yuanyuan ZHANG" wrote:
> For an OpenMP/MPI hybrid program, if I only want to make MPI calls
> using the main thread, ie., only in between parallel sections, can I just
> use SINGLE or MPI_Init?

If your MPI calls are NOT within OpenMP directives, MPI does not even know
you are using threads. So calling MPI_Init is good enough.

A.Chan
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
On Mar 4, 2010, at 7:36 AM, Richard Treumann wrote:
> A call to MPI_Init allows the MPI library to return any level of thread
> support it chooses.

This is correct, insofar as the MPI implementation can always choose any
level of thread support.

> This MPI 1.1 call does not let the application say what it wants and does
> not let the implementation reply with what it can guarantee.

Well, sort of. MPI-2.2, sec 12.4.3, page 385, lines 24-25:

--8<--
A call to MPI_INIT has the same effect as a call to MPI_INIT_THREAD with a
required = MPI_THREAD_SINGLE.
--8<--

So even though there is no explicit request and response for thread level
support, it is implicitly asking for MPI_THREAD_SINGLE. Since all
implementations must be able to support at least SINGLE (0 threads running
doesn't really make sense), SINGLE will be provided at a minimum. Callers to
plain-old "MPI_Init" should not expect any higher level of thread support if
they wish to maintain portability.

[...snip...]

> Consider a made up example: Imagine some system supports Mutex lock/unlock
> but with terrible performance. As a work around, it offers a non-standard
> substitute for malloc called st_malloc (single thread malloc) that does
> not do locking.

[...snip...]

Dick's example is a great illustration of why FUNNELED might be necessary.
The moral of the story is "don't lie to the MPI implementation" :)

-Dave
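The portable pattern discussed in this thread -- request FUNNELED explicitly and check what the library actually provides, rather than calling plain MPI_Init and hoping -- can be sketched as below. This is not from any of the posts, just a minimal illustration; it assumes an MPI installation and is compiled with mpicc.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Ask for FUNNELED: multiple threads may exist, but only the main
     * thread makes MPI calls (the OpenMP hybrid case in this thread). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        /* The library only guarantees SINGLE: running OpenMP threads
         * alongside MPI is not supported, so bail out portably. */
        fprintf(stderr, "MPI_THREAD_FUNNELED not available (got %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... OpenMP parallel sections here; MPI calls only in between
     * them, on the main thread ... */

    MPI_Finalize();
    return 0;
}
```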
Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint
There is some overhead involved when activating the current C/R
functionality in Open MPI due to the wrapping of the internal point-to-point
stack. The wrapper (CRCP framework) tracks the signature of each message
(not the buffer, so constant time for any size of MPI message) so that when
we need to quiesce the network we know of all the outstanding messages that
need to be drained.

So there is an overhead, but it should not be as significant as you have
mentioned. I looked at some of the performance aspects in the paper at the
link below:

http://www.open-mpi.org/papers/hpdc-2009/

Though I did not look at HPL explicitly in this paper (just NPB, GROMACS,
and NetPIPE), I have in testing, and the time difference was definitely not
2x (cannot recall the exact differences at the moment).

Can you tell me a bit about your setup:
- What version of Open MPI are you using?
- What configure options are you using?
- What MCA parameters are you using?
- Are you building from a release tarball or an SVN checkout?

-- Josh

On Mar 3, 2010, at 10:07 PM, 马少杰 wrote:
> 2010-03-04
> 马少杰
>
> Dear Sir:
>
> I want to use BLCR and Open MPI to checkpoint; I can now save a checkpoint
> and restart my work successfully. However, I find that the option "--am
> ft-enable-cr" causes a large cost. For example, when I run my HPL job
> without and with the option "--am ft-enable-cr" on 4 hosts (32 processes,
> IB network), the times are 8m21.180s and 16m37.732s respectively. It
> should be noted that I did not save a checkpoint when I ran the job; the
> additional cost is caused by "--am ft-enable-cr" alone. Why does the
> option "--am ft-enable-cr" cause so much system cost? Is it normal? How
> can I solve the problem?
>
> I also tested other MPI applications, and the problem still exists.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] checkpointing multi node and multi process applications
On Mar 4, 2010, at 8:17 AM, Fernando Lemos wrote:
> On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote:
>> Is there anything I can do to provide more information about this bug?
>> E.g. try to compile the code in the SVN trunk? I also have kept the
>> snapshots intact, I can tar them up and upload them somewhere in case
>> you guys need it. I can also provide the source code to the ring
>> program, but it's really the canonical ring MPI example.
>
> I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags).
> This time taking the checkpoint didn't generate any error message:
>
> root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1
> -np 2 --host debian1,debian2 ring
> Process 1 sending 2761 to 0
> Process 1 received 2760
> Process 1 sending 2760 to 0
> root@debian1:~#
>
> But restoring it did:
>
> root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
> [debian1:23129] Error: Unable to access the path
> [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either
> you have not provided a filename or provided an invalid filename.
> Please see --help for usage.
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 23129 on
> node debian1 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> root@debian1:~#
>
> Indeed, opal_snapshot_1.ckpt does not exist:
>
> root@debian1:~# find ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
> ompi_global_snapshot_23071.ckpt/restart-appfile
> ompi_global_snapshot_23071.ckpt/0
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
> root@debian1:~#
>
> It can be found on debian2:
>
> root@debian2:~# find ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/0
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
> root@debian2:~#

By default, Open MPI requires a shared file system to save checkpoint files.
So by default the local snapshot is not moved, since the system assumes that
it is writing to the same directory on a shared file system. If you want to
use the local disk staging functionality (which is known to be broken in the
1.4 series), check out the example on the webpage below:

http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local

> Then I tried supplying a hostfile for ompi-restart and it worked just
> fine! I thought the checkpoint included the hosts information?

We intentionally do not save the hostfile as part of the checkpoint.
Typically folks will want to restart on different nodes than those they
checkpointed on (such as in a batch scheduling environment). If we saved the
hostfile, then it could lead to unexpected user behavior on restart if the
machines that they wish to restart on change. If you need to pass a
hostfile, then you can pass one to ompi-restart just as you would mpirun.

> So I think it's fixed in 1.5.
> Should I try the 1.4 branch in SVN?

The file staging functionality is known to be broken in the 1.4 series at
this time, per the ticket below:

https://svn.open-mpi.org/trac/ompi/ticket/2139

Unfortunately the fix is likely to be both custom for the branch (since we
redesigned the functionality for the trunk and v1.5) and fairly involved. I
don't have the time at the moment to work on a fix, but hopefully in the
coming months I will be able to look into this issue. In the meantime,
patches are always welcome :)

Hope that helps,
Josh

> Thanks a bunch,
Re: [OMPI users] noob warning - problems testing MPI_Comm_spawn
Thanks Shiqing. I'll check out a trunk copy and try that.

Damien

On 04/03/2010 7:29 AM, Shiqing Fan wrote:
> Hi Damien,
>
> Sorry for the late reply; I was trying to dig inside the code and got some
> information. First of all, in your example, it's not correct to define the
> MPI_Info as a pointer; it will cause an initialization violation at run
> time. The message "LOCAL DAEMON SPAWN IS CURRENTLY UNSUPPORTED" is just a
> warning which won't block the execution.
>
> In order to make the master-slave example work, you have to disable the
> CCP support; it seems to have conflicts with the comm_spawn operation, and
> I'm still checking it. To disable CCP in Open MPI 1.4.1, you have to
> exclude the source files manually, i.e. excluding the ccp files in
> orte/mca/plm/ccp and orte/mca/ras/ccp, then also remove the ccp related
> lines in orte/mca/plm/base/static-components.h and
> orte/mca/ras/base/static-components.h. There is an option to do so in the
> trunk version, but not for 1.4.1. Sorry for the inconvenience.
>
> For the "singleton" run with master.exe, it's still not working under
> Windows.
>
> Best Regards,
> Shiqing
>
> Damien Hocking wrote:
>> Hi all,
>>
>> I'm playing around with MPI_Comm_spawn, trying to do something simple
>> with a master-slave example. I get a LOCAL DAEMON SPAWN IS CURRENTLY
>> UNSUPPORTED error when it tries to spawn the slave. This is on Windows,
>> OpenMPI version 1.4.1, r22421. Here's the master code:
>>
>> int main(int argc, char* argv[])
>> {
>>   int myid, ierr;
>>   MPI_Comm maincomm;
>>   ierr = MPI_Init(&argc, &argv);
>>   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>>   if (myid == 0)
>>   {
>>     std::cout << "\n Hello from the boss " << myid;
>>     std::cout.flush();
>>   }
>>   MPI_Info* spawninfo;
>>   MPI_Info_create(spawninfo);
>>   MPI_Info_set(*spawninfo, "add-host", "127.0.0.1");
>>   if (myid == 0)
>>   {
>>     std::cout << "\n About to MPI_Comm_spawn." << myid;
>>     std::cout.flush();
>>   }
>>   MPI_Comm_spawn("slave.exe", MPI_ARGV_NULL, 1, *spawninfo, 0,
>>                  MPI_COMM_SELF, &maincomm, MPI_ERRCODES_IGNORE);
>>   if (myid == 0)
>>   {
>>     std::cout << "\n MPI_Comm_spawn successful." << myid;
>>     std::cout.flush();
>>   }
>>   ierr = MPI_Finalize();
>>   return 0;
>> }
>>
>> Here's the slave code:
>>
>> int main(int argc, char* argv[])
>> {
>>   int myid, ierr;
>>   MPI_Comm parent;
>>   ierr = MPI_Init(&argc, &argv);
>>   MPI_Comm_get_parent(&parent);
>>   if (parent == MPI_COMM_NULL)
>>   {
>>     std::cout << "\n No parent.";
>>   }
>>   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>>   std::cout << "\n Hello from a worker " << myid;
>>   std::cout.flush();
>>   ierr = MPI_Finalize();
>>   return 0;
>> }
>>
>> Also, this only starts up correctly if I kick it off with orterun.
>> Ideally I'd like to run it as "master.exe" and have it initialise the
>> MPI environment from there. Can anyone tell me what setup I need to do
>> that?
>>
>> Thanks in advance,
>> Damien
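As Shiqing points out, MPI_Info is already an opaque handle type, so it should be declared as a value and its address passed to MPI_Info_create, not declared as an uninitialized MPI_Info*. A minimal corrected fragment (a sketch, in C; not from the original thread, and the stream output is omitted) might look like:

```c
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm maincomm;
    MPI_Info spawninfo;          /* a value, not MPI_Info* */
    int myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    MPI_Info_create(&spawninfo); /* pass the handle's address */
    MPI_Info_set(spawninfo, "add-host", "127.0.0.1");

    /* Spawn one instance of the slave, passing the handle by value. */
    MPI_Comm_spawn("slave.exe", MPI_ARGV_NULL, 1, spawninfo, 0,
                   MPI_COMM_SELF, &maincomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&spawninfo);
    MPI_Finalize();
    return 0;
}
```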
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
A call to MPI_Init allows the MPI library to return any level of thread
support it chooses. This MPI 1.1 call does not let the application say what
it wants and does not let the implementation reply with what it can
guarantee.

If you are using only one MPI implementation and your code will never be run
on another implementation, you can check the user doc for that MPI
implementation and see if it says that you can do what you want to do.

If your application is single threaded, you can use MPI_Init and be
portable. If your application has threads, you should use MPI_Init_thread
and check the response, either to be portable or to make sure you get thread
safety from an MPI implementation that has both a thread safe and a thread
unsafe mode. A thread unsafe mode can skip some locking and maybe give
better performance.

The MPI standard wanted to allow for the possibility that some
implementation of MPI would not be able to tolerate threads in the
application at all, so SINGLE was one answer that could be returned. For
what you want to do, you need a library that can return FUNNELED or better.
If you ask for FUNNELED and the MPI implementation you are using returns
SINGLE, it is telling you that using OpenMP threads is not allowed. It does
not say why. It may work fine, but the implementation is telling you not to
do it and you cannot know if there are good reasons.

Consider a made up example: Imagine some system supports mutex lock/unlock
but with terrible performance. As a workaround, it offers a non-standard
substitute for malloc called st_malloc (single thread malloc) that does not
do locking. Normal malloc is also available and it does use locking. The
documentation for this system says that it is safe to use both st_malloc
and malloc in a single threaded application because the only difference is
that st_malloc skips the locking. It also warns that if one thread calls
st_malloc while another calls malloc, heap corruption is likely.
Next imagine that the MPI implementation uses st_malloc to provide best
performance and the OpenMP threads use malloc. When this particular MPI
implementation returns SINGLE, it really means SINGLE. If there is only one
application thread using regular malloc it is safe, but if there is a malloc
call on one thread while the main thread is in an MPI call, heap corruption
is likely.

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

From: Yuanyuan ZHANG
To: Open MPI Users
Date: 03/03/2010 07:34 PM
Subject: Re: [OMPI users] MPI_Init() and MPI_Init_thread()
Sent by: users-boun...@open-mpi.org

Hi guys,

Thanks for your help, but unfortunately I am still not clear.

> You are right Dave, FUNNELED allows the application to have multiple
> threads but only the main thread calls MPI.

My understanding is that even if I use SINGLE or MPI_Init, I can still have
multiple threads if I use the OpenMP PARALLEL directive, and only the main
thread makes MPI calls. Am I correct?

> An OpenMP/MPI hybrid program that makes MPI calls only in between parallel
> sections is usually a FUNNELED user of MPI

For an OpenMP/MPI hybrid program, if I only want to make MPI calls using
the main thread, ie., only in between parallel sections, can I just use
SINGLE or MPI_Init? What's the benefit of FUNNELED?

Thanks.
Re: [OMPI users] Limit to number of processes on one node?
On Mar 4, 2010, at 7:51 AM, Prentice Bisbal wrote:
> Ralph Castain wrote:
>> On Mar 4, 2010, at 7:27 AM, Prentice Bisbal wrote:
>>> Ralph Castain wrote:
>>>> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:
>>>>> Eugene Loh wrote:
>>>>>> Prentice Bisbal wrote:
>>>>>>> Eugene Loh wrote:
>>>>>>>> Prentice Bisbal wrote:
>>>>>>>>> Is there a limit on how many MPI processes can run on a single
>>>>>>>>> host?
>>>>>>>> Depending on which OMPI release you're using, I think you need
>>>>>>>> something like 4*np up to 7*np (plus a few) descriptors. So, with
>>>>>>>> 256, you need 1000+ descriptors. You're quite possibly up against
>>>>>>>> your limit, though I don't know for sure that that's the problem
>>>>>>>> here.
>>>>>>>>
>>>>>>>> You say you're running 1.2.8. That's "a while ago", so would you
>>>>>>>> consider updating as a first step? Among other things, newer OMPIs
>>>>>>>> will generate a much clearer error message if the descriptor limit
>>>>>>>> is the problem.
>>>>>>> While 1.2.8 might be "a while ago", upgrading software just because
>>>>>>> it's "old" is not a valid argument.
>>>>>>>
>>>>>>> I can install the latest version of OpenMPI, but it will take a
>>>>>>> little while.
>>>> Maybe not because it is "old", but Eugene is correct. The old versions
>>>> of OMPI required more file descriptors than the newer versions.
>>>>
>>>> That said, you'll still need a minimum of 4x the number of procs on the
>>>> node even with the latest release. I suggest talking to your sys admin
>>>> about getting the limit increased. It sounds like it has been set
>>>> unrealistically low.
>>> I *am* the system admin! ;)
>>>
>>> The file descriptor limit is the default for RHEL, 1024, so I would not
>>> characterize it as "unrealistically low". I assume someone with much
>>> more knowledge of OS design and administration than me came up with
>>> this default, so I'm hesitant to change it without good reason. If
>>> there was good reason, I'd have no problem changing it. I have read
>>> that setting it to more than 8192 can lead to system instability.
>> Never heard that, and most HPC systems have it set a great deal higher
>> without trouble.
> I just read that the other day. Not sure where, though. Probably a forum
> posting somewhere. I'll take your word for it that it's safe to increase
> if necessary.
>> However, the choice is yours. If you have a large SMP system, you'll
>> eventually be forced to change it or severely limit its usefulness for
>> MPI. RHEL sets it that low arbitrarily as a way of saving memory by
>> keeping the fd table small, not because the OS can't handle it.
>>
>> Anyway, that is the problem. Nothing we (or any MPI) can do about it as
>> the fd's are required for socket-based communications and to forward
>> I/O.
> Thanks, Ralph, that's exactly the answer I was looking for - where this
> limit was coming from.
>
> I can see how on a large SMP system the fd limit would have to be
> increased. In normal circumstances, my cluster nodes should never have
> more than 8 MPI processes running at once (per node), so I shouldn't be
> hitting that limit on my cluster.

Ah, okay! That helps a great deal in figuring out what to advise you. In
your earlier note, it sounded like you were running all 512 procs on one
node, so I assumed you had a large single-node SMP.

In this case, though, the problem is solely that you are using the 1.2
series. In that series, mpirun and each process opened many more sockets to
all processes in the job. That's why you are overrunning your limit.

Starting with 1.3, the number of sockets being opened on each node is only
3 times the number of procs on the node, plus a couple for the daemon. If
you are using TCP for MPI communications, then each MPI connection will
open another socket as these messages are direct and not routed.

Upgrading to the 1.4 series should resolve the problem you saw.

HTH
Ralph

> This is admittedly an unusual situation - in normal use, no one would
> ever want to run that many processes on a single system - so I don't see
> any justification for modifying that setting.
>
> Yesterday I spoke to the researcher who originally asked me about this
> limit - he just wanted to know what the limit was, and doesn't actually
> plan to do any "real" work with that many processes on a single node,
> rendering this whole discussion academic.
>
> I did install OpenMPI 1.4.1 yesterday, but I haven't had a chance to test
> it yet. I'll post the results of testing here.
>>>>> I have a user trying to test his code on the command-line on a single
>>>>> host before running it on our cluster like so:
>>>>>
>>>>> mpirun -np X foo
>>>>>
>>>>> When he tries to run it on a large number of processes (X = 256,
>>>>> 512), the program fails, and I can reproduce this with a simple
>>>>> "Hello, World" program:
>>>>>
>>>>> $ mpirun -np 256 mpihello
Re: [OMPI users] Limit to number of processes on one node?
Ralph Castain wrote:
> On Mar 4, 2010, at 7:27 AM, Prentice Bisbal wrote:
>> Ralph Castain wrote:
>>> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:
>>>> Eugene Loh wrote:
>>>>> Prentice Bisbal wrote:
>>>>>> Eugene Loh wrote:
>>>>>>> Prentice Bisbal wrote:
>>>>>>>> Is there a limit on how many MPI processes can run on a single
>>>>>>>> host?
>>>>>>> Depending on which OMPI release you're using, I think you need
>>>>>>> something like 4*np up to 7*np (plus a few) descriptors. So, with
>>>>>>> 256, you need 1000+ descriptors. You're quite possibly up against
>>>>>>> your limit, though I don't know for sure that that's the problem
>>>>>>> here.
>>>>>>>
>>>>>>> You say you're running 1.2.8. That's "a while ago", so would you
>>>>>>> consider updating as a first step? Among other things, newer OMPIs
>>>>>>> will generate a much clearer error message if the descriptor limit
>>>>>>> is the problem.
>>>>>> While 1.2.8 might be "a while ago", upgrading software just because
>>>>>> it's "old" is not a valid argument.
>>>>>>
>>>>>> I can install the latest version of OpenMPI, but it will take a
>>>>>> little while.
>>> Maybe not because it is "old", but Eugene is correct. The old versions
>>> of OMPI required more file descriptors than the newer versions.
>>>
>>> That said, you'll still need a minimum of 4x the number of procs on the
>>> node even with the latest release. I suggest talking to your sys admin
>>> about getting the limit increased. It sounds like it has been set
>>> unrealistically low.
>> I *am* the system admin! ;)
>>
>> The file descriptor limit is the default for RHEL, 1024, so I would not
>> characterize it as "unrealistically low". I assume someone with much
>> more knowledge of OS design and administration than me came up with this
>> default, so I'm hesitant to change it without good reason. If there was
>> good reason, I'd have no problem changing it. I have read that setting
>> it to more than 8192 can lead to system instability.
> Never heard that, and most HPC systems have it set a great deal higher
> without trouble.

I just read that the other day. Not sure where, though. Probably a forum
posting somewhere. I'll take your word for it that it's safe to increase if
necessary.

> However, the choice is yours. If you have a large SMP system, you'll
> eventually be forced to change it or severely limit its usefulness for
> MPI. RHEL sets it that low arbitrarily as a way of saving memory by
> keeping the fd table small, not because the OS can't handle it.
>
> Anyway, that is the problem. Nothing we (or any MPI) can do about it as
> the fd's are required for socket-based communications and to forward I/O.

Thanks, Ralph, that's exactly the answer I was looking for - where this
limit was coming from.

I can see how on a large SMP system the fd limit would have to be
increased. In normal circumstances, my cluster nodes should never have more
than 8 MPI processes running at once (per node), so I shouldn't be hitting
that limit on my cluster.

This is admittedly an unusual situation - in normal use, no one would ever
want to run that many processes on a single system - so I don't see any
justification for modifying that setting.

Yesterday I spoke to the researcher who originally asked me about this
limit - he just wanted to know what the limit was, and doesn't actually
plan to do any "real" work with that many processes on a single node,
rendering this whole discussion academic.

I did install OpenMPI 1.4.1 yesterday, but I haven't had a chance to test
it yet. I'll post the results of testing here.

>>>> I have a user trying to test his code on the command-line on a single
>>>> host before running it on our cluster like so:
>>>>
>>>> mpirun -np X foo
>>>>
>>>> When he tries to run it on a large number of processes (X = 256, 512),
>>>> the program fails, and I can reproduce this with a simple "Hello,
>>>> World" program:
>>>>
>>>> $ mpirun -np 256 mpihello
>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>> exited on signal 15 (Terminated).
>>>> 252 additional processes aborted (not shown)
>>>>
>>>> I've done some testing and found that X < 155 for this program to
>>>> work. Is this a bug, part of the standard, or a design/implementation
>>>> decision?
>>> One possible issue is the limit on the number of descriptors. The error
>>> message should be pretty helpful and descriptive, but perhaps you're
>>> using an older version of OMPI. If this is your problem, one workaround
>>> is something like this:
>>>
>>> unlimit descriptors
>>> mpirun -np 256 mpihello
>> Looks like I'm not allowed to set that as a regular user:
>>
>> $ ulimit -n 2048
>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>
>> Since I am the admin, I could change that elsewhere, but I'd rather not
>> do that system-wide unless absolutely necessary.
Re: [OMPI users] Limit to number of processes on one node?
On Mar 4, 2010, at 7:27 AM, Prentice Bisbal wrote: > > > Ralph Castain wrote: >> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote: >> >>> Eugene Loh wrote: Prentice Bisbal wrote: > Eugene Loh wrote: > >> Prentice Bisbal wrote: >> >>> Is there a limit on how many MPI processes can run on a single host? >>> Depending on which OMPI release you're using, I think you need something like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need 1000+ descriptors. You're quite possibly up against your limit, though I don't know for sure that that's the problem here. You say you're running 1.2.8. That's "a while ago", so would you consider updating as a first step? Among other things, newer OMPIs will generate a much clearer error message if the descriptor limit is the problem. >>> While 1.2.8 might be "a while ago", upgrading software just because it's >>> "old" is not a valid argument. >>> >>> I can install the lastest version of OpenMPI, but it will take a little >>> while. >> >> Maybe not because it is "old", but Eugene is correct. The old versions of >> OMPI required more file descriptors than the newer versions. >> >> That said, you'll still need a minimum of 4x the number of procs on the node >> even with the latest release. I suggest talking to your sys admin about >> getting the limit increased. It sounds like it has been set unrealistically >> low. >> >> > I *am* the system admin! ;) > > The file descriptor limit is the default for RHEL, 1024, so I would not > characterize it as "unrealistically low". I assume someone with much > more knowledge of OS design and administration than me came up with this > default, so I'm hesitant to change it without good reason. If there was > good reason, I'd have no problem changing it. I have read that setting > it to more than 8192 can lead to system instability. Never heard that, and most HPC systems have it set a great deal higher without trouble. However, the choice is yours. 
If you have a large SMP system, you'll eventually be forced to change it or severely limit its usefulness for MPI. RHEL sets it that low arbitrarily as a way of saving memory by keeping the fd table small, not because the OS can't handle it. Anyway, that is the problem. Nothing we (or any MPI) can do about it, as the fd's are required for socket-based communications and to forward I/O.

> This is an admittedly unusual situation - in normal use, no one would ever want to run that many processes on a single system - so I don't see any justification for modifying that setting.
>
> Yesterday I spoke to the researcher who originally asked me about this limit - he just wanted to know what the limit was, and doesn't actually plan to do any "real" work with that many processes on a single node, rendering this whole discussion academic.
>
> I did install OpenMPI 1.4.1 yesterday, but I haven't had a chance to test it yet. I'll post the results of testing here.
>
>>> I have a user trying to test his code on the command-line on a single host before running it on our cluster like so:
>>>
>>> mpirun -np X foo
>>>
>>> When he tries to run it on a large number of processes (X = 256, 512), the program fails, and I can reproduce this with a simple "Hello, World" program:
>>>
>>> $ mpirun -np 256 mpihello
>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu exited on signal 15 (Terminated).
>>> 252 additional processes aborted (not shown)
>>>
>>> I've done some testing and found that X < 155 for this program to work. Is this a bug, part of the standard, or a design/implementation decision?
>>
>> One possible issue is the limit on the number of descriptors. The error message should be pretty helpful and descriptive, but perhaps you're using an older version of OMPI.
>> If this is your problem, one workaround is something like this:
>>
>> unlimit descriptors
>> mpirun -np 256 mpihello
>
> Looks like I'm not allowed to set that as a regular user:
>
> $ ulimit -n 2048
> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>
> Since I am the admin, I could change that elsewhere, but I'd rather not do that system-wide unless absolutely necessary.
>
>> though I guess the syntax depends on what shell you're running. Another is to set the MCA parameter opal_set_max_sys_limits to 1.
>
> That didn't work either:
>
> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu exited on signal 15 (Terminated).
> 252 additional processes aborted (not shown)
>
> --
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
Re: [OMPI users] MPI_Init() and MPI_Init_thread()
On Thursday 04 March 2010 01:32:39 Yuanyuan ZHANG wrote:

> Hi guys,
>
> Thanks for your help, but unfortunately I am still not clear.
>
>> You are right Dave, FUNNELED allows the application to have multiple threads, but only the main thread calls MPI.
>
> My understanding is that even if I use SINGLE or MPI_Init, I can still have multiple threads if I use the OpenMP PARALLEL directive, and only the main thread makes MPI calls. Am I correct?

Actually, if your application asks for MPI_THREAD_SINGLE, it promises that it won't use any threads, so you shouldn't use OpenMP threads. If you asked for SINGLE and use threads (even if only the main thread calls MPI), the behavior is unspecified (we can imagine that some MPI implementation cannot support any threads within the process for some reason).

>> An OpenMP/MPI hybrid program that makes MPI calls only in between parallel sections is usually a FUNNELED user of MPI.
>
> For an OpenMP/MPI hybrid program, if I only want to make MPI calls using the main thread, i.e., only in between parallel sections, can I just use SINGLE or MPI_Init? What's the benefit of FUNNELED?

Asking for the FUNNELED thread-safety level informs the MPI library that your application is going to run threads. If you ask for SINGLE, then you say "I promise, I won't use any threads". The difference may be unclear from an application developer's point of view, but it is important from the MPI library's point of view. However, I guess most modern MPI libraries support FUNNELED by default, and the performance should be the same for FUNNELED and SINGLE.

Francois Trahay
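The distinction Francois draws can be made concrete in code: a FUNNELED program declares its intent with MPI_Init_thread and checks the `provided` level the library actually grants. This is a minimal sketch only — it assumes an MPI installation and a launch via mpirun, and reduces error handling to an abort.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Declare that threads will exist, but that only the main thread
       will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library only provides thread level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... #pragma omp parallel regions go here; MPI calls happen only
       between them (or, within them, only from the main thread) ... */

    MPI_Finalize();
    return 0;
}
```

With MPI_THREAD_SINGLE instead of FUNNELED, the same OpenMP regions would technically make the program's behavior unspecified, as explained above.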
Re: [OMPI users] noob warning - problems testing MPI_Comm_spawn
Hi Damien,

Sorry for the late reply, I was trying to dig inside the code and got some information. First of all, in your example, it's not correct to define the MPI_Info as a pointer; it will cause a memory access violation at run time. The message "LOCAL DAEMON SPAWN IS CURRENTLY UNSUPPORTED" is just a warning which won't block the execution.

In order to make the master-slave example work, you have to disable the CCP support; it seems to have conflicts with the comm_spawn operation, I'm still checking it. To disable CCP in Open MPI 1.4.1, you have to exclude the source files manually, i.e. excluding the ccp files in orte/mca/plm/ccp and orte/mca/ras/ccp, then also remove the ccp-related lines in orte/mca/plm/base/static-components.h and orte/mca/ras/base/static-components.h. There is an option to do so in the trunk version, but not for 1.4.1. Sorry for the inconvenience.

For the "singleton" run with master.exe, it's still not working under Windows.

Best Regards,
Shiqing

Damien Hocking wrote:

Hi all,

I'm playing around with MPI_Comm_spawn, trying to do something simple with a master-slave example. I get a LOCAL DAEMON SPAWN IS CURRENTLY UNSUPPORTED error when it tries to spawn the slave. This is on Windows, OpenMPI version 1.4.1, r22421. Here's the master code:

int main(int argc, char* argv[])
{
    int myid, ierr;
    MPI_Comm maincomm;
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) { std::cout << "\n Hello from the boss " << myid; std::cout.flush(); }
    MPI_Info* spawninfo;
    MPI_Info_create(spawninfo);
    MPI_Info_set(*spawninfo, "add-host", "127.0.0.1");
    if (myid == 0) { std::cout << "\n About to MPI_Comm_spawn." << myid; std::cout.flush(); }
    MPI_Comm_spawn("slave.exe", MPI_ARGV_NULL, 1, *spawninfo, 0, MPI_COMM_SELF, &maincomm, MPI_ERRCODES_IGNORE);
    if (myid == 0) { std::cout << "\n MPI_Comm_spawn successful."
<< myid; std::cout.flush(); }
    ierr = MPI_Finalize();
    return 0;
}

Here's the slave code:

int main(int argc, char* argv[])
{
    int myid, ierr;
    MPI_Comm parent;
    ierr = MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) { std::cout << "\n No parent."; }
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    std::cout << "\n Hello from a worker " << myid;
    std::cout.flush();
    ierr = MPI_Finalize();
    return 0;
}

Also, this only starts up correctly if I kick it off with orterun. Ideally I'd like to run it as "master.exe" and have it initialise the MPI environment from there. Can anyone tell me what setup I need to do that?

Thanks in advance,

Damien
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Shiqing Fan                     http://www.hlrs.de/people/fan
High Performance Computing      Tel.: +49 711 685 87234
Center Stuttgart (HLRS)         Fax.: +49 711 685 65832
Address: Allmandring 30         email: f...@hlrs.de
70569 Stuttgart
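Shiqing's first point is the immediate crash: MPI_Info_create expects the address of an MPI_Info handle, but the master declares an uninitialized `MPI_Info*` and passes it in, so the create writes through a wild pointer. A corrected master, reduced to the spawn-relevant lines (a sketch, not a tested program — it still needs the CCP workaround described above to actually run on Windows 1.4.1):

```c
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm maincomm;
    MPI_Info spawninfo;               /* a handle, not a pointer */

    MPI_Init(&argc, &argv);
    MPI_Info_create(&spawninfo);      /* pass the handle's address */
    MPI_Info_set(spawninfo, "add-host", "127.0.0.1");

    MPI_Comm_spawn("slave.exe", MPI_ARGV_NULL, 1, spawninfo, 0,
                   MPI_COMM_SELF, &maincomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&spawninfo);        /* release the handle once spawn returns */
    MPI_Finalize();
    return 0;
}
```

MPI_Info is an opaque handle type; create, set, and free all take the handle (or its address), never a dereferenced pointer to uninitialized memory.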
Re: [OMPI users] Limit to number of processes on one node?
Ralph Castain wrote:

> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:
>
>> Eugene Loh wrote:
>>> Prentice Bisbal wrote:
>>>> Eugene Loh wrote:
>>>>> Prentice Bisbal wrote:
>>>>>
>>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>>
>>>>> Depending on which OMPI release you're using, I think you need something like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need 1000+ descriptors. You're quite possibly up against your limit, though I don't know for sure that that's the problem here.
>>>>>
>>>>> You say you're running 1.2.8. That's "a while ago", so would you consider updating as a first step? Among other things, newer OMPIs will generate a much clearer error message if the descriptor limit is the problem.
>>>>
>>>> While 1.2.8 might be "a while ago", upgrading software just because it's "old" is not a valid argument.
>>>>
>>>> I can install the latest version of OpenMPI, but it will take a little while.
>
> Maybe not because it is "old", but Eugene is correct. The old versions of OMPI required more file descriptors than the newer versions.
>
> That said, you'll still need a minimum of 4x the number of procs on the node even with the latest release. I suggest talking to your sys admin about getting the limit increased. It sounds like it has been set unrealistically low.

I *am* the system admin! ;)

The file descriptor limit is the default for RHEL, 1024, so I would not characterize it as "unrealistically low". I assume someone with much more knowledge of OS design and administration than me came up with this default, so I'm hesitant to change it without good reason. If there was good reason, I'd have no problem changing it. I have read that setting it to more than 8192 can lead to system instability.

This is an admittedly unusual situation - in normal use, no one would ever want to run that many processes on a single system - so I don't see any justification for modifying that setting.
Yesterday I spoke to the researcher who originally asked me about this limit - he just wanted to know what the limit was, and doesn't actually plan to do any "real" work with that many processes on a single node, rendering this whole discussion academic.

I did install OpenMPI 1.4.1 yesterday, but I haven't had a chance to test it yet. I'll post the results of testing here.

>> I have a user trying to test his code on the command-line on a single host before running it on our cluster like so:
>>
>> mpirun -np X foo
>>
>> When he tries to run it on a large number of processes (X = 256, 512), the program fails, and I can reproduce this with a simple "Hello, World" program:
>>
>> $ mpirun -np 256 mpihello
>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu exited on signal 15 (Terminated).
>> 252 additional processes aborted (not shown)
>>
>> I've done some testing and found that X < 155 for this program to work. Is this a bug, part of the standard, or a design/implementation decision?
>
> One possible issue is the limit on the number of descriptors. The error message should be pretty helpful and descriptive, but perhaps you're using an older version of OMPI. If this is your problem, one workaround is something like this:
>
> unlimit descriptors
> mpirun -np 256 mpihello

Looks like I'm not allowed to set that as a regular user:

$ ulimit -n 2048
-bash: ulimit: open files: cannot modify limit: Operation not permitted

Since I am the admin, I could change that elsewhere, but I'd rather not do that system-wide unless absolutely necessary.

> though I guess the syntax depends on what shell you're running. Another is to set the MCA parameter opal_set_max_sys_limits to 1.

That didn't work either:

$ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu exited on signal 15 (Terminated).
252 additional processes aborted (not shown)

--
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
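For reference, the "elsewhere" Prentice mentions is usually pam_limits on RHEL-family systems: per-user limits are set in /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/) rather than in a shell profile. A sketch — the values below are purely illustrative, not a recommendation from this thread:

```
# /etc/security/limits.conf fragment: raise the open-files limit
# for one user instead of system-wide (values are examples only)
prentice    soft    nofile    4096
prentice    hard    nofile    8192
```

The soft limit is what `ulimit -n` reports and what a user may raise on their own, but only up to the hard limit, which is why the `ulimit -n 2048` attempt above failed under the default 1024/1024 settings.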
Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
Hi,

I have some new discovery about this problem: it seems that the array size sendable from a 32-bit to a 64-bit machine is proportional to the parameter "btl_tcp_eager_limit". When I set it to 200 000 000 (2e08 bytes, about 190MB), I can send an array of up to 2e07 doubles (152MB). I didn't find much information about btl_tcp_eager_limit other than in the "ompi_info --all" command. If I leave it at 2e08, will it impact the performance of OpenMPI?

It may be noteworthy also that if the master (rank 0) is a 32-bit machine, I don't get the segfault. I can send a big array with a small "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one. Do I have to move this thread to the devel mailing list?

Regards,
TMHieu

On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu wrote:
> Hello,
>
> Yes, I compiled OpenMPI with --enable-heterogeneous. More precisely I compiled with:
> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous --enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default
> $ make all install
>
> I attach the output of ompi_info of my 2 machines.
>
> TMHieu
>
> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres wrote:
>> Did you configure Open MPI with --enable-heterogeneous?
>>
>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>
>>> Hello,
>>>
>>> I have some problems running MPI on my heterogeneous cluster. More precisely I got a segmentation fault when sending a large array (about 1) of double from an i686 machine to an x86_64 machine. It does not happen with small arrays.
Here is the send/recv code source (complete source is in the attached file):
>>>
>>> code
>>> if (me == 0) {
>>>     for (int pe = 1; pe < nprocs; pe++) {
>>>         printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>         d = (double *)malloc(sizeof(double)*n);
>>>         MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>         printf("OK\n"); fflush(stdout);
>>>     }
>>>     printf("All done.\n");
>>> }
>>> else {
>>>     d = (double *)malloc(sizeof(double)*n);
>>>     MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>> }
>>> code
>>>
>>> I got segmentation fault with n=1 but no error with n=1000.
>>>
>>> I have 2 machines:
>>> sbtn155 : Intel Xeon, x86_64
>>> sbtn211 : Intel Pentium 4, i686
>>>
>>> The code is compiled on the x86_64 and i686 machines, using OpenMPI 1.4.1, installed in /tmp/openmpi:
>>>
>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
>>>
>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>> /tmp/openmpi/bin/mpicc -Wall -I.
-std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
>>>
>>> I run the code using an appfile and got these errors:
>>>
>>> $ cat appfile
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn211 -np 1 hetero.i686
>>>
>>> $ mpirun -hetero --app appfile
>>> Input array length :
>>> 1
>>> Receiving from proc 1 : OK
>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d7908]
>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2e2fc6e3]
>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2afe39db]
>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2afd8b9e]
>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d4b25]
>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2ab30f9b]
>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>> [sbtn155:26386] *** End of error message ***
>>> --
>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155 exited on signal 11 (Segmentation fault).
>>> --
>>>
>>> Am I missing an option in order to run on a heterogeneous cluster? Do MPI_Send/Recv have a limited array size when using a heterogeneous cluster? Thanks for your help. Regards
>>>
>>> --
>>> M. TRINH Minh Hieu
>>> CEA, IBEB, SBTN/LIRM,
>>> F-30207 Bagnols-sur-Cèze, FRANCE
>>> ==
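TMHieu's observation — raising btl_tcp_eager_limit lets larger arrays through — suggests the crash is in the non-eager (rendezvous) path on the heterogeneous link. One application-level workaround, offered here purely as an untested sketch and not something from the thread itself, is to split big transfers into pieces small enough to stay on the eager path; `CHUNK`, `send_chunked`, and `recv_chunked` are names invented for this example.

```c
#include <mpi.h>

/* Keep each piece well under btl_tcp_eager_limit so every fragment is
   sent eagerly; 4096 doubles = 32 KB is an illustrative chunk size. */
#define CHUNK 4096

static void send_chunked(const double *d, int n, int dest, int tag, MPI_Comm comm)
{
    for (int off = 0; off < n; off += CHUNK) {
        int len = (n - off < CHUNK) ? n - off : CHUNK;
        MPI_Send((void *)(d + off), len, MPI_DOUBLE, dest, tag, comm);
    }
}

static void recv_chunked(double *d, int n, int src, int tag, MPI_Comm comm)
{
    for (int off = 0; off < n; off += CHUNK) {
        int len = (n - off < CHUNK) ? n - off : CHUNK;
        MPI_Recv(d + off, len, MPI_DOUBLE, src, tag, comm, MPI_STATUS_IGNORE);
    }
}
```

Since MPI guarantees ordered delivery between a sender/receiver pair on matching tags, the chunks reassemble in order on the receive side; the cost is one envelope per chunk instead of one per message.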
Re: [OMPI users] checkpointing multi node and multi process applications
On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote:
> Is there anything I can do to provide more information about this bug? E.g. try to compile the code in the SVN trunk? I also have kept the snapshots intact, I can tar them up and upload them somewhere in case you guys need it. I can also provide the source code to the ring program, but it's really the canonical ring MPI example.

I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags). This time taking the checkpoint didn't generate any error message:

root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring
>>> Process 1 sending 2761 to 0
>>> Process 1 received 2760
>>> Process 1 sending 2760 to 0
root@debian1:~#

But restoring it did:

root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
[debian1:23129] Error: Unable to access the path [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename or provided an invalid filename. Please see --help for usage.
--
--
mpirun has exited due to process rank 1 with PID 23129 on node debian1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--
root@debian1:~#

Indeed, opal_snapshot_1.ckpt does not exist:

root@debian1:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
ompi_global_snapshot_23071.ckpt/restart-appfile
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
root@debian1:~#

It can be found on debian2:

root@debian2:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
root@debian2:~#

Then I tried supplying a hostfile to ompi-restart and it worked just fine! I thought the checkpoint included the hosts information? So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?

Thanks a bunch,