[OMPI users] Unable to run WRF on large core counts (1024+), queue pair error
I am trying to run WRF on 1024 cores with Open MPI 1.3.3 and 1.4. I can get the code to run with 512 cores, but it crashes at startup on 1024 cores with the following error messages:

[n172][[43536,1],0][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
[n172][[43536,1],0][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect

From Google, I have tried changing the settings for btl_openib_receive_queues, but my attempts have not worked. Here was my latest try to reduce the total number of queue pairs:

mpirun -np 1024 \
  -mca btl_openib_receive_queues P,128,2048,128,128:S,65536,256,192,128 \
  wrf.exe

These settings did not help. Am I looking in the right place?

System setup:
Centos-5.3
Ofed-1.4.1
Intel Compiler 11.1.038
Openmpi-1.3.3 and 1.4

Build options:
./configure CC=icc CXX=icpc F77=ifort F90=ifort FC=ifort --prefix=/opt/openmpi/1.3.3-intel --without-sge --with-openib --enable-io-romio --with-io-romio-flags=--with-file-system=lustre --with-pic

Thanks,
Craig
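[Editor's note on the flag being tuned: per-peer (P) receive queues allocate queue pairs per connected rank, so their memory cost grows with job size, while shared (S) receive queues are shared across all peers. A hedged sketch of a spec that leans only on shared queues — the sizes are illustrative guesses, not tuned values:]

```shell
# Sketch only -- queue sizes below are illustrative, not tuned values.
# P (per-peer) queues create QPs per connected rank, so at 1024 ranks
# they can exhaust registered memory; S (shared) queues scale better:
mpirun -np 1024 \
  --mca btl_openib_receive_queues S,4096,256,192,128:S,65536,256,192,128 \
  ./wrf.exe

# If the HCA/OFED stack supports XRC, "X,..." queues reduce the QP
# count further; list what your build accepts with:
ompi_info --param btl openib
```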
Re: [OMPI users] NetBSD OpenMPI - SGE - PETSc - PISM
On Dec 17, 2009, at 5:55 PM, wrote:

> I am happy to be able to inform you that the problems we were
> seeing would seem to have been arising down at the OpenMPI
> level.

Happy for *them*, at least. ;-)

> If I remove any acknowledgement of IPv6 within the OpenMPI
> code, then both the PETSc examples and PISM application
> have been seen to be running upon my initial 8-processor
> parallel environment when submitted as a Sun Grid Engine
> job.

Ok, that's good.

> I guess this means that the PISM and PETSc guys can "stand easy"
> whilst the OpenMPI community needs to follow up on why there's
> an "addr.sa_len=0" creeping through the interface inspection
> code (upon NetBSD at least) when it passes thru the various
> IPv6 stanzas.

Ok. We're still somewhat at a loss here, because we don't have any NetBSD to test on. :-( We're happy to provide any help that we can, and just like you, we'd love to see this problem resolved -- but NetBSD still isn't on any of our core competency lists. :-(

FWIW, we might want to move this discussion to the de...@open-mpi.org mailing list...

--
Jeff Squyres
jsquy...@cisco.com
[OMPI users] NetBSD OpenMPI - SGE - PETSc - PISM
A whole swathe of people have been made aware of the issues that arose as a result of a researcher here looking to run PISM, which sits on top of PETSc, which sits on top of OpenMPI.

I am happy to be able to inform you that the problems we were seeing would seem to have been arising down at the OpenMPI level. If I remove any acknowledgement of IPv6 within the OpenMPI code, then both the PETSc examples and the PISM application have been seen to be running upon my initial 8-processor parallel environment when submitted as a Sun Grid Engine job.

I guess this means that the PISM and PETSc guys can "stand easy" whilst the OpenMPI community needs to follow up on why there's an "addr.sa_len=0" creeping through the interface inspection code (upon NetBSD at least) when it passes thru the various IPv6 stanzas.

Thanks for all the feedback on this from all quarters,

Kevin
--
Kevin M. Buckley                          Room: CO327
School of Engineering and                 Phone: +64 4 463 5971
Computer Science
Victoria University of Wellington
New Zealand
Re: [OMPI users] Notifier Framework howto
That would be great!

On Dec 17, 2009, at 3:52 PM, Ralph Castain wrote:

> If it would help, I have time and am willing to add notifier calls to this area of the code base. You'll still get the errors shown here, as I always bury the notifier call behind the error check that surrounds these error messages to avoid impacting the critical path, but you would be able to get syslog messages (or whatever channel you choose) as well.
>
> Ralph
>
> On Dec 10, 2009, at 4:19 PM, Jeff Squyres wrote:
>
> > On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
> >
> >> I would like to try out the notifier framework; the problem is I am having trouble finding documentation for it. I am digging around the website and not finding much.
> >>
> >> Currently we have a problem where hosts are throwing up errors like:
> >> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection timed out (110)
> >
> > Yoinks. Any idea why this is happening?
> >
> >> We would like to be notified when this happens, so we can put time stamps on events going on on the network. Is this even possible with the framework? See, we don't show any interfaces coming up and down, or any errors on interfaces, so we are looking to isolate the problem more. Only the MPI library knows when this happens.
> >
> > It's not well documented. So let's start here...
> >
> > The first issue is that we currently only have notifier calls down in the openib BTL -- not any of the others. :-( We put it there because there were specific requests to be notified when IB links went down. We then used those as a request for comment from the community, asking "do you like this? do you want more?" We kinda got nothing back, and I'll admit that we kinda forgot about it -- and therefore never added notifier calls elsewhere in the code. :-\
> >
> > We designed the notifier in OMPI to be trivially easy to use throughout the code base -- it's just adding a single function call where the error occurs. Would you, perchance, be interested in adding any of these in the TCP BTL? I'd be happy to point you in the right direction... :-)
> >
> > After that, it's just a matter of enabling a notifier:
> >
> > mpirun --mca notifier syslog ...
> >
> > Each notifier has some MCA params that are fairly obvious -- use:
> >
> > ompi_info --param notifier all
> >
> > to see them. There's 3 notifier plugins:
> >
> > - command: execute any arbitrary command. It must run in finite (short) time. You use MCA params to set the command (we can pass some strings down to the command; see the ompi_info help string for more details), and set a timeout such that if the command runs for that many seconds without exiting, we'll kill it.
> >
> > - syslog: because it was simple to do -- we just output a string to the syslog.
> >
> > - twitter: because it was fun to do. ;-) Actually, the rationale was that you can tweet to a private feed and then slave an RSS reader to it to see if anything happens. It will need to be able to reach the general internet (i.e., twitter.com); proxies are not supported. Set your twitter username/password via MCA params.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] Notifier Framework howto
If it would help, I have time and am willing to add notifier calls to this area of the code base. You'll still get the errors shown here, as I always bury the notifier call behind the error check that surrounds these error messages to avoid impacting the critical path, but you would be able to get syslog messages (or whatever channel you choose) as well.

Ralph

On Dec 10, 2009, at 4:19 PM, Jeff Squyres wrote:

> On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
>
>> I would like to try out the notifier framework; the problem is I am having trouble finding documentation for it. I am digging around the website and not finding much.
>>
>> Currently we have a problem where hosts are throwing up errors like:
>> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection timed out (110)
>
> Yoinks. Any idea why this is happening?
>
>> We would like to be notified when this happens, so we can put time stamps on events going on on the network. Is this even possible with the framework? See, we don't show any interfaces coming up and down, or any errors on interfaces, so we are looking to isolate the problem more. Only the MPI library knows when this happens.
>
> It's not well documented. So let's start here...
>
> The first issue is that we currently only have notifier calls down in the openib BTL -- not any of the others. :-( We put it there because there were specific requests to be notified when IB links went down. We then used those as a request for comment from the community, asking "do you like this? do you want more?" We kinda got nothing back, and I'll admit that we kinda forgot about it -- and therefore never added notifier calls elsewhere in the code. :-\
>
> We designed the notifier in OMPI to be trivially easy to use throughout the code base -- it's just adding a single function call where the error occurs. Would you, perchance, be interested in adding any of these in the TCP BTL? I'd be happy to point you in the right direction... :-)
>
> After that, it's just a matter of enabling a notifier:
>
> mpirun --mca notifier syslog ...
>
> Each notifier has some MCA params that are fairly obvious -- use:
>
> ompi_info --param notifier all
>
> to see them. There's 3 notifier plugins:
>
> - command: execute any arbitrary command. It must run in finite (short) time. You use MCA params to set the command (we can pass some strings down to the command; see the ompi_info help string for more details), and set a timeout such that if the command runs for that many seconds without exiting, we'll kill it.
>
> - syslog: because it was simple to do -- we just output a string to the syslog.
>
> - twitter: because it was fun to do. ;-) Actually, the rationale was that you can tweet to a private feed and then slave an RSS reader to it to see if anything happens. It will need to be able to reach the general internet (i.e., twitter.com); proxies are not supported. Set your twitter username/password via MCA params.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
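[Editor's note: a hedged sketch of the command lines described above. The `mpirun --mca notifier syslog` and `ompi_info` invocations come straight from the message; the `notifier_command_*` parameter names in the last example are illustrative guesses, not verified — run `ompi_info --param notifier all` on your installation to see the real names.]

```shell
# Pick a notifier plugin at launch time (from the message above):
mpirun --mca notifier syslog -np 16 ./a.out

# List the MCA parameters each notifier plugin accepts:
ompi_info --param notifier all

# "command" plugin sketch -- the parameter names below are assumed,
# not verified; confirm them against the ompi_info output:
mpirun --mca notifier command \
       --mca notifier_command_cmd "/usr/local/bin/notify-admin" \
       --mca notifier_command_timeout 30 \
       -np 16 ./a.out
```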
Re: [OMPI users] error performing MPI_Comm_spawn
Will be in the 1.4 nightly tarball generated later tonight... Thanks again

Ralph

On Dec 17, 2009, at 4:07 AM, Marcia Cristina Cera wrote:

> very good news
> I will wait carefully for the release :)
>
> Thanks, Ralph
> márcia.
>
> On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain wrote:
> Ah crumb - I found the problem. Sigh.
>
> I actually fixed this in the trunk over 5 months ago when the problem first surfaced in my own testing, but it never came across to the stable release branch. The problem is that we weren't serializing the comm_spawn requests, and so the launch system gets confused over what has and hasn't completed launch. That's why it works fine on the trunk.
>
> I'm creating the 1.4 patch right now. Thanks for catching this. Old brain completely forgot until I started tracking the commit history and found my own footprints!
>
> Ralph
>
> On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote:
>
>> Hi Ralph,
>>
>> I am afraid I was a little hasty! I redid my tests with more care and got the same error also with 1.3.3 :-/ But in that version the error happens only after some successful executions, which is why I did not notice it before! Furthermore, I increased the number of levels of the tree (which means more concurrent dynamic process creations in the lower levels), and I never manage to execute without error unless I add the delay. Perhaps the problem might even be a race condition :(
>>
>> I tested with LAM/MPI 7.1.4, and at first it worked fine. I have worked with LAM for years, but I migrated to Open MPI last year since LAM will be discontinued...
>>
>> I think I can continue the development of my application with the delay added while I wait for a release... and leave the performance tests to be made in the future :)
>>
>> Thank you again Ralph,
>> márcia.
>>
>> On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain wrote:
>> Okay, I can replicate this.
>>
>> FWIW: your test program works fine with the OMPI trunk and 1.3.3. It only has a problem with 1.4. Since I can replicate it on multiple machines every single time, I don't think it is actually a race condition.
>>
>> I think someone made a change to the 1.4 branch that created a failure mode :-/
>>
>> Will have to get back to you on this - may take awhile, and won't be in the 1.4.1 release.
>>
>> Thanks for the replicator!
>>
>> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
>>
>>> Thank you, Ralph
>>>
>>> I will use the 1.3.3 for now... while waiting for a future fix release that breaks this race condition.
>>>
>>> márcia
>>>
>>> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain wrote:
>>> Looks to me like it is a race condition, and the timing between 1.3.3 and 1.4 is just enough to trip it. I can break the race, but it will have to be in a future fix release.
>>>
>>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>>>
>>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>>>
>>>> Hi,
>>>>
>>>> I intend to develop an application using MPI_Comm_spawn to create new MPI tasks (or processes) dynamically. The structure of the program is like a tree: each node creates 2 new ones until it reaches a predefined number of levels. I developed a small program to explain my problem, as can be seen in the attachment.
>>>>
>>>> -- start.c: launches (through MPI_Comm_spawn, in which the argv has the level value) the root of the tree (a ch_rec program). After the spawn, a message is sent to the child and the process blocks in an MPI_Recv.
>>>> -- ch_rec.c: gets its level value and receives the parent message; then, if its level is less than a predefined limit, it will create 2 children:
>>>> - set the level value;
>>>> - spawn 1 child;
>>>> - send a message;
>>>> - call an MPI_Irecv;
>>>> - repeat the 4 previous steps for the second child;
>>>> - call an MPI_Waitany waiting for the children's returns.
>>>> When the children's messages are received, the process sends a message to its parent and calls MPI_Finalize.
>>>>
>>>> Using the openmpi-1.3.3 version the program runs as expected, but with openmpi-1.4 I get the following error:
>>>>
>>>> $ mpirun -np 1 start
>>>> level 0
>>>> level = 1
>>>> Parent sent: level 0 (pid:4279)
>>>> level = 2
>>>> Parent sent: level 1 (pid:4281)
>>>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
>>>>
>>>> The error happens when my program tries to launch the second child immediately after the first spawn call. In my tests I tried putting a sleep of 2 seconds between the first and the second spawn, and then the program runs as expected.
>>>>
>>>> Can someone help me with this version-1.4 bug?
>>>>
>>>> thanks,
>>>> márcia.
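[Editor's note: the tree node described above — spawn two children, hand each its level, wait on both — can be sketched roughly as below. This is an illustrative reconstruction from the description, not the poster's attached program; it needs an Open MPI installation to build and launch, uses MPI_Waitall where the post describes MPI_Waitany, and omits error handling and the final report back to the parent.]

```c
/* Sketch of one ch_rec tree node (reconstruction, not the attached code).
 * Build with mpicc, launch the root with: mpirun -np 1 ./ch_rec 1 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LEVEL 3

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* The spawner passes the child's level as its first argument. */
    int level = (argc > 1) ? atoi(argv[1]) : 0;
    printf("level = %d\n", level);

    if (level < MAX_LEVEL) {
        MPI_Comm    child[2];
        MPI_Request req[2];
        int         done[2];
        char lvl[16];
        snprintf(lvl, sizeof lvl, "%d", level + 1);
        char *child_argv[] = { lvl, NULL };

        for (int i = 0; i < 2; i++) {
            /* The thread's workaround was a sleep(2) here, between
             * the two spawn calls, on the affected 1.4 release. */
            MPI_Comm_spawn("ch_rec", child_argv, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &child[i], MPI_ERRCODES_IGNORE);
            MPI_Send(&level, 1, MPI_INT, 0, 0, child[i]);          /* greet child */
            MPI_Irecv(&done[i], 1, MPI_INT, 0, 0, child[i], &req[i]);
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* wait for both children */
    }
    /* ...leaf/finished nodes would report back via MPI_Comm_get_parent()... */
    MPI_Finalize();
    return 0;
}
```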
Re: [OMPI users] MPI_Comm_spawn lots of times
Ok, I'll give it a try. Thanks, nick On Thu, Dec 17, 2009 at 12:44, Ralph Castain wrote: > In case you missed it, this patch should be in the 1.4 nightly tarballs - > feel free to test and let me know what you find. > > Thanks > Ralph > > On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote: > > That was quick. I will try the patch as soon as you release it. > > nick > > > On Wed, Dec 2, 2009 at 21:06, Ralph Castain wrote: > >> Patch is built and under review... >> >> Thanks again >> Ralph >> >> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote: >> >> Thanks >> >> On Wed, Dec 2, 2009 at 17:04, Ralph Castain wrote: >> >>> Yeah, that's the one all right! Definitely missing from 1.3.x. >>> >>> Thanks - I'll build a patch for the next bug-fix release >>> >>> >>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote: >>> >>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain >>> wrote: >>> >> Indeed - that is very helpful! Thanks! >>> >> Looks like we aren't cleaning up high enough - missing the directory >>> level. >>> >> I seem to recall seeing that error go by and that someone fixed it on >>> our >>> >> devel trunk, so this is likely a repair that didn't get moved over to >>> the >>> >> release branch as it should have done. >>> >> I'll look into it and report back. >>> > >>> > You are probably referring to >>> > https://svn.open-mpi.org/trac/ompi/changeset/21498 >>> > >>> > There was an issue about orte_session_dir_finalize() not >>> > cleaning up the session directories properly. >>> > >>> > Hope that helps. >>> > >>> > Abhishek >>> > >>> >> Thanks again >>> >> Ralph >>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote: >>> >> >>> >> >>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain wrote: >>> >>> >>> >>> Hmmif you are willing to keep trying, could you perhaps let it >>> run for >>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a >>> process >>> >>> that has already terminated? 
The pids will be in order, so just look >>> for an >>> >>> early number (not mpirun or the parent, of course). >>> >>> It would help if you could give us the contents of a directory from a >>> >>> child process that has terminated - would tell us what subsystem is >>> failing >>> >>> to properly cleanup. >>> >> >>> >> Ok, so I Ctrl-Z the master. In >>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one >>> >> directory >>> >> >>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 >>> >> >>> >> I can't find that PID though. mpirun has PID 4230, orted does not >>> exist, >>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it >>> again, >>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, >>> there >>> >> are 70 sequentially numbered directories starting at 0. Every >>> directory >>> >> contains another directory called "0". There is nothing in any of >>> those >>> >> directories. I see for instance: >>> >> >>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70 >>> >> total 4.0K >>> >> drwx-- 2 nbock users 4.0K Dec 2 14:41 0 >>> >> >>> >> and >>> >> >>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls >>> -lh >>> >> 70/0/ >>> >> total 0 >>> >> >>> >> I hope this information helps. Did I understand your question >>> correctly? 
>>> >> nick
Re: [OMPI users] [visit-developers] /usr/bin/ld: cannot find -lrdmacm on 9184
Simon Su writes: > Hi Tom, > > "hello world" MPI program also won't compile if > librdmacm-devel-1.0.8-5.el5 is not installed. I have asked the person > who maintain the openmpi package on how they were compiled. My guess > is librdmacm-devel-1.0.8-5.el5 may need to be added as dependency > package for openmpi010208-gcc-devel-1.2.8-8.cses.5.PU_IAS.5 package > (where I got my openmpi installation) to solve the problem and to > verify that we have the correct openmpi compilation. Yes, I agree, from this message && your last (to OMPI: Simon mentioned that the issue goes away if he installs librdmacm-devel), it sounds like librdmacm-devel is a "build-dep", but the openmpi-devel package needs it as a "dep". Anyway, I'll leave it up to you to forward the error/conclusion to whomever your upstream is. Thanks for digging into this, -tom > On Wed, Dec 16, 2009 at 5:45 PM, tom fogal wrote: > > > Hi, > > > > Jeff sent this reply to our inquiry yesterday. > > > > Simon -- can you give it a read? In particular, validating you've got > > the right mpic++ sounds like a good idea. We're also curious if a > > simple "hello world" MPI program can link using gcc + the flags from > > mpic++ -show. > > > > -tom > > > > --- Forwarded Message > > > > From: Jeff Squyres > > In-Reply-To: > > Date: Wed, 16 Dec 2009 17:20:35 -0500 > > To: "Open MPI Users" > > Cc: VisIt Developers > > Subject: Re: [OMPI users] [visit-developers] /usr/bin/ld: cannot > >find-lrdmacm on 9184 > > > > It depends on how Open MPI was built. > > > > If Open MPI was built without plugins (i.e., all the plugins are slurped up > > into libmpi and friends), then yes, applications need to link against > > librdmacm to use the RDMA CM mode of OpenFabrics transport. > > > > If Open MPI was built with plugins (which is the default), then apps don't > > need to link against librdmacm because the only use of rdmacm is in an Open > > MPI plugin, and that plugin was linked against librdmacm. > > > > Make sense? 
> >
> > That being said, the output from mpic++ --showme should give you something that is directly compile-/link-able. So it is odd if mpic++ is showing you something that can't (or shouldn't?) be done. Did a -L argument get lost somewhere, perchance?
> >
> > Does linking MPI applications with mpic++ work properly, or does it result in the same error? If it results in the same error, then perhaps something has changed since Open MPI was installed...?
> >
> > All this being said, two other random points:
> >
> > 1. Ensure that you're using the "right" mpic++. I.e., make sure it matches the version/installation of Open MPI that you're trying to use.
> >
> > 2. If you don't link with librdmacm, you're probably not losing any important functionality unless you have an iWarp-based cluster (that's the only transport that *needs* librdmacm). IB-based networks can use librdmacm, but don't *need* it (it's only used for making initial connections, so using librdmacm or not has no implications on overall MPI performance). It's still odd that mpic++ wants it and it can't be found, though...
> >
> > Does that help?
> >
> > On Dec 15, 2009, at 11:11 PM, tom fogal wrote:
> >
> > > Simon Su writes:
> > > > Hi Tom,
> > > >
> > > > I am using the standard openmpi package that runs on all the cluster machines here at Princeton. So maybe I shouldn't touch openmpi. But removing -lrdmacm from the MPI_LIBS line in the machinename.conf file worked. Any implication from doing this?
> > >
> > > The only thing it could possibly do is disable RDMA for you. However, since removing it did not produce any undefined symbol errors, my guess is that your OpenMPI isn't using RDMA anyway.
> > >
> > > There might be an OpenMPI bug here, though. I've cc'd the OpenMPI community to see if they have any input. As a summary for them: Simon is trying to build our MPI-enabled application. A script which tries to automate this adds the output of "mpic++ -show". His build then failed because it attempted to link against librdmacm, which does not exist in his normal search paths (or maybe at all). Is it possible that `mpic++ -show' includes/adds "-lrdmacm" even when OpenMPI is not itself using the library?
> > >
> > > Thanks,
> > >
> > > -tom
> > >
> > > > On Tue, Dec 15, 2009 at 8:46 PM, tom fogal wrote:
> > > >
> > > > > Simon Su writes:
> > > > > > I am getting this error message while building 9184.
> > > > > [snip]
> > > > > > -lz -lm -ldl -lpthread -L/usr/local/openmpi/1.3.3/gcc/x86_64/lib64
> > > > > > -lmpi_cxx -lmpi -lopen-rte -lopen-pal -lrdmacm -libverbs -lnuma -ldl -lnsl
> > > > > > -lutil -lm -lcognomen \
> > > > > > -L/usr/local/openmpi/1.3.3/gcc/x86_64/lib64 -lmpi_cxx -lmpi
> > > > > > -lopen-rte -lopen-pal -lrdmacm -libverbs -lnuma -ldl -lnsl -lutil -lm
> > > > > > -lcognomen
> > > > >
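[Editor's note: the "validate your mpic++ and try a direct link" suggestion from the thread can be scripted. A hedged sketch — it assumes an Open MPI install on PATH; hello.c is a throwaway test program invented here, not part of VisIt:]

```shell
# Which mpic++ is first on PATH, and what link line does it emit?
which mpic++
mpic++ -showme        # the flags a build script would scrape

# Minimal link test: if this fails with "cannot find -lrdmacm",
# the wrapper and the installed libraries disagree.
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello.c -o hello
```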
Re: [OMPI users] MPI_Comm_spawn lots of times
In case you missed it, this patch should be in the 1.4 nightly tarballs - feel free to test and let me know what you find. Thanks Ralph On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote: > That was quick. I will try the patch as soon as you release it. > > nick > > > On Wed, Dec 2, 2009 at 21:06, Ralph Castain wrote: > Patch is built and under review... > > Thanks again > Ralph > > On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote: > >> Thanks >> >> On Wed, Dec 2, 2009 at 17:04, Ralph Castain wrote: >> Yeah, that's the one all right! Definitely missing from 1.3.x. >> >> Thanks - I'll build a patch for the next bug-fix release >> >> >> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote: >> >> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain wrote: >> >> Indeed - that is very helpful! Thanks! >> >> Looks like we aren't cleaning up high enough - missing the directory >> >> level. >> >> I seem to recall seeing that error go by and that someone fixed it on our >> >> devel trunk, so this is likely a repair that didn't get moved over to the >> >> release branch as it should have done. >> >> I'll look into it and report back. >> > >> > You are probably referring to >> > https://svn.open-mpi.org/trac/ompi/changeset/21498 >> > >> > There was an issue about orte_session_dir_finalize() not >> > cleaning up the session directories properly. >> > >> > Hope that helps. >> > >> > Abhishek >> > >> >> Thanks again >> >> Ralph >> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote: >> >> >> >> >> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain wrote: >> >>> >> >>> Hmmif you are willing to keep trying, could you perhaps let it run >> >>> for >> >>> a brief time, ctrl-z it, and then do an ls on a directory from a process >> >>> that has already terminated? The pids will be in order, so just look for >> >>> an >> >>> early number (not mpirun or the parent, of course). 
>> >>> It would help if you could give us the contents of a directory from a >> >>> child process that has terminated - would tell us what subsystem is >> >>> failing >> >>> to properly cleanup. >> >> >> >> Ok, so I Ctrl-Z the master. In >> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one >> >> directory >> >> >> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 >> >> >> >> I can't find that PID though. mpirun has PID 4230, orted does not exist, >> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again, >> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there >> >> are 70 sequentially numbered directories starting at 0. Every directory >> >> contains another directory called "0". There is nothing in any of those >> >> directories. I see for instance: >> >> >> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70 >> >> total 4.0K >> >> drwx-- 2 nbock users 4.0K Dec 2 14:41 0 >> >> >> >> and >> >> >> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh >> >> 70/0/ >> >> total 0 >> >> >> >> I hope this information helps. Did I understand your question correctly? 
>> >> nick
Re: [OMPI users] About openmpi-mpirun
Hi Min,

I've had a chance to use the openmpi-mpirun script. Though unfortunately I do not have openmpi available at the moment (the customer where I'm at runs RHEL4 with LAM/MPI), I've been able to see what quotes need to be used, and this one seems to work on my end:

bsub -e ERR -o OUT -n 16 './openmpi-mpirun /bin/sh -c "ulimit -s unlimited ; ./wrf.exe " '

Could you give this a try?

Regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 5:56 PM, Min Zhu wrote:
> Hi, Jeroen,
>
> Thanks a lot. Unfortunately I don't think I have got mpirun.lsf. The Dell company only asked us to use openmpi-mpirun as a wrapper script.
>
> Cheers,
>
> Min Zhu
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 16:49
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> Hi Min,
>
> Sorry for the mixup, but it's been a while since I've actually used LSF. I've had a look at my notes, and to use mpirun with LSF you should give something like this:
>
> bsub -a openmpi -n 16 "mpirun.lsf -x PATH -x LD_LIBRARY_PATH -x MPI_BUFFER_SIZE \" ulimit -s unlimited ; ./wrf.exe \" "
>
> (at least I think you should :) )
>
> The mpirun.lsf is a wrapper provided by LSF, and the -a openmpi tells it to set the necessary openmpi environment variables etc.
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 5:32 PM, Min Zhu wrote:
>> Hi,
>>
>> This time the OUT file is
>>
>> Sender: LSF System
>> Subject: Job 667: ./wrf.exe ' " Exited
>>
>> Job was submitted from host by user .
>> Job was executed on host(s) <8*compute-01>, in queue , as user .
>> <8*compute-12> was used as the home directory.
>> was used as the working directory.
>> Started at Thu Dec 17 18:27:58 2009
>> Results reported at Thu Dec 17 18:28:04 2009
>>
>> Your job looked like:
>>
>> # LSBATCH: User input
>> openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' "
>>
>> Exited with exit code 2.
>> >> Resource usage summary: >> >> CPU time : 0.07 sec. >> >> The output (if any) follows: >> >> >> >> PS: >> >> Read file for stderr output of this job. >> - >> >> ERR file is, >> >> -s: -c: line 0: unexpected EOF while looking for matching `'' >> -s: -c: line 1: syntax error: unexpected end of file >> [this pair of messages repeats 16 times, once per rank] >> -- >> >> Cheers, >> >> Min Zhu >> >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >> Behalf Of Jeroen Kleijer >> Sent: 17 December 2009 16:24 >> To: Open MPI Users >> Subject: Re: [OMPI users] About openmpi-mpirun >> >> Hi Min, >> >> Seems like the command ulimit was executed 16 times but after that >> (the "; ./wrf.exe") was ignored. >> Could you give the following a try: >> >> bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s >> unlimited ; ./wrf.exe ' \" " >> >> Somewhere, quoting goes wrong and I'm trying to figure out where
Re: [OMPI users] About openmpi-mpirun
Hi, Jeroen, Thanks a lot. Unfortunately I don't think I have got mpirun.lsf. The Dell company only asked us to use openmpi-mpirun as a wrapper script. Cheers, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeroen Kleijer Sent: 17 December 2009 16:49 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun Hi Min, Sorry for the mixup but it's been a while since I've actually used LSF. I've had a look at my notes and to use mpirun with LSF you should give something like this: bsub -a openmpi -n 16 "mpirun.lsf -x PATH -x LD_LIBRARY_PATH -x MPI_BUFFER_SIZE \" ulimit -s unlimited ; ./wrf.exe \" " (at least I think you should :) ) The mpirun.lsf is a wrapper provided by LSF and the -a openmpi tells it to set the necessary openmpi environment varibales etc. Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 5:32 PM, Min Zhu wrote: > Hi, > > This time the OUT file is > > Sender: LSF System > Subject: Job 667: ./wrf.exe ' " > Exited > > Job was > submitted from host by user . > Job was executed on host(s) <8*compute-01>, in queue , as user . > <8*compute-12> > was used as the home directory. > was used as the working directory. > Started at Thu Dec 17 18:27:58 2009 > Results reported at Thu Dec 17 18:28:04 2009 > > Your job looked like: > > > # LSBATCH: User input > openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' " > > > Exited with exit code 2. > > Resource usage summary: > > CPU time : 0.07 sec. > > The output (if any) follows: > > > > PS: > > Read file for stderr output of this job. 
> - > > ERR file is, > > -s: -c: line 0: unexpected EOF while looking for matching `'' > -s: -c: line 1: syntax error: unexpected end of file > [this pair of messages repeats 16 times, once per rank] > -- > > Cheers, > > Min Zhu > > -Original Message- > From: 
users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeroen Kleijer > Sent: 17 December 2009 16:24 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > Hi Min, > > Seems like the command ulimit was executed 16 times but after that > (the "; ./wrf.exe") was ignored. > Could you give the following a try: > > bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s > unlimited ; ./wrf.exe ' \" " > > Somewhere, quoting goes wrong and I'm trying to figure out where > > Kind regards, > > Jeroen Kleijer > > On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu wrote: >> Hi, Jeroen, >> >> Here is the OUT file, ERR file is empty. >> >> -- >> Sender: LSF System >> Subject: Job 662: > ' > Done >> >> Job was >> submitted from host by user . >> Job was executed on host(s) <8*compute-10>, in queue , as user . >> <8*compute-11> >> was used as the home directory. >> was used as the working directory. >> Started at Thu Dec 17 17:37
Re: [OMPI users] About openmpi-mpirun
Hi Min, Sorry for the mixup but it's been a while since I've actually used LSF. I've had a look at my notes and to use mpirun with LSF you should give something like this: bsub -a openmpi -n 16 "mpirun.lsf -x PATH -x LD_LIBRARY_PATH -x MPI_BUFFER_SIZE \" ulimit -s unlimited ; ./wrf.exe \" " (at least I think you should :) ) The mpirun.lsf is a wrapper provided by LSF and the -a openmpi tells it to set the necessary openmpi environment variables etc. Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 5:32 PM, Min Zhu wrote: > Hi, > > This time the OUT file is > > Sender: LSF System > Subject: Job 667: ./wrf.exe ' " > Exited > > Job was > submitted from host by user . > Job was executed on host(s) <8*compute-01>, in queue , as user . > <8*compute-12> > was used as the home directory. > was used as the working directory. > Started at Thu Dec 17 18:27:58 2009 > Results reported at Thu Dec 17 18:28:04 2009 > > Your job looked like: > > > # LSBATCH: User input > openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' " > > > Exited with exit code 2. > > Resource usage summary: > > CPU time : 0.07 sec. > > The output (if any) follows: > > > > PS: > > Read file for stderr output of this job. 
> - > > ERR file is, > > -s: -c: line 0: unexpected EOF while looking for matching `'' > -s: -c: line 1: syntax error: unexpected end of file > [this pair of messages repeats 16 times, once per rank] > -- > > Cheers, > > Min Zhu > > -Original Message- > From: 
users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeroen Kleijer > Sent: 17 December 2009 16:24 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > Hi Min, > > Seems like the command ulimit was executed 16 times but after that > (the "; ./wrf.exe") was ignored. > Could you give the following a try: > > bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s > unlimited ; ./wrf.exe ' \" " > > Somewhere, quoting goes wrong and I'm trying to figure out where > > Kind regards, > > Jeroen Kleijer > > On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu wrote: >> Hi, Jeroen, >> >> Here is the OUT file, ERR file is empty. >> >> -- >> Sender: LSF System >> Subject: Job 662: > ' > Done >> >> Job was >> submitted from host by user . >> Job was executed on host(s) <8*compute-10>, in queue , as user . >> <8*compute-11> >> was used as the home directory. >> was used as the working directory. >> Started at Thu Dec 17 17:37:09 2009 >> Results reported at Thu Dec 17 17:37:15 2009 >> >> Your job looked like: >> >> >> # LSBATCH: User input >> openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' >> >> >> Successfully completed. >> >> Resource usage summary: >> >> CPU time : 0.0
Re: [OMPI users] About openmpi-mpirun
Hi, Ashley, I changed the openmpi-mpirun script as you suggested, but wrf.exe still did not execute. Cheers, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ashley Pittman Sent: 17 December 2009 16:39 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun On Thu, 2009-12-17 at 14:40 +, Min Zhu wrote: > Here is the content of openmpi-mpirun file, so maybe something needs to > be changed? > if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then > usage > exit -1 > fi > > MYARGS=$* Shouldn't this be MYARGS=$@ It'll change the way quoted args are forwarded to the parallel job. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains information that may be confidential, and is protected by copyright. It is directed to the intended recipient(s) only. If you have received this e-mail in error please e-mail the sender by replying to this message, and then delete the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail is prohibited. Any communication of a personal nature in this e-mail is not made by or on behalf of any RES group company. E-mails sent or received may be monitored to ensure compliance with the law, regulation and/or our policies.
Re: [OMPI users] About openmpi-mpirun
Hi, This time the OUT file is Sender: LSF System Subject: Job 667: Exited Job was submitted from host by user . Job was executed on host(s) <8*compute-01>, in queue , as user . <8*compute-12> was used as the home directory. was used as the working directory. Started at Thu Dec 17 18:27:58 2009 Results reported at Thu Dec 17 18:28:04 2009 Your job looked like: # LSBATCH: User input openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' " Exited with exit code 2. Resource usage summary: CPU time : 0.07 sec. The output (if any) follows: PS: Read file for stderr output of this job. - ERR file is, -s: -c: line 0: unexpected EOF while looking for matching `'' -s: -c: line 1: syntax error: unexpected end of file [this pair of messages repeats 16 times, once per rank] -- Cheers, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeroen Kleijer Sent: 17 December 2009 16:24 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun Hi Min, Seems like the command ulimit was executed 16 times but after that (the "; ./wrf.exe") was ignored. Could you give the following a try: bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' \" " Somewhere, quoting goes wrong and I'm trying to figure out where Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu wrote: > Hi, Jeroen, > > Here is the OUT file, ERR file is empty. > > -- > Sender: LSF System > Subject: Job 662: ' > Done > > Job was > submitted from host by user . > Job was executed on host(s) <8*compute-10>, in queue , as user . > <8*compute-11> > was used as the home directory. > was used as the working directory. > Started at Thu Dec 17 17:37:09 2009 > Results reported at Thu Dec 17 17:37:15 2009 > > Your job looked like: > > > # LSBATCH: User input > openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' > > > Successfully completed. > > Resource usage summary: > > CPU time : 0.05 sec. > > The output (if any) follows: > > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > > > PS: > > Read file for stderr output of this job. 
> > -- > > Thanks, > > Min Zhu > > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeroen Kleijer > Sent: 17 December 2009 16:05 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > Hi Min > > Did you get any type of error message in the ERR or OUT files? > I don't have mpirun installed in the environment at the moment but > giving the following: > > bsub -q interq -I "ssh /bin/sh -c 'ulimit -s unlimited ; > /bin/hostname ' " > > seems to work for me, so I'm kind of curious what the error message is > you're seeing. > > Kind regards, > > Jeroen Kleijer
Re: [OMPI users] About openmpi-mpirun
On Thu, 2009-12-17 at 14:40 +, Min Zhu wrote: > Here is the content of openmpi-mpirun file, so maybe something needs to > be changed? > if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then > usage > exit -1 > fi > > MYARGS=$* Shouldn't this be MYARGS=$@ It'll change the way quoted args are forwarded to the parallel job. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
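Ashley's point about $* versus "$@" can be checked with a small stand-alone sketch; the `count` function below is a hypothetical stand-in for the mpirun invocation at the end of the wrapper, and the argument list mimics the /bin/sh -c command being forwarded. With MYARGS=$* (or an unquoted $@) the quoted command string is re-split into words before it reaches mpirun; only a quoted "$@" preserves the original argument boundaries.

```shell
#!/bin/sh
# Illustrative sketch: how $*, $@ and "$@" forward an argument that
# contains spaces. `count` is a hypothetical stand-in for mpirun.
count() { echo $#; }

# Simulate: openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe'
set -- /bin/sh -c 'ulimit -s unlimited; ./wrf.exe'

count $*      # re-split on whitespace: prints 6
count $@      # unquoted $@ splits the same way: prints 6
count "$@"    # original boundaries kept: prints 3
```

Note that `MYARGS=$*` flattens the arguments into a single string at assignment time, so even a later quoted expansion cannot recover the original word boundaries; forwarding with "$@" directly avoids the problem.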
Re: [OMPI users] About openmpi-mpirun
Hi Min, Seems like the command ulimit was executed 16 times but after that (the "; ./wrf.exe") was ignored. Could you give the following a try: bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' \" " Somewhere, quoting goes wrong and I'm trying to figure out where Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu wrote: > Hi, Jeroen, > > Here is the OUT file, ERR file is empty. > > -- > Sender: LSF System > Subject: Job 662: ' > Done > > Job was > submitted from host by user . > Job was executed on host(s) <8*compute-10>, in queue , as user . > <8*compute-11> > was used as the home directory. > was used as the working directory. > Started at Thu Dec 17 17:37:09 2009 > Results reported at Thu Dec 17 17:37:15 2009 > > Your job looked like: > > > # LSBATCH: User input > openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' > > > Successfully completed. > > Resource usage summary: > > CPU time : 0.05 sec. > > The output (if any) follows: > > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > unlimited > > > PS: > > Read file for stderr output of this job. > > -- > > Thanks, > > Min Zhu > > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeroen Kleijer > Sent: 17 December 2009 16:05 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > Hi Min > > Did you get any type of error message in the ERR or OUT files? > I don't have mpirun installed in the environment at the moment but > giving the following: > > bsub -q interq -I "ssh /bin/sh -c 'ulimit -s unlimited ; > /bin/hostname ' " > > seems to work for me, so I'm kind of curious what the error message is > you're seeing. > > Kind regards, > > Jeroen Kleijer > > On Thu, Dec 17, 2009 at 4:41 PM, Min Zhu wrote: >> Hi, Jeroen, >> >> Thanks for your reply. 
I tried the command bsub -e ERR -o OUT -n 16 >> "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe >> not executed. >> >> Cheers, >> >> Min Zhu >> >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >> Behalf Of Jeroen Kleijer >> Sent: 17 December 2009 15:34 >> To: Open MPI Users >> Subject: Re: [OMPI users] About openmpi-mpirun >> >> It's just that the "'s on the command line get parsed by LSF / bash >> (or whatever shell you use) >> >> If you wish to use it without the script you can give this a try: >> bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s >> unlimited; ./wrf.exe ' " >> >> This causes to pass the whole string "openmpi-mpirun " to be >> passed as a single string / command to LSF. >> The second line between the single quotes is then passed as a single >> argument to /bin/sh which is run by openmpi-mpirun. >> >> Kind regards, >> >> Jeroen Kleijer >> >> On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu wrote: >>> Hi, Jeff, >>> >>> Your script method works for me. Thank you very much, >>> >>> Cheers, >>> >>> Min Zhu >>> >>> >>> -Original Message- >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >>> Behalf Of Jeff Squyres >>> Sent: 17 December 2009 14:56 >>> To: Open MPI Users >>> Subject: Re: [OMPI users] About openmpi-mpirun >>> >>> This might be something you need to talk to Platform about...? >>> >>> Another option would be to openmpi-mpirun a script that is just a few >>> lines long: >>> >>> #!/bin/sh >>> ulimit -s unlimited >>> ./wrf.exe >>> >>> >>> >>> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: >>> Hi, Jeff, Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed. Here is the content of openmpi-mpirun file, so maybe something needs >>> to be changed? 
-- #!/bin/sh # # Copyright (c) 2007 Platform Computing # # This script is a wrapper for openmpi mpirun # it generates the machine file based on the hosts # given to it by Lava. # usage() { cat << USEEOF USAGE: $0 This command is a wrapper for mpirun (openmpi). It can only be run within Lava using bsub e.g. bsub -n # "$0 -np # {my mpi command and args}" The wrapper will automatically generate the machinefile used by mpirun. NOTE: The list of hosts cannot exceed 4KBytes. USEEOF } if [ x"${LSB_JOBFILENAME}" = x -o
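A plausible reconstruction of the symptom Jeroen describes (sixteen bare `ulimit` outputs, wrf.exe never started): once the inner quotes are stripped on the way through bsub and the wrapper, `/bin/sh -c` receives only the first word after -c as its script, and the remaining words become positional parameters. A minimal sketch, assuming the quote loss happens before /bin/sh runs:

```shell
#!/bin/sh
# sh -c treats only the FIRST operand after -c as the script;
# later words become $0, $1, ...

# Quoting lost: runs a bare `ulimit` (which just prints the current
# limit, typically "unlimited") and never reaches ./wrf.exe
sh -c ulimit -s unlimited

# Quoting intact: both commands in the string execute
sh -c 'ulimit -s unlimited; echo would-run-wrf'
```

Run once per rank, the first form would produce exactly the sixteen "unlimited" lines seen in the successful-looking OUT file.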
Re: [OMPI users] About openmpi-mpirun
Hi, Jeroen, Here is the OUT file, ERR file is empty. -- Sender: LSF System Subject: Job 662: Done Job was submitted from host by user . Job was executed on host(s) <8*compute-10>, in queue , as user . <8*compute-11> was used as the home directory. was used as the working directory. Started at Thu Dec 17 17:37:09 2009 Results reported at Thu Dec 17 17:37:15 2009 Your job looked like: # LSBATCH: User input openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' Successfully completed. Resource usage summary: CPU time : 0.05 sec. The output (if any) follows: unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited PS: Read file for stderr output of this job. -- Thanks, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeroen Kleijer Sent: 17 December 2009 16:05 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun Hi Min Did you get any type of error message in the ERR or OUT files? I don't have mpirun installed in the environment at the moment but giving the following: bsub -q interq -I "ssh /bin/sh -c 'ulimit -s unlimited ; /bin/hostname ' " seems to work for me, so I'm kind of curious what the error message is you're seeing. Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 4:41 PM, Min Zhu wrote: > Hi, Jeroen, > > Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 > "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe > not executed. 
> > Cheers, > > Min Zhu > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeroen Kleijer > Sent: 17 December 2009 15:34 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > It's just that the "'s on the command line get parsed by LSF / bash > (or whatever shell you use) > > If you wish to use it without the script you can give this a try: > bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s > unlimited; ./wrf.exe ' " > > This causes to pass the whole string "openmpi-mpirun " to be > passed as a single string / command to LSF. > The second line between the single quotes is then passed as a single > argument to /bin/sh which is run by openmpi-mpirun. > > Kind regards, > > Jeroen Kleijer > > On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu wrote: >> Hi, Jeff, >> >> Your script method works for me. Thank you very much, >> >> Cheers, >> >> Min Zhu >> >> >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >> Behalf Of Jeff Squyres >> Sent: 17 December 2009 14:56 >> To: Open MPI Users >> Subject: Re: [OMPI users] About openmpi-mpirun >> >> This might be something you need to talk to Platform about...? >> >> Another option would be to openmpi-mpirun a script that is just a few >> lines long: >> >> #!/bin/sh >> ulimit -s unlimited >> ./wrf.exe >> >> >> >> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: >> >>> Hi, Jeff, >>> >>> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit >>> -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed. >>> >>> Here is the content of openmpi-mpirun file, so maybe something needs >> to >>> be changed? >>> >>> -- >>> #!/bin/sh >>> # >>> # Copyright (c) 2007 Platform Computing >>> # >>> # This script is a wrapper for openmpi mpirun >>> # it generates the machine file based on the hosts >>> # given to it by Lava. 
>>> # >>> >>> usage() { >>> cat << USEEOF >>> USAGE: $0 >>> This command is a wrapper for mpirun (openmpi). It can >>> only be run within Lava using bsub e.g. >>> bsub -n # "$0 -np # {my mpi command and args}" >>> >>> The wrapper will automatically generate the >>> machinefile used by mpirun. >>> >>> NOTE: The list of hosts cannot exceed 4KBytes. >>> USEEOF >>> } >>> >>> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then >>> usage >>> exit -1 >>> fi >>> >>> MYARGS=$* >>> WORKDIR=`dirname ${LSB_JOBFILENAME}` >>> MACHFILE=${WORKDIR}/mpi_machines >>> ARGLIST=${WORKDIR}/mpi_args >>> >>> # Check if mpirun is in the PATH >>> T=`which mpirun` >>> if [ $? -ne 0 ]; then >>> echo "Error: mpirun is not in your PATH." >>> exit -2 >>> fi >>> >>> echo "${MYARGS}" > ${ARGLIST} >>> T=`grep -- -machinefile ${ARGLIST} |wc -l` >>> if [ $T -gt 0 ]; then >>> echo "Error: Do not provide the machinefile for mpirun." >>> echo " It is generated automatically for you." >>> exit -3 >>> fi >>> >>> # Make the
Re: [OMPI users] About openmpi-mpirun
Hi Min Did you get any type of error message in the ERR or OUT files? I don't have mpirun installed in the environment at the moment but giving the following: bsub -q interq -I "ssh /bin/sh -c 'ulimit -s unlimited ; /bin/hostname ' " seems to work for me, so I'm kind of curious what the error message is you're seeing. Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 4:41 PM, Min Zhu wrote: > Hi, Jeroen, > > Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 > "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe > not executed. > > Cheers, > > Min Zhu > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeroen Kleijer > Sent: 17 December 2009 15:34 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > It's just that the "'s on the command line get parsed by LSF / bash > (or whatever shell you use) > > If you wish to use it without the script you can give this a try: > bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s > unlimited; ./wrf.exe ' " > > This causes to pass the whole string "openmpi-mpirun " to be > passed as a single string / command to LSF. > The second line between the single quotes is then passed as a single > argument to /bin/sh which is run by openmpi-mpirun. > > Kind regards, > > Jeroen Kleijer > > On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu wrote: >> Hi, Jeff, >> >> Your script method works for me. Thank you very much, >> >> Cheers, >> >> Min Zhu >> >> >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >> Behalf Of Jeff Squyres >> Sent: 17 December 2009 14:56 >> To: Open MPI Users >> Subject: Re: [OMPI users] About openmpi-mpirun >> >> This might be something you need to talk to Platform about...? 
>> >> Another option would be to openmpi-mpirun a script that is just a few >> lines long: >> >> #!/bin/sh >> ulimit -s unlimited >> ./wrf.exe >> >> >> >> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: >> >>> Hi, Jeff, >>> >>> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit >>> -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed. >>> >>> Here is the content of openmpi-mpirun file, so maybe something needs >> to >>> be changed? >>> >>> -- >>> #!/bin/sh >>> # >>> # Copyright (c) 2007 Platform Computing >>> # >>> # This script is a wrapper for openmpi mpirun >>> # it generates the machine file based on the hosts >>> # given to it by Lava. >>> # >>> >>> usage() { >>> cat << USEEOF >>> USAGE: $0 >>> This command is a wrapper for mpirun (openmpi). It can >>> only be run within Lava using bsub e.g. >>> bsub -n # "$0 -np # {my mpi command and args}" >>> >>> The wrapper will automatically generate the >>> machinefile used by mpirun. >>> >>> NOTE: The list of hosts cannot exceed 4KBytes. >>> USEEOF >>> } >>> >>> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then >>> usage >>> exit -1 >>> fi >>> >>> MYARGS=$* >>> WORKDIR=`dirname ${LSB_JOBFILENAME}` >>> MACHFILE=${WORKDIR}/mpi_machines >>> ARGLIST=${WORKDIR}/mpi_args >>> >>> # Check if mpirun is in the PATH >>> T=`which mpirun` >>> if [ $? -ne 0 ]; then >>> echo "Error: mpirun is not in your PATH." >>> exit -2 >>> fi >>> >>> echo "${MYARGS}" > ${ARGLIST} >>> T=`grep -- -machinefile ${ARGLIST} |wc -l` >>> if [ $T -gt 0 ]; then >>> echo "Error: Do not provide the machinefile for mpirun." >>> echo " It is generated automatically for you." >>> exit -3 >>> fi >>> >>> # Make the open-mpi machine file >>> echo "${LSB_HOSTS}" > ${MACHFILE}.lst >>> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE} >>> >>> MPIRUN=`which --skip-alias mpirun` >>> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS} >>> >>> exit $? 
>>> >>> -- >>> >>> >>> Cheers, >>> >>> Min Zhu >>> >>> -Original Message- >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] >> On >>> Behalf Of Jeff Squyres >>> Sent: 17 December 2009 14:29 >>> To: Open MPI Users >>> Subject: Re: [OMPI users] About openmpi-mpirun >>> >>> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: >>> >>> > Thanks for your reply. Yes, your mpirun command works for me. But I >>> need to use bsub job scheduler. I wonder why >>> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s >>> unlimited; ./wrf.exe" doesn't work. >>> >>> Try with different quoting...? I don't know the details of the >>> openmpi-mpirun script, but perhaps it's trying to exec the whole >> quoted >>> string as a single executable (which doesn't exist). Perhaps: >>> >>> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s >> unlimited; >>> ./wrf.exe" >>> >>> That's a (somewhat educated) guess... >>> >>> -- >>> >>> Jeff Squyres >>> jsquy...@cisco.com >>> >>> >>> __
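The machinefile step of the Platform wrapper quoted above can be exercised on its own: `tr '\/ ' '\r\n'` rewrites the space-separated $LSB_HOSTS list as one host per line (any '/' delimiter becomes a carriage return). The host names below are made up for illustration.

```shell
#!/bin/sh
# Stand-alone sketch of the wrapper's machinefile generation.
# Hypothetical host list; LSF/Lava supplies one entry per allocated slot.
LSB_HOSTS="compute-10 compute-10 compute-11 compute-11"

MACHFILE=./mpi_machines
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
# space -> newline, '/' -> carriage return, exactly as in the wrapper
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

cat ${MACHFILE}   # one host per line, ready for mpirun -machinefile
```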
Re: [OMPI users] About openmpi-mpirun
Hi, Jeroen, Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe not executed. Cheers, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeroen Kleijer Sent: 17 December 2009 15:34 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun It's just that the "'s on the command line get parsed by LSF / bash (or whatever shell you use) If you wish to use it without the script you can give this a try: bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " This causes to pass the whole string "openmpi-mpirun " to be passed as a single string / command to LSF. The second line between the single quotes is then passed as a single argument to /bin/sh which is run by openmpi-mpirun. Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu wrote: > Hi, Jeff, > > Your script method works for me. Thank you very much, > > Cheers, > > Min Zhu > > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: 17 December 2009 14:56 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > This might be something you need to talk to Platform about...? > > Another option would be to openmpi-mpirun a script that is just a few > lines long: > > #!/bin/sh > ulimit -s unlimited > ./wrf.exe > > > > On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: > >> Hi, Jeff, >> >> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit >> -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed. >> >> Here is the content of openmpi-mpirun file, so maybe something needs > to >> be changed? >> >> -- >> #!/bin/sh >> # >> # Copyright (c) 2007 Platform Computing >> # >> # This script is a wrapper for openmpi mpirun >> # it generates the machine file based on the hosts >> # given to it by Lava. 
>> # >> >> usage() { >> cat << USEEOF >> USAGE: $0 >> This command is a wrapper for mpirun (openmpi). It can >> only be run within Lava using bsub e.g. >> bsub -n # "$0 -np # {my mpi command and args}" >> >> The wrapper will automatically generate the >> machinefile used by mpirun. >> >> NOTE: The list of hosts cannot exceed 4KBytes. >> USEEOF >> } >> >> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then >> usage >> exit -1 >> fi >> >> MYARGS=$* >> WORKDIR=`dirname ${LSB_JOBFILENAME}` >> MACHFILE=${WORKDIR}/mpi_machines >> ARGLIST=${WORKDIR}/mpi_args >> >> # Check if mpirun is in the PATH >> T=`which mpirun` >> if [ $? -ne 0 ]; then >> echo "Error: mpirun is not in your PATH." >> exit -2 >> fi >> >> echo "${MYARGS}" > ${ARGLIST} >> T=`grep -- -machinefile ${ARGLIST} |wc -l` >> if [ $T -gt 0 ]; then >> echo "Error: Do not provide the machinefile for mpirun." >> echo " It is generated automatically for you." >> exit -3 >> fi >> >> # Make the open-mpi machine file >> echo "${LSB_HOSTS}" > ${MACHFILE}.lst >> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE} >> >> MPIRUN=`which --skip-alias mpirun` >> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS} >> >> exit $? >> >> -- >> >> >> Cheers, >> >> Min Zhu >> >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] > On >> Behalf Of Jeff Squyres >> Sent: 17 December 2009 14:29 >> To: Open MPI Users >> Subject: Re: [OMPI users] About openmpi-mpirun >> >> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: >> >> > Thanks for your reply. Yes, your mpirun command works for me. But I >> need to use bsub job scheduler. I wonder why >> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s >> unlimited; ./wrf.exe" doesn't work. >> >> Try with different quoting...? I don't know the details of the >> openmpi-mpirun script, but perhaps it's trying to exec the whole > quoted >> string as a single executable (which doesn't exist). 
Perhaps: >> >> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s > unlimited; >> ./wrf.exe" >> >> That's a (somewhat educated) guess... >> >> -- >> >> Jeff Squyres >> jsquy...@cisco.com >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> CONFIDENTIALITY NOTICE: This e-mail, including any attachments, > contains information that may be confidential, and is protected by > copyright. It is directed to the intended recipient(s) only. If you > have received this e-mail in error please e-mail the sender by replying > to this message, and then delete the e-mail. Unauthorised disclosure, > publication, copying or use of this e-mail is prohibited. Any > communication of a personal nature in this e-mail is not made by or on > behalf of any RES group company. E-mails sent or received may be > monitored to ensure compliance with the law, regulation and/or our > policies.
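Jeff's diagnosis about quoting layers can be checked without LSF at all. The sketch below uses eval as a stand-in for the extra round of parsing that the job scheduler performs on its quoted argument string (no bsub or openmpi-mpirun involved); it shows why wrapping the inner command in single quotes keeps it intact through that extra round:

```shell
# The scheduler re-parses its quoted argument string with a shell; eval is a
# stand-in for that step. The single-quoted part survives re-parsing and is
# handed to /bin/sh -c as one argument, so both commands inside it run.
cmd="/bin/sh -c 'ulimit -s; echo app-started'"
out=$(eval "$cmd")
echo "$out"   # prints the stack limit on one line, then "app-started"
```

With double quotes instead of the inner single quotes, the first parsing round would already split the string and /bin/sh -c would receive only its first word.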
Re: [OMPI users] About openmpi-mpirun
It's just that the "'s on the command line get parsed by LSF / bash (or whatever shell you use). If you wish to use it without the script you can give this a try: bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " This causes the whole string "openmpi-mpirun " to be passed as a single command to LSF. The part between the single quotes is then passed as a single argument to /bin/sh, which is run by openmpi-mpirun. Kind regards, Jeroen Kleijer On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu wrote: > Hi, Jeff, > > Your script method works for me. Thank you very much, > > Cheers, > > Min Zhu > > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: 17 December 2009 14:56 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > This might be something you need to talk to Platform about...? > > Another option would be to openmpi-mpirun a script that is just a few > lines long: > > #!/bin/sh > ulimit -s unlimited > ./wrf.exe > > > > On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: > >> Hi, Jeff, >> >> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit >> -s unlimited; ./wrf.exe", I tried and wrf.exe wasn't executed. >> >> Here is the content of the openmpi-mpirun file, so maybe something needs > to >> be changed? >> >> -- >> #!/bin/sh >> # >> # Copyright (c) 2007 Platform Computing >> # >> # This script is a wrapper for openmpi mpirun >> # it generates the machine file based on the hosts >> # given to it by Lava. >> # >> >> usage() { >> cat << USEEOF >> USAGE: $0 >> This command is a wrapper for mpirun (openmpi). It can >> only be run within Lava using bsub e.g. >> bsub -n # "$0 -np # {my mpi command and args}" >> >> The wrapper will automatically generate the >> machinefile used by mpirun. >> >> NOTE: The list of hosts cannot exceed 4KBytes. 
>> USEEOF >> } >> >> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then >> usage >> exit -1 >> fi >> >> MYARGS=$* >> WORKDIR=`dirname ${LSB_JOBFILENAME}` >> MACHFILE=${WORKDIR}/mpi_machines >> ARGLIST=${WORKDIR}/mpi_args >> >> # Check if mpirun is in the PATH >> T=`which mpirun` >> if [ $? -ne 0 ]; then >> echo "Error: mpirun is not in your PATH." >> exit -2 >> fi >> >> echo "${MYARGS}" > ${ARGLIST} >> T=`grep -- -machinefile ${ARGLIST} |wc -l` >> if [ $T -gt 0 ]; then >> echo "Error: Do not provide the machinefile for mpirun." >> echo " It is generated automatically for you." >> exit -3 >> fi >> >> # Make the open-mpi machine file >> echo "${LSB_HOSTS}" > ${MACHFILE}.lst >> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE} >> >> MPIRUN=`which --skip-alias mpirun` >> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS} >> >> exit $? >> >> -- >> >> >> Cheers, >> >> Min Zhu >> >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] > On >> Behalf Of Jeff Squyres >> Sent: 17 December 2009 14:29 >> To: Open MPI Users >> Subject: Re: [OMPI users] About openmpi-mpirun >> >> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: >> >> > Thanks for your reply. Yes, your mpirun command works for me. But I >> need to use bsub job scheduler. I wonder why >> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s >> unlimited; ./wrf.exe" doesn't work. >> >> Try with different quoting...? I don't know the details of the >> openmpi-mpirun script, but perhaps it's trying to exec the whole > quoted >> string as a single executable (which doesn't exist). Perhaps: >> >> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s > unlimited; >> ./wrf.exe" >> >> That's a (somewhat educated) guess... 
>> >> -- >> >> Jeff Squyres >> jsquy...@cisco.com >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > -- > > Jeff Squyres > jsquy...@cisco.com
Re: [OMPI users] About openmpi-mpirun
Hi, Jeff, Your script method works for me. Thank you very much, Cheers, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres Sent: 17 December 2009 14:56 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun This might be something you need to talk to Platform about...? Another option would be to openmpi-mpirun a script that is just a few lines long: #!/bin/sh ulimit -s unlimited ./wrf.exe On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: > Hi, Jeff, > > Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit > -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed. > > Here is the content of openmpi-mpirun file, so maybe something needs to > be changed? > > -- > #!/bin/sh > # > # Copyright (c) 2007 Platform Computing > # > # This script is a wrapper for openmpi mpirun > # it generates the machine file based on the hosts > # given to it by Lava. > # > > usage() { > cat < USAGE: $0 > This command is a wrapper for mpirun (openmpi). It can > only be run within Lava using bsub e.g. > bsub -n # "$0 -np # {my mpi command and args}" > > The wrapper will automatically generate the > machinefile used by mpirun. > > NOTE: The list of hosts cannot exceed 4KBytes. > USEEOF > } > > if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then > usage > exit -1 > fi > > MYARGS=$* > WORKDIR=`dirname ${LSB_JOBFILENAME}` > MACHFILE=${WORKDIR}/mpi_machines > ARGLIST=${WORKDIR}/mpi_args > > # Check if mpirun is in the PATH > T=`which mpirun` > if [ $? -ne 0 ]; then > echo "Error: mpirun is not in your PATH." > exit -2 > fi > > echo "${MYARGS}" > ${ARGLIST} > T=`grep -- -machinefile ${ARGLIST} |wc -l` > if [ $T -gt 0 ]; then > echo "Error: Do not provide the machinefile for mpirun." > echo "It is generated automatically for you." 
> exit -3 > fi > > # Make the open-mpi machine file > echo "${LSB_HOSTS}" > ${MACHFILE}.lst > tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE} > > MPIRUN=`which --skip-alias mpirun` > ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS} > > exit $? > > -- > > > Cheers, > > Min Zhu > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: 17 December 2009 14:29 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: > > > Thanks for your reply. Yes, your mpirun command works for me. But I > need to use bsub job scheduler. I wonder why > > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s > unlimited; ./wrf.exe" doesn't work. > > Try with different quoting...? I don't know the details of the > openmpi-mpirun script, but perhaps it's trying to exec the whole quoted > string as a single executable (which doesn't exist). Perhaps: > > bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; > ./wrf.exe" > > That's a (somewhat educated) guess... > > -- > > Jeff Squyres > jsquy...@cisco.com > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains information that may be confidential, and is protected by copyright. It is directed to the intended recipient(s) only. If you have received this e-mail in error please e-mail the sender by replying to this message, and then delete the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail is prohibited. Any communication of a personal nature in this e-mail is not made by or on behalf of any RES group company. E-mails sent or received may be monitored to ensure compliance with the law, regulation and/or our policies. 
> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Jeff Squyres jsquy...@cisco.com ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
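The machinefile step of the Platform wrapper quoted above is easy to sanity-check outside LSF. A small sketch (the host names and the LSB_HOSTS value are made up; under Lava/LSF the variable is exported by the scheduler):

```shell
# Simulate the host list that LSF would export, then apply the wrapper's
# exact transformation: tr maps '/' and ' ' to line breaks, producing one
# host per line, which is the format mpirun's -machinefile expects.
LSB_HOSTS="compute-10 compute-10 compute-11 compute-11"
MACHFILE=./mpi_machines

echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

cat ${MACHFILE}
```

Because LSF repeats a host name once per allocated slot, the resulting file gives mpirun one line per rank to place.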
Re: [OMPI users] Debugging spawned processes
On Dec 17, 2009, at 3:54 AM, jody wrote: > yeah, know that you mention it, i remember (old brain here, as well) > But IIRC you created a OMPI version which was called 1.4a1r or something, > where i indeed could use this xterm. When i updated to 1.3.2, i sort > of forgot about it again... The 1.4a1rxxx version was just our development trunk at the time - it was then moved to the official release branch (1.3). > > Another question though: > You said "If it includes the -xterm option, then that option gets > applied to the dynamically spawned procs too" > Does this passing on also apply to the -x options? Certainly should, though I can't claim to have personally tested it. > > Thanks > Jody > > On Wed, Dec 16, 2009 at 3:42 PM, Ralph Castain wrote: >> It is in a later version - pretty sure it made 1.3.3. IIRC, I added it at >> your request :-) >> >> On Dec 16, 2009, at 7:20 AM, jody wrote: >> >>> Thanks for your reply >>> >>> That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not >>> to recognize the --xterm option. >>> [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf >>> -- >>> mpirun was unable to launch the specified application as it could not >>> find an executable: >>> >>> Executable: 1 >>> Node: aim-plankton.uzh.ch >>> >>> while attempting to start process rank 0. >>> -- >>> (if i reverse the --xterm and -np 1, it complains about not finding >>> executable '9') >>> Do i need to install a higher version, or is this something i'd have >>> to set as option in configure? >>> >>> Thank You >>> Jody >>> >>> On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain wrote: Depends on the version you are working with. If it includes the -xterm option, then that option gets applied to the dynamically spawned procs too, so this should be automatically taken care of...but in that case, you wouldn't need your script to open an xterm anyway. You would just do: mpirun --xterm -np 5 gdb ./my_app or the equivalent. 
You would then comm_spawn an argv[0] of "gdb", with argv[1] being your target app. I don't know how to avoid including that "gdb" in the comm_spawn argv's - I once added an mpirun cmd line option to automatically add it, but got loudly told to remove it. Of course, it should be easy to pass an option to your app itself that tells it whether or not to do so! HTH Ralph On Dec 16, 2009, at 4:06 AM, jody wrote: > Hi > Until now i always wrote applications for which the number of processes > was given on the command line with -np. > To debug these applications i wrote a script, run_gdb.sh which basically > open a xterm and starts gdb in it for my application. > This allowed me to have a window for each of the processes being debugged. > > Now, however, i write my first application in which additional processes > are > being spawned. My question is now: how can i open xterm windows in which > gdb runs for the spawned processes? > > The only way i can think of is to pass my script run_gdb.sh into the argv > parameters of MPI_Spawn. > Would this be correct? > If yes, what about other parameters passed to the spawning process, such > as > environment variables passed via -x? Are they being passed to the spawned > processes as well? In my case this would be necessary so that processes > on other machine will get the $DISPLAY environment variable in order to > display their xterms with gdb on my workstation. > > Another negative point would be the need to change the argv parameters > every time one switches between debugging and normal running. > > Has anybody got some hints on how to debug spawned processes? 
> > Thank You > Jody > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
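For reference, a wrapper of the kind jody describes could look roughly like this. It is a sketch only: the real run_gdb.sh is not shown in the thread, xterm and gdb must be installed on every node, $DISPLAY must be forwarded (e.g. with mpirun -x DISPLAY), and the OMPI_COMM_WORLD_RANK variable is assumed to be provided by Open MPI's environment:

```shell
# Write a hypothetical run_gdb.sh: one xterm per rank, each running gdb on
# the target program. Extra arguments are passed through to the debuggee.
cat > run_gdb.sh <<'EOF'
#!/bin/sh
exec xterm -T "rank $OMPI_COMM_WORLD_RANK" -e gdb --args "$@"
EOF
chmod +x run_gdb.sh
```

It would be launched as something like mpirun -np 4 -x DISPLAY ./run_gdb.sh ./my_app, and the same script name can be given as the command in MPI_Comm_spawn's argv so that dynamically spawned children also come up under the debugger.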
Re: [OMPI users] About openmpi-mpirun
This might be something you need to talk to Platform about...? Another option would be to openmpi-mpirun a script that is just a few lines long: #!/bin/sh ulimit -s unlimited ./wrf.exe On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: > Hi, Jeff, > > Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit > -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed. > > Here is the content of openmpi-mpirun file, so maybe something needs to > be changed? > > -- > #!/bin/sh > # > # Copyright (c) 2007 Platform Computing > # > # This script is a wrapper for openmpi mpirun > # it generates the machine file based on the hosts > # given to it by Lava. > # > > usage() { > cat < USAGE: $0 > This command is a wrapper for mpirun (openmpi). It can > only be run within Lava using bsub e.g. > bsub -n # "$0 -np # {my mpi command and args}" > > The wrapper will automatically generate the > machinefile used by mpirun. > > NOTE: The list of hosts cannot exceed 4KBytes. > USEEOF > } > > if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then > usage > exit -1 > fi > > MYARGS=$* > WORKDIR=`dirname ${LSB_JOBFILENAME}` > MACHFILE=${WORKDIR}/mpi_machines > ARGLIST=${WORKDIR}/mpi_args > > # Check if mpirun is in the PATH > T=`which mpirun` > if [ $? -ne 0 ]; then > echo "Error: mpirun is not in your PATH." > exit -2 > fi > > echo "${MYARGS}" > ${ARGLIST} > T=`grep -- -machinefile ${ARGLIST} |wc -l` > if [ $T -gt 0 ]; then > echo "Error: Do not provide the machinefile for mpirun." > echo "It is generated automatically for you." > exit -3 > fi > > # Make the open-mpi machine file > echo "${LSB_HOSTS}" > ${MACHFILE}.lst > tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE} > > MPIRUN=`which --skip-alias mpirun` > ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS} > > exit $? 
> > -- > > > Cheers, > > Min Zhu > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: 17 December 2009 14:29 > To: Open MPI Users > Subject: Re: [OMPI users] About openmpi-mpirun > > On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: > > > Thanks for your reply. Yes, your mpirun command works for me. But I > need to use bsub job scheduler. I wonder why > > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s > unlimited; ./wrf.exe" doesn't work. > > Try with different quoting...? I don't know the details of the > openmpi-mpirun script, but perhaps it's trying to exec the whole quoted > string as a single executable (which doesn't exist). Perhaps: > > bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; > ./wrf.exe" > > That's a (somewhat educated) guess... > > -- > > Jeff Squyres > jsquy...@cisco.com > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains > information that may be confidential, and is protected by copyright. It is > directed to the intended recipient(s) only. If you have received this e-mail > in error please e-mail the sender by replying to this message, and then > delete the e-mail. Unauthorised disclosure, publication, copying or use of > this e-mail is prohibited. Any communication of a personal nature in this > e-mail is not made by or on behalf of any RES group company. E-mails sent or > received may be monitored to ensure compliance with the law, regulation > and/or our policies. > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI users] About openmpi-mpirun
Hi, Jeff,

Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; ./wrf.exe", I tried it and wrf.exe was not executed.

Here is the content of the openmpi-mpirun file, so maybe something needs to be changed?

--
#!/bin/sh
#
# Copyright (c) 2007 Platform Computing
#
# This script is a wrapper for openmpi mpirun
# it generates the machine file based on the hosts
# given to it by Lava.
#

usage() {
cat << USEEOF
USAGE: $0
This command is a wrapper for mpirun (openmpi). It can
only be run within Lava using bsub e.g.
bsub -n # "$0 -np # {my mpi command and args}"

The wrapper will automatically generate the
machinefile used by mpirun.

NOTE: The list of hosts cannot exceed 4KBytes.
USEEOF
}

if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
usage
exit -1
fi

MYARGS=$*
WORKDIR=`dirname ${LSB_JOBFILENAME}`
MACHFILE=${WORKDIR}/mpi_machines
ARGLIST=${WORKDIR}/mpi_args

# Check if mpirun is in the PATH
T=`which mpirun`
if [ $? -ne 0 ]; then
echo "Error: mpirun is not in your PATH."
exit -2
fi

echo "${MYARGS}" > ${ARGLIST}
T=`grep -- -machinefile ${ARGLIST} |wc -l`
if [ $T -gt 0 ]; then
echo "Error: Do not provide the machinefile for mpirun."
echo "It is generated automatically for you."
exit -3
fi

# Make the open-mpi machine file
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

MPIRUN=`which --skip-alias mpirun`
${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}

exit $?
--

Cheers,

Min Zhu

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: 17 December 2009 14:29
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:

> Thanks for your reply. Yes, your mpirun command works for me. But I need to use the bsub job scheduler. I wonder why
> bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" doesn't work.

Try with different quoting...? I don't know the details of the openmpi-mpirun script, but perhaps it's trying to exec the whole quoted string as a single executable (which doesn't exist). Perhaps:

bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; ./wrf.exe"

That's a (somewhat educated) guess...

--
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] About openmpi-mpirun
On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: > Thanks for your reply. Yes, your mpirun command works for me. But I need to > use bsub job scheduler. I wonder why > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; > ./wrf.exe" doesn't work. Try with different quoting...? I don't know the details of the openmpi-mpirun script, but perhaps it's trying to exec the whole quoted string as a single executable (which doesn't exist). Perhaps: bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; ./wrf.exe" That's a (somewhat educated) guess... -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI users] About openmpi-mpirun
Hi, Thanks for your reply. Yes, your mpirun command works for me. But I need to use the bsub job scheduler. I wonder why bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" doesn't work. Cheers, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Romaric David Sent: 17 December 2009 13:58 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun Min Zhu wrote: > Hi, > > I executed > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; > ./wrf.exe" > Hello, Here we use: mpirun /bin/tcsh -c " limit stacksize unlimited ; /full/path/to/wrf.exe" Regards, R. David ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] About openmpi-mpirun
Min Zhu wrote: Hi, I executed bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" Hello, Here we use: mpirun /bin/tcsh -c " limit stacksize unlimited ; /full/path/to/wrf.exe" Regards, R. David
Re: [OMPI users] About openmpi-mpirun
Hi, I executed bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" Here are the results; it seems that wrf.exe didn't run. The ERR file is empty. Sender: LSF System Subject: Job 647: Done Job was submitted from host by user . Job was executed on host(s) <8*compute-10>, in queue , as user . <8*compute-11> was used as the home directory. was used as the working directory. Started at Thu Dec 17 15:27:18 2009 Results reported at Thu Dec 17 15:27:27 2009 Your job looked like: # LSBATCH: User input openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" Successfully completed. Resource usage summary: CPU time : 0.05 sec. The output (if any) follows: unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited unlimited PS: Read file for stderr output of this job. - Thanks, Min Zhu -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Romaric David Sent: 17 December 2009 13:17 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpirun Min Zhu wrote: > Dear all, > > I have a question to ask you. If I issue a command "bsub -n 16 > openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes)cluster, the job > failed to run due to stacksize problem. Openmpi-mpirun is a wrapper > script written by Platform. If I want to add '/bin/sh -c ulimit -s > unlimited' to the above bsub command, how can I do it? Thank you very > much in advance, > Hello, you should add it before wrf.exe: openmpi-mpirun "sh -c ulimit -s unlimited ; ./wrf.exe" Regards, R. David ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains information that may be confidential, and is protected by copyright. It is directed to the intended recipient(s) only. 
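The sixteen bare "unlimited" lines in the job output above are exactly what an unquoted sh -c produces: sh takes only the first word after -c as its command string, so each process ran a plain ulimit (which prints a limit) and ./wrf.exe was never reached. A minimal reproduction, with echo standing in for wrf.exe (no LSF needed):

```shell
# Unquoted: sh -c receives only "ulimit" as its command string; "-s" and
# "unlimited" become the inner shell's $0/$1, and the application part of
# the line is never started.
broken=$(/bin/sh -c ulimit -s unlimited)
echo "broken run printed: $broken"

# Quoted: the whole string is one -c argument, so both commands run.
# (stderr is suppressed in case raising the stack limit is not permitted)
fixed=$(/bin/sh -c 'ulimit -s unlimited; echo wrf-started' 2>/dev/null)
echo "fixed run printed: $fixed"
```

The broken form prints one limit value per rank and nothing else, matching the sixteen "unlimited" lines in the LSF report.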
Re: [OMPI users] About openmpi-mpirun
Min Zhu wrote: Dear all, I have a question to ask you. If I issue a command "bsub -n 16 openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes) cluster, the job failed to run due to a stacksize problem. Openmpi-mpirun is a wrapper script written by Platform. If I want to add '/bin/sh -c ulimit -s unlimited' to the above bsub command, how can I do it? Thank you very much in advance, Hello, you should add it before wrf.exe: openmpi-mpirun "sh -c ulimit -s unlimited ; ./wrf.exe" Regards, R. David
[OMPI users] About openmpi-mpirun
Dear all, I have a question to ask you. If I issue the command "bsub -n 16 openmpi-mpirun ./wrf.exe" on my 128-core (16-node) cluster, the job fails to run due to a stacksize problem. Openmpi-mpirun is a wrapper script written by Platform. If I want to add '/bin/sh -c ulimit -s unlimited' to the above bsub command, how can I do it? Thank you very much in advance, Cheers, Min Zhu
[OMPI users] MPI-IO, providing buffers
Hi all, I have a question. I'm starting to use MPI-IO and was wondering whether I can use MPI_BUFFER_ATTACH to provide the necessary I/O buffer (or will it use the array I'm passing to MPI_File_write...?). many thanks, Ricardo Reis 'Non Serviam' PhD candidate @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt Cultural Instigator @ Rádio Zero http://www.radiozero.pt Keep them Flying! Ajude a/help Aero Fénix! http://www.aeronauta.com/aero.fenix http://www.flickr.com/photos/rreis/
Re: [OMPI users] error performing MPI_Comm_spawn
Very good news! I will wait for the release :) Thanks, Ralph márcia. On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain wrote: > Ah crumb - I found the problem. Sigh. > > I actually fixed this in the trunk over 5 months ago when the problem first > surfaced in my own testing, but it never came across to the stable release > branch. The problem is that we weren't serializing the comm_spawn requests, > and so the launch system gets confused over what has and hasn't completed > launch. That's why it works fine on the trunk. > > I'm creating the 1.4 patch right now. Thanks for catching this. Old brain > completely forgot until I started tracking the commit history and found my > own footprints! > > Ralph > > On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote: > > Hi Ralph, > > I am afraid I have been a little hasty! > I redid my tests with more care and I got the same error with the > 1.3.3 as well :-/ > but in that version the error happens only after some successful executions... > which is why I did not notice it before! > Furthermore, I increased the number of levels of the tree (that means > more concurrent dynamic process creations at the lower levels) and I never > manage to execute without error, unless I add the delay. > Perhaps the problem might even be a race condition :( > > I tested with LAM/MPI 7.1.4 and at first it works fine. I have worked > with LAM for years, but I migrated to Open MPI last year since LAM will be > discontinued... > > I think that I can continue the development of my application adding the > delay, while I wait for a release... and I leave the performance tests to be > made in the future :) > > Thank you again Ralph, > márcia. > > > On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain wrote: > >> Okay, I can replicate this. >> >> FWIW: your test program works fine with the OMPI trunk and 1.3.3. It only >> has a problem with 1.4. 
Since I can replicate it on multiple machines every >> single time, I don't think it is actually a race condition. >> >> I think someone made a change to the 1.4 branch that created a failure >> mode :-/ >> >> Will have to get back to you on this - may take awhile, and won't be in >> the 1.4.1 release. >> >> Thanks for the replicator! >> >> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote: >> >> Thank you, Ralph >> >> I will use the 1.3.3 for now... >> while waiting for a future fix release that breaks this race condition. >> >> márcia >> >> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain wrote: >> >>> Looks to me like it is a race condition, and the timing between 1.3.3 and >>> 1.4 is just enough to trip it. I can break the race, but it will have to be >>> in a future fix release. >>> >>> Meantime, your best bet is to either stick with 1.3.3 or add the delay. >>> >>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote: >>> >>> Hi, >>> >>> I intend to develop an application using the MPI_Comm_spawn to create >>> dynamically new MPI tasks (or processes). >>> The structure of the program is like a tree: each node creates 2 new ones >>> until reaches a predefined number of levels. >>> >>> I developed a small program to explain my problem as can be seen in >>> attachment. >>> -- start.c: launches (through MPI_Comm_spawn, in which the argv has the >>> level value) the root of the tree (a ch_rec program). Afterward spawn, a >>> message is sent to child and the process block in an MPI_Recv. >>> -- ch_rec.c: gets its level value and receives the parent message, then >>> if its level is less than a predefined limit, it will creates 2 children: >>> - set the level value; >>> - spawn 1 child; >>> - send a message; >>> - call an MPI_Irecv; >>> - repeat the 4 previous steps for the second child; >>> - call an MPI_Waitany waiting for children returns. >>> When children messages are received, the process send a message to its >>> parent and call MPI_Finalize. 
>>> >>> Using the openmpi-1.3.3 version the program runs as expected but with >>> openmpi-1.4 I get the following error: >>> >>> $ mpirun -np 1 start >>> level 0 >>> level = 1 >>> Parent sent: level 0 (pid:4279) >>> level = 2 >>> Parent sent: level 1 (pid:4281) >>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0] >>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758 >>> >>> The error happens when my program try to launch the second child >>> immediately after the first spawn call. >>> In my tests I try to put an sleep of 2 second between the first and the >>> second spawn, and then the program runs as expected. >>> >>> Some one can help me with this version 1.4 bug? >>> >>> thanks, >>> márcia. >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listi
Re: [OMPI users] Debugging spawned processes
Yeah, now that you mention it, I remember (old brain here as well).
But IIRC you created an OMPI version which was called 1.4a1r or something,
where I indeed could use this xterm. When I updated to 1.3.2, I sort of
forgot about it again...

Another question though: you said "If it includes the -xterm option,
then that option gets applied to the dynamically spawned procs too".
Does this passing-on also apply to the -x options?

Thanks
Jody

On Wed, Dec 16, 2009 at 3:42 PM, Ralph Castain wrote:
> It is in a later version - pretty sure it made 1.3.3. IIRC, I added it at
> your request :-)
>
> On Dec 16, 2009, at 7:20 AM, jody wrote:
>
>> Thanks for your reply
>>
>> That sounds good. I have Open MPI version 1.3.2, and mpirun does not
>> seem to recognize the --xterm option:
>> [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf
>> --------------------------------------------------------------------------
>> mpirun was unable to launch the specified application as it could not
>> find an executable:
>>
>> Executable: 1
>> Node: aim-plankton.uzh.ch
>>
>> while attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> (If I reverse the --xterm and -np 1, it complains about not finding
>> executable '9'.)
>> Do I need to install a higher version, or is this something I'd have
>> to set as an option in configure?
>>
>> Thank you
>> Jody
>>
>> On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain wrote:
>>> Depends on the version you are working with. If it includes the -xterm
>>> option, then that option gets applied to the dynamically spawned procs
>>> too, so this should be automatically taken care of... but in that case,
>>> you wouldn't need your script to open an xterm anyway. You would just do:
>>>
>>> mpirun --xterm -np 5 gdb ./my_app
>>>
>>> or the equivalent. You would then comm_spawn an argv[0] of "gdb", with
>>> argv[1] being your target app.
>>>
>>> I don't know how to avoid including that "gdb" in the comm_spawn argvs -
>>> I once added an mpirun cmd line option to automatically add it, but got
>>> loudly told to remove it.
>>> Of course, it should be easy to pass an option
>>> to your app itself that tells it whether or not to do so!
>>>
>>> HTH
>>> Ralph
>>>
>>> On Dec 16, 2009, at 4:06 AM, jody wrote:
>>>
>>> Hi
>>>
>>> Until now I always wrote applications for which the number of processes
>>> was given on the command line with -np. To debug these applications I
>>> wrote a script, run_gdb.sh, which basically opens an xterm and starts
>>> gdb in it for my application. This allowed me to have a window for each
>>> of the processes being debugged.
>>>
>>> Now, however, I am writing my first application in which additional
>>> processes are spawned. My question is: how can I open xterm windows in
>>> which gdb runs for the spawned processes?
>>>
>>> The only way I can think of is to pass my script run_gdb.sh in the argv
>>> parameters of MPI_Comm_spawn. Would this be correct? If yes, what about
>>> other parameters passed to the spawning process, such as environment
>>> variables passed via -x? Are they passed to the spawned processes as
>>> well? In my case this would be necessary so that processes on other
>>> machines get the $DISPLAY environment variable in order to display
>>> their xterms with gdb on my workstation.
>>>
>>> Another drawback would be the need to change the argv parameters every
>>> time one switches between debugging and normal running.
>>>
>>> Has anybody got some hints on how to debug spawned processes?
>>>
>>> Thank you
>>> Jody
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
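Ralph's suggestion - spawning gdb as the command, with the real program as gdb's argument, gated by an application option - might be sketched as follows. This is a hypothetical illustration: the name `my_app`, the `debug` flag, and the single-process spawn count are all assumptions, not anyone's actual code.

```c
/* Sketch of the approach described above: when debugging, spawn "gdb"
 * as the MPI_Comm_spawn command with the target app as its argument;
 * otherwise spawn the app directly. Note the --xterm and -x settings
 * come from the mpirun command line and are applied by the runtime,
 * not by this code. */
#include <mpi.h>

MPI_Comm spawn_child(int debug)
{
    MPI_Comm child;
    if (debug) {
        /* "gdb" becomes the spawned command; the real program is
         * gdb's first argument */
        char *child_argv[] = { "./my_app", NULL };
        MPI_Comm_spawn("gdb", child_argv, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
    } else {
        /* normal run: spawn the program itself, no extra argv */
        MPI_Comm_spawn("./my_app", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
    }
    return child;
}
```

Keeping the toggle in one helper like this avoids editing the argv parameters by hand every time one switches between debugging and normal running, which was Jody's other concern.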
Re: [OMPI users] Pointers for understanding failure messages on NetBSD
> You could confirm that it is the IPv6 loop by simply disabling IPv6
> support - configure with --disable-ipv6 and see if you still get the
> error messages.
>
> Thanks for continuing to pursue this!
> Ralph

Yeah, but if you disable the IPv6 stuff then a completely different path
is taken through the routine in question. It just does a simple:

  inet_ntoa(((struct sockaddr_in*) addr)->sin_addr);

as opposed to the

  error = getnameinfo(addr, addrlen, name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);

that gets invoked, for EVERY addr, if you do enable IPv6 - which seems a
little odd.

Actually, looking at the logic of how execution flows through that
routine, it's not clear that an IPv6 addr on NetBSD should get into the
code where the getnameinfo() is called anyway! A comment just above
there even says:

  /* hotfix for netbsd: on my netbsd machine, getnameinfo
     returns an unkown error code. */

Turning off IPv6 therefore doesn't actually tickle the previous error in
a different way for IPv4; it simply routes around it altogether. I still
reckon that the "sa_len = 0" is the key here, and that it must be being
set elsewhere when the addr that is blowing things up is allowed through.

I'll try recompiling the PETSc/PISM stuff on top of an --disable-ipv6
OpenMPI and see what happens. Certainly, when running the SkaMPI tests
the error messages go away - but then the code they appear in isn't
being compiled anymore, so one might expect that!

--
Kevin M. Buckley                       Room:  CO327
School of Engineering and              Phone: +64 4 463 5971
Computer Science
Victoria University of Wellington
New Zealand