[OMPI users] Unable to run WRF on large core counts (1024+), queue pair error

2009-12-17 Thread Craig Tierney

I am trying to run WRF on 1024 cores with OpenMPI 1.3.3 and
1.4.  I can get the code to run with 512 cores, but it crashes
at startup on 1024 cores.  I am getting the following error message:

[n172][[43536,1],0][connect/btl_openib_connect_oob.c:463:qp_create_one] error 
creating qp errno says Cannot allocate memory
[n172][[43536,1],0][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in 
endpoint reply start connect

From Google, I have tried changing the settings for btl_openib_receive_queues,
but my attempts have not worked.  Here was my latest try at reducing the
total number of queue pairs:

mpirun -np 1024 \
   -mca btl_openib_receive_queues P,128,2048,128,128:S,65536,256,192,128 \
   wrf.exe

These settings did not help.
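One variant I have not tried yet: if the HCAs and the OFED build support XRC,
an all-"X" receive queue specification shares queue pairs across processes on
the same remote node, which should cut the total QP count much more than
shrinking the per-peer queues does.  The sizes below are purely illustrative,
not tuned values:

mpirun -np 1024 \
   -mca btl_openib_receive_queues X,128,256,192,128:X,65536,256,128,32 \
   wrf.exe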

Am I looking in the right place?

System setup:
Centos-5.3
Ofed-1.4.1
Intel Compiler 11.1.038
Openmpi-1.3.3 and 1.4

Build options:

./configure CC=icc CXX=icpc F77=ifort F90=ifort FC=ifort --prefix=/opt/openmpi/1.3.3-intel --without-sge --with-openib --enable-io-romio 
--with-io-romio-flags=--with-file-system=lustre --with-pic


Thanks,
Craig


Re: [OMPI users] NetBSD OpenMPI - SGE - PETSc - PISM

2009-12-17 Thread Jeff Squyres
On Dec 17, 2009, at 5:55 PM,  wrote:

> I am happy to be able to inform you that the problems we were
> seeing would seem to have been arising down at the OpenMPI
> level.

Happy for *them*, at least.  ;-)

> If I remove any acknowledgement of IPv6 within the OpenMPI
> code, then both the PETSc examples and PISM application
> have been seen to be running upon my initial 8-processor
> parallel environment when submitted as a Sun Grid Engine
> job.

Ok, that's good.

> I guess this means that the PISM and PETSc guys can "stand easy"
> whilst the OpenMPI community needs to follow up on why there's
> a "addr.sa_len=0" creeping through the interface inspection
> code (upon NetBSD at least) when it passes thru the various
> IPv6 stanzas.

Ok.  We're still somewhat at a loss here, because we don't have any NetBSD to 
test on.  :-(  We're happy to provide any help that we can, and just like you, 
we'd love to see this problem resolved -- but NetBSD still isn't on any of our 
core competency lists.  :-(

FWIW, we might want to move this discussion to the de...@open-mpi.org mailing 
list...

-- 
Jeff Squyres
jsquy...@cisco.com




[OMPI users] NetBSD OpenMPI - SGE - PETSc - PISM

2009-12-17 Thread Kevin . Buckley
A whole swathe of people have been made aware of the issues
that have arisen as a result of a researcher here looking to
run PISM, which sits on top of PETSc, which sits on top of
OpenMPI.

I am happy to be able to inform you that the problems we were
seeing would seem to have been arising down at the OpenMPI
level.

If I remove any acknowledgement of IPv6 within the OpenMPI
code, then both the PETSc examples and PISM application
have been seen to be running upon my initial 8-processor
parallel environment when submitted as a Sun Grid Engine
job.

I guess this means that the PISM and PETSc guys can "stand easy"
whilst the OpenMPI community needs to follow up on why there's
a "addr.sa_len=0" creeping through the interface inspection
code (upon NetBSD at least) when it passes thru the various
IPv6 stanzas.


Thanks for all the feedback on this from all the quarters,
Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI users] Notifier Framework howto

2009-12-17 Thread Jeff Squyres
That would be great!

On Dec 17, 2009, at 3:52 PM, Ralph Castain wrote:

> If it would help, I have time and am willing to add notifier calls to this 
> area of the code base. You'll still get the errors shown here as I always 
> bury the notifier call behind the error check that surrounds these error 
> messages to avoid impacting the critical path, but you would be able to get 
> syslog messages (or whatever channel you choose) as well.
> 
> Ralph
> 
> On Dec 10, 2009, at 4:19 PM, Jeff Squyres wrote:
> 
> > On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
> >
> >> I would like to try out the notifier framework, problem is I am having  
> >> trouble finding documentation for it,  I am digging around the website  
> >> and not finding much.
> >>
> >> Currently we have a problem where hosts are throwing up errors like:
> >> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:
> >> 631:mca_btl_tcp_endpoint_complete_connect] connect() failed:
> >> Connection timed out (110)
> >
> > Yoinks.  Any idea why this is happening?
> >
> >> We would like to be notified when this happens, so we can put time
> >> stamps on events going on on the network.  Is this even possible with
> >> the framework?  See, we don't show any interfaces coming up and down,
> >> or any errors on interfaces, so we are looking to isolate the problem
> >> more.  Only the MPI library knows when this happens.
> >
> > It's not well documented.  So let's start here...
> >
> > The first issue is that we currently only have notifier calls down in the 
> > openib BTL -- not any of the others.  :-(  We put it there because there 
> > were specific requests to be notified when IB links went down.  We then used 
> > those as a request for comment from the community, asking "do you like 
> > this? do you want more?"  We kinda got nothing back, and I'll admit that we 
> > kinda forgot about it -- and therefore never added notifier calls elsewhere 
> > in the code.  :-\
> >
> > We designed the notifier in OMPI to be trivially easy to use throughout the 
> > code base -- it's just adding a single function call where the error 
> > occurs.  Would you, perchance, be interested in adding any of these in the 
> > TCP BTL?  I'd be happy to point you in the right direction... :-)
> >
> > After that, it's just a matter of enabling a notifier:
> >
> >  mpirun --mca notifier syslog ...
> >
> > Each notifier has some MCA params that are fairly obvious -- use:
> >
> >  ompi_info --param notifier all
> >
> > to see them.  There's 3 notifier plugins:
> >
> > - command: execute any arbitrary command.  It must run in finite (short) 
> > time.  You use MCA params to set the command (we can pass some strings down 
> > to the command; see the ompi_info help string for more details), and set a 
> > timeout such that if the command runs for that many seconds without 
> > exiting, we'll kill it.
> >
> > - syslog: because it was simple to do -- we just output a string to the 
> > syslog.
> >
> > - twitter: because it was fun to do.  ;-)  Actually, the rationale was that 
> > you can tweet to a private feed and then slave an RSS reader to it to see 
> > if anything happens.  It will need to be able to reach the general internet 
> > (i.e., twitter.com); proxies are not supported.  Set your twitter 
> > username/password via MCA params.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
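As a concrete starting point, the two commands above amount to something like
the following sketch (check the exact component and parameter names against
your own ompi_info output rather than trusting this; the application name is
just a placeholder):

  # list the notifier components and their MCA parameters
  ompi_info --param notifier all

  # run with the syslog notifier enabled
  mpirun --mca notifier syslog -np 16 ./my_app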




Re: [OMPI users] Notifier Framework howto

2009-12-17 Thread Ralph Castain
If it would help, I have time and am willing to add notifier calls to this area 
of the code base. You'll still get the errors shown here as I always bury the 
notifier call behind the error check that surrounds these error messages to 
avoid impacting the critical path, but you would be able to get syslog messages 
(or whatever channel you choose) as well.

Ralph

On Dec 10, 2009, at 4:19 PM, Jeff Squyres wrote:

> On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
> 
>> I would like to try out the notifier framework, problem is I am having  
>> trouble finding documentation for it,  I am digging around the website  and 
>> not finding much.
>> 
>> Currently we have a problem where hosts are throwing up errors like:
>> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:
>> 631:mca_btl_tcp_endpoint_complete_connect] connect() failed: 
>> Connection timed out (110)
> 
> Yoinks.  Any idea why this is happening?
> 
>> We would like to be notified when this happens, so we can put time 
>> stamps on events going on on the network.  Is this even possible with 
>> the framework?  See, we don't show any interfaces coming up and down, 
>> or any errors on interfaces, so we are looking to isolate the problem 
>> more.  Only the MPI library knows when this happens.
> 
> It's not well documented.  So let's start here...
> 
> The first issue is that we currently only have notifier calls down in the 
> openib BTL -- not any of the others.  :-(  We put it there because there were 
> specific requests to be notified when IB links went down.  We then used those 
> as a request for comment from the community, asking "do you like this? do you 
> want more?"  We kinda got nothing back, and I'll admit that we kinda forgot 
> about it -- and therefore never added notifier calls elsewhere in the code.  
> :-\
> 
> We designed the notifier in OMPI to be trivially easy to use throughout the 
> code base -- it's just adding a single function call where the error occurs.  
> Would you, perchance, be interested in adding any of these in the TCP BTL?  
> I'd be happy to point you in the right direction... :-)
> 
> After that, it's just a matter of enabling a notifier:
> 
>  mpirun --mca notifier syslog ...
> 
> Each notifier has some MCA params that are fairly obvious -- use:
> 
>  ompi_info --param notifier all
> 
> to see them.  There's 3 notifier plugins:
> 
> - command: execute any arbitrary command.  It must run in finite (short) 
> time.  You use MCA params to set the command (we can pass some strings down 
> to the command; see the ompi_info help string for more details), and set a 
> timeout such that if the command runs for that many seconds without exiting, 
> we'll kill it.
> 
> - syslog: because it was simple to do -- we just output a string to the 
> syslog.
> 
> - twitter: because it was fun to do.  ;-)  Actually, the rationale was that 
> you can tweet to a private feed and then slave an RSS reader to it to see if 
> anything happens.  It will need to be able to reach the general internet 
> (i.e., twitter.com); proxies are not supported.  Set your twitter 
> username/password via MCA params.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] error performing MPI_Comm_spawn

2009-12-17 Thread Ralph Castain
Will be in the 1.4 nightly tarball generated later tonight...
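If you want to grab it directly, a rough sketch would be (the file name below
is only a placeholder -- take the real snapshot from the nightly download area
on the Open MPI web site, and adjust the prefix to taste):

  wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4.1aX-rXXXXX.tar.bz2   # placeholder URL
  tar xjf openmpi-1.4.1aX-rXXXXX.tar.bz2 && cd openmpi-1.4.1aX-rXXXXX
  ./configure --prefix=$HOME/ompi-1.4-nightly
  make -j4 install
  $HOME/ompi-1.4-nightly/bin/mpirun -np 1 start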

Thanks again
Ralph

On Dec 17, 2009, at 4:07 AM, Marcia Cristina Cera wrote:

> very good news 
> I will wait carefully for the release :)
> 
> Thanks, Ralph
> márcia.
> 
> On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain  wrote:
> Ah crumb - I found the problem. Sigh.
> 
> I actually fixed this in the trunk over 5 months ago when the problem first 
> surfaced in my own testing, but it never came across to the stable release 
> branch. The problem is that we weren't serializing the comm_spawn requests, 
> and so the launch system gets confused over what has and hasn't completed 
> launch. That's why it works fine on the trunk.
> 
> I'm creating the 1.4 patch right now. Thanks for catching this. Old brain 
> completely forgot until I started tracking the commit history and found my 
> own footprints!
> 
> Ralph
> 
> On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote:
> 
>> Hi Ralph,
>> 
>> I am afraid I have been a little hasty!
>> I redid my tests more carefully and I got the same error with 1.3.3 as 
>> well :-/
>> but in that version the error happens only after some successful executions... 
>> which is why I did not notice it before!
>> Furthermore, I increased the number of levels of the tree (which means more 
>> concurrent dynamic process creations in the lower levels) and I never manage 
>> to execute without error unless I add the delay. 
>> Perhaps the problem might even be a race condition :(
>> 
>> I tested with LAM/MPI 7.1.4 and at first it works fine. I have worked 
>> with LAM for years, but I migrated to Open MPI last year since LAM will be 
>> discontinued... 
>> 
>> I think that I can continue the development of my application by adding the 
>> delay while I wait for a release... and leave the performance tests to be 
>> done in the future :)
>> 
>> Thank you again Ralph,
>> márcia.
>> 
>> 
>> On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain  wrote:
>> Okay, I can replicate this.
>> 
>> FWIW: your  test program works fine with the OMPI trunk and 1.3.3. It only 
>> has a problem with 1.4. Since I can replicate it on multiple machines every 
>> single time, I don't think it is actually a race condition.
>> 
>> I think someone made a change to the 1.4 branch that created a failure mode 
>> :-/
>> 
>> Will have to get back to you on this - may take awhile, and won't be in the 
>> 1.4.1 release.
>> 
>> Thanks for the replicator!
>> 
>> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
>> 
>>> Thank you, Ralph
>>> 
>>> I will use the 1.3.3 for now... 
>>> while waiting for a future fix release that breaks this race condition.
>>> 
>>> márcia
>>> 
>>> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain  wrote:
>>> Looks to me like it is a race condition, and the timing between 1.3.3 and 
>>> 1.4 is just enough to trip it. I can break the race, but it will have to be 
>>> in a future fix release.
>>> 
>>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>>> 
>>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>>> 
 Hi,
 
 I intend to develop an application that uses MPI_Comm_spawn to create 
 new MPI tasks (or processes) dynamically. 
 The structure of the program is like a tree: each node creates 2 new ones 
 until a predefined number of levels is reached.
 
 I developed a small program to explain my problem, as can be seen in the 
 attachment.
 -- start.c: launches (through MPI_Comm_spawn, with the level value in 
 argv) the root of the tree (a ch_rec program). After the spawn, a 
 message is sent to the child and the process blocks in an MPI_Recv.
 -- ch_rec.c: gets its level value and receives the parent message; then, if 
 its level is less than a predefined limit, it will create 2 children: 
 - set the level value;
 - spawn 1 child;
 - send a message;
 - call MPI_Irecv;
 - repeat the 4 previous steps for the second child;
 - call MPI_Waitany, waiting for the children to return.
 When the children's messages are received, the process sends a message to its 
 parent and calls MPI_Finalize.
 
 Using the openmpi-1.3.3 version the program runs as expected but with 
 openmpi-1.4 I get the following error:
 
 $ mpirun -np 1 start
 level 0
 level = 1
 Parent sent: level 0 (pid:4279)
 level = 2
 Parent sent: level 1 (pid:4281)
 [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0] 
 ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 
 758
 
 The error happens when my program tries to launch the second child 
 immediately after the first spawn call. 
 In my tests I tried putting a sleep of 2 seconds between the first and the 
 second spawn, and then the program runs as expected.
 
 Can someone help me with this 1.4 bug? 
 
 thanks,
 márcia.
 
>

Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-17 Thread Nicolas Bock
Ok, I'll give it a try.

Thanks, nick


On Thu, Dec 17, 2009 at 12:44, Ralph Castain  wrote:

> In case you missed it, this patch should be in the 1.4 nightly tarballs -
> feel free to test and let me know what you find.
>
> Thanks
> Ralph
>
> On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote:
>
> That was quick. I will try the patch as soon as you release it.
>
> nick
>
>
> On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:
>
>> Patch is built and under review...
>>
>> Thanks again
>> Ralph
>>
>> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
>>
>> Thanks
>>
>> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>>
>>> Yeah, that's the one all right! Definitely missing from 1.3.x.
>>>
>>> Thanks - I'll build a patch for the next bug-fix release
>>>
>>>
>>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>>>
>>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain 
>>> wrote:
>>> >> Indeed - that is very helpful! Thanks!
>>> >> Looks like we aren't cleaning up high enough - missing the directory
>>> level.
>>> >> I seem to recall seeing that error go by and that someone fixed it on
>>> our
>>> >> devel trunk, so this is likely a repair that didn't get moved over to
>>> the
>>> >> release branch as it should have done.
>>> >> I'll look into it and report back.
>>> >
>>> > You are probably referring to
>>> > https://svn.open-mpi.org/trac/ompi/changeset/21498
>>> >
>>> > There was an issue about orte_session_dir_finalize() not
>>> > cleaning up the session directories properly.
>>> >
>>> > Hope that helps.
>>> >
>>> > Abhishek
>>> >
>>> >> Thanks again
>>> >> Ralph
>>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>>> >>
>>> >>
>>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>>> >>>
>>> >>> Hmm... if you are willing to keep trying, could you perhaps let it
>>> run for
>>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a
>>> process
>>> >>> that has already terminated? The pids will be in order, so just look
>>> for an
>>> >>> early number (not mpirun or the parent, of course).
>>> >>> It would help if you could give us the contents of a directory from a
>>> >>> child process that has terminated - would tell us what subsystem is
>>> failing
>>> >>> to properly cleanup.
>>> >>
>>> >> Ok, so I Ctrl-Z the master. In
>>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>>> >> directory
>>> >>
>>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>>> >>
>>> >> I can't find that PID though. mpirun has PID 4230, orted does not
>>> exist,
>>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it
>>> again,
>>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68,
>>> there
>>> >> are 70 sequentially numbered directories starting at 0. Every
>>> directory
>>> >> contains another directory called "0". There is nothing in any of
>>> those
>>> >> directories. I see for instance:
>>> >>
>>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>>> >> total 4.0K
>>> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>>> >>
>>> >> and
>>> >>
>>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls
>>> -lh
>>> >> 70/0/
>>> >> total 0
>>> >>
>>> >> I hope this information helps. Did I understand your question
>>> correctly?
>>> >>
>>> >> nick
>>> >>
>>> >> ___
>>> >> users mailing list
>>> >> us...@open-mpi.org
>>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> >>
>>> >> ___
>>> >> users mailing list
>>> >> us...@open-mpi.org
>>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> >>
>>> >
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] [visit-developers] /usr/bin/ld: cannot find -lrdmacm on 9184

2009-12-17 Thread tom fogal
Simon Su  writes:
> Hi Tom,
> 
> "hello world" MPI program also won't compile if
> librdmacm-devel-1.0.8-5.el5 is not installed. I have asked the person
> who maintains the openmpi package on how they were compiled. My guess
> is librdmacm-devel-1.0.8-5.el5 may need to be added as a dependency
> package for openmpi010208-gcc-devel-1.2.8-8.cses.5.PU_IAS.5 package
> (where I got my openmpi installation) to solve the problem and to
> verify that we have the correct openmpi compilation.

Yes, I agree, from this message && your last (to OMPI: Simon mentioned
that the issue goes away if he installs librdmacm-devel), it sounds
like librdmacm-devel is a "build-dep", but the openmpi-devel package
needs it as a "dep".

Anyway, I'll leave it up to you to forward the error/conclusion to
whomever your upstream is.  Thanks for digging into this,

-tom
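
FWIW, the checks Jeff suggests below boil down to something like this sketch
(hello.c is any trivial MPI program; the install path is the one from your
link line and may differ):

  # what the wrapper would add
  mpic++ --showme

  # try the link both ways
  mpic++ hello.c -o hello
  g++ hello.c -o hello $(mpic++ --showme:compile) $(mpic++ --showme:link)

  # a plugin (DSO) build keeps rdmacm inside the openib plugin rather than
  # on the application link line; with such a build you should see:
  ls /usr/local/openmpi/1.3.3/gcc/x86_64/lib64/openmpi/mca_btl_openib*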

> On Wed, Dec 16, 2009 at 5:45 PM, tom fogal  wrote:
> 
> > Hi,
> >
> > Jeff sent this reply to our inquiry yesterday.
> >
> > Simon -- can you give it a read?  In particular, validating you've got
> > the right mpic++ sounds like a good idea.  We're also curious if a
> > simple "hello world" MPI program can link using gcc + the flags from
> > mpic++ -show.
> >
> > -tom
> >
> > --- Forwarded Message
> >
> > From: Jeff Squyres 
> > In-Reply-To: 
> > Date: Wed, 16 Dec 2009 17:20:35 -0500
> > To: "Open MPI Users" 
> > Cc: VisIt Developers 
> > Subject: Re: [OMPI users] [visit-developers] /usr/bin/ld: cannot
> >find-lrdmacm on 9184
> >
> > It depends on how Open MPI was built.
> >
> > If Open MPI was built without plugins (i.e., all the plugins are slurped up
> > into libmpi and friends), then yes, applications need to link against
> > librdmacm to use the RDMA CM mode of OpenFabrics transport.
> >
> > If Open MPI was built with plugins (which is the default), then apps don't
> > need to link against librdmacm because the only use of rdmacm is in an Open
> > MPI plugin, and that plugin was linked against librdmacm.
> >
> > Make sense?
> >
> > That being said, the output from mpic++ --showme should give you something
> > that is directly compile-/link-able.  So it is odd if mpic++ is showing you
> > something that can't (or shouldn't?) be done.  Did a -L argument get lost
> > somewhere, perchance?
> >
> > Does linking MPI applications with mpic++ work properly, or does it result
> > in the same error?  If it results in the same error, then perhaps something
> > has changed since Open MPI was installed...?
> >
> > All this being said, two other random points:
> >
> > 1. Ensure that you're using the "right" mpic++.  I.e., make sure it matches
> > the version/installation of Open MPI that you're trying to use.
> >
> > 2. If you don't link with the librdmacm, you're probably not losing any
> > important functionality unless you have an iWarp-based cluster (that's the
> > only transport that *needs* librdmacm).  IB-based networks can use
> > librdmacm, but don't *need* it (it's only used for making initial
> > connections, so using librdmacm or not has no implications on overall MPI
> > performance).  It's still odd that mpic++ wants it and it can't be found,
> > though...
> >
> > Does that help?
> >
> >
> > On Dec 15, 2009, at 11:11 PM, tom fogal wrote:
> >
> > > Simon Su  writes:
> > > > Hi Tom,
> > > >
> > > > I am using the standard openmpi package that run on all the cluster
> > > > machines here at Princeton. So, maybe I shouldn't touch openmpi. But,
> > > > removing -lrdmacm from the MPI_LIBS line in the machinename.conf file
> > > > worked.  Any implication from doing this?
> > >
> > > The only thing it could possibly do is disable RDMA for you.  However,
> > > since removing it did not produce any undefined symbol errors, my guess
> > > is that your OpenMPI isn't using RDMA anyway.
> > >
> > > There might be an OpenMPI bug here, though.  I've cc'd the OpenMPI
> > > community to see if they have any input.  As a summary for them: Simon
> > > is trying to build our MPI-enabled application.  A script which tries
> > > to automate this adds the output of "mpic++ -show".  His build then
> > > failed because it attempted to link against librdmacm, which does not
> > > exist in his normal search paths (or maybe at all).  Is it possible
> > > that `mpic++ -show' includes/adds "-lrdmacm" even when OpenMPI is not
> > > itself using the library?
> > >
> > > Thanks,
> > >
> > > -tom
> > >
> > > > On Tue, Dec 15, 2009 at 8:46 PM, tom fogal 
> > wrote:
> > > >
> > > > > Simon Su  writes:
> > > > > > I am getting this error message while building 9184.
> > > > > [snip]
> > > > > > -lz -lm -ldl  -lpthread -L/usr/local/openmpi/1.3.3/gcc/x86_64/lib64
> > > > > > -lmpi_cxx -lmpi -lopen-rte -lopen-pal -lrdmacm -libverbs -lnuma
> > -ldl
> > > > > -lnsl
> > > > > > -lutil -lm  -lcognomen \
> > > > > > -L/usr/local/openmpi/1.3.3/gcc/x86_64/lib64 -lmpi_cxx -lmpi
> > > > > > -lopen-rte -lopen-pal -lrdmacm -libverbs -lnuma -ldl -lnsl -lutil
> > -lm
> > > > > > -lcognomen
> > > > >

Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-17 Thread Ralph Castain
In case you missed it, this patch should be in the 1.4 nightly tarballs - feel 
free to test and let me know what you find.

Thanks
Ralph

On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote:

> That was quick. I will try the patch as soon as you release it.
> 
> nick
> 
> 
> On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:
> Patch is built and under review...
> 
> Thanks again
> Ralph
> 
> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
> 
>> Thanks
>> 
>> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>> Yeah, that's the one all right! Definitely missing from 1.3.x.
>> 
>> Thanks - I'll build a patch for the next bug-fix release
>> 
>> 
>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>> 
>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
>> >> Indeed - that is very helpful! Thanks!
>> >> Looks like we aren't cleaning up high enough - missing the directory 
>> >> level.
>> >> I seem to recall seeing that error go by and that someone fixed it on our
>> >> devel trunk, so this is likely a repair that didn't get moved over to the
>> >> release branch as it should have done.
>> >> I'll look into it and report back.
>> >
>> > You are probably referring to
>> > https://svn.open-mpi.org/trac/ompi/changeset/21498
>> >
>> > There was an issue about orte_session_dir_finalize() not
>> > cleaning up the session directories properly.
>> >
>> > Hope that helps.
>> >
>> > Abhishek
>> >
>> >> Thanks again
>> >> Ralph
>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>> >>
>> >>
>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>> >>>
>> >>> Hmm... if you are willing to keep trying, could you perhaps let it run 
>> >>> for
>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a process
>> >>> that has already terminated? The pids will be in order, so just look for 
>> >>> an
>> >>> early number (not mpirun or the parent, of course).
>> >>> It would help if you could give us the contents of a directory from a
>> >>> child process that has terminated - would tell us what subsystem is 
>> >>> failing
>> >>> to properly cleanup.
>> >>
>> >> Ok, so I Ctrl-Z the master. In
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>> >> directory
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>> >>
>> >> I can't find that PID though. mpirun has PID 4230, orted does not exist,
>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again,
>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there
>> >> are 70 sequentially numbered directories starting at 0. Every directory
>> >> contains another directory called "0". There is nothing in any of those
>> >> directories. I see for instance:
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>> >> total 4.0K
>> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>> >>
>> >> and
>> >>
>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh
>> >> 70/0/
>> >> total 0
>> >>
>> >> I hope this information helps. Did I understand your question correctly?
>> >>
>> >> nick
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min,

I've had a chance to use the openmpi-mpirun script.
Unfortunately I do not have Open MPI available at the moment
(the customer where I'm at runs RHEL4 with LAM/MPI), but I've been able to see
what quotes need to be used, and this one seems to work on my end:

 bsub -e ERR -o OUT -n 16 './openmpi-mpirun /bin/sh -c "ulimit -s
unlimited ; ./wrf.exe " '

Could you give this a try?

Regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 5:56 PM, Min Zhu  wrote:
> Hi, Jeroen,
>
> Thanks a lot. Unfortunately I don't think I have got mpirun.lsf. The Dell 
> company only asked us to use openmpi-mpirun as a wrapper script.
>
> Cheers,
>
> Min Zhu
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 16:49
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> Hi Min,
>
> Sorry for the mixup but it's been a while since I've actually used LSF.
> I've had a look at my notes and to use mpirun with LSF you should give
> something like this:
>
> bsub -a openmpi -n 16 "mpirun.lsf  -x PATH -x LD_LIBRARY_PATH -x
> MPI_BUFFER_SIZE  \" ulimit -s unlimited ; ./wrf.exe \" "
>
> (at least I think you should :) )
>
> The mpirun.lsf is a wrapper provided by LSF and the -a openmpi tells
> it to set the necessary openmpi environment variables etc.
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 5:32 PM, Min Zhu  wrote:
>> Hi,
>>
>> This time the OUT file is
>> 
>> Sender: LSF System 
>> Subject: Job 667: > ./wrf.exe ' " > Exited
>>
>> Job  was 
>> submitted from host  by user .
>> Job was executed on host(s) <8*compute-01>, in queue , as user .
>>                            <8*compute-12>
>>  was used as the home directory.
>>  was used as the working directory.
>> Started at Thu Dec 17 18:27:58 2009
>> Results reported at Thu Dec 17 18:28:04 2009
>>
>> Your job looked like:
>>
>> 
>> # LSBATCH: User input
>> openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' "
>> 
>>
>> Exited with exit code 2.
>>
>> Resource usage summary:
>>
>>    CPU time   :      0.07 sec.
>>
>> The output (if any) follows:
>>
>>
>>
>> PS:
>>
>> Read file  for stderr output of this job.
>> -
>>
>> ERR file is,
>> 
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> -s: -c: line 0: unexpected EOF while looking for matching `''
>> -s: -c: line 1: syntax error: unexpected end of file
>> --
>>
>> Cheers,
>>
>> Min Zhu
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>> Behalf Of Jeroen Kleijer
>> Sent: 17 December 2009 16:24
>> To: Open MPI Users
>> Subject: Re: [OMPI users] About openmpi-mpirun
>>
>> Hi Min,
>>
>> Seems like the command ulimit was executed 16 times but after that
>> (the "; ./wrf.exe") was ignored.
>> Could you give the following a try:
>>
>> bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s
>> unlimited ; ./wrf.exe ' \" "
>>
>> Somewhere, quoting g

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeroen,

Thanks a lot. Unfortunately I don't think I have got mpirun.lsf. The Dell 
company only asked us to use openmpi-mpirun as a wrapper script.

Cheers,

Min Zhu


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeroen Kleijer
Sent: 17 December 2009 16:49
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

Hi Min,

Sorry for the mixup but it's been a while since I've actually used LSF.
I've had a look at my notes and to use mpirun with LSF you should give
something like this:

bsub -a openmpi -n 16 "mpirun.lsf  -x PATH -x LD_LIBRARY_PATH -x
MPI_BUFFER_SIZE  \" ulimit -s unlimited ; ./wrf.exe \" "

(at least I think you should :) )

The mpirun.lsf is a wrapper provided by LSF and the -a openmpi tells
it to set the necessary openmpi environment variables etc.

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 5:32 PM, Min Zhu  wrote:
> Hi,
>
> This time the OUT file is
> 
> Sender: LSF System 
> Subject: Job 667:  ./wrf.exe ' " > Exited
>
> Job  was 
> submitted from host  by user .
> Job was executed on host(s) <8*compute-01>, in queue , as user .
>                            <8*compute-12>
>  was used as the home directory.
>  was used as the working directory.
> Started at Thu Dec 17 18:27:58 2009
> Results reported at Thu Dec 17 18:28:04 2009
>
> Your job looked like:
>
> 
> # LSBATCH: User input
> openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' "
> 
>
> Exited with exit code 2.
>
> Resource usage summary:
>
>    CPU time   :      0.07 sec.
>
> The output (if any) follows:
>
>
>
> PS:
>
> Read file  for stderr output of this job.
> -
>
> ERR file is,
> 
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> --
>
> Cheers,
>
> Min Zhu
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 16:24
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> Hi Min,
>
> Seems like the command ulimit was executed 16 times but after that
> (the "; ./wrf.exe") was ignored.
> Could you give the following a try:
>
> bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s
> unlimited ; ./wrf.exe ' \" "
>
> Somewhere, quoting goes wrong and I'm trying to figure out where
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu  wrote:
>> Hi, Jeroen,
>>
>> Here is the OUT file, ERR file is empty.
>>
>> --
>> Sender: LSF System 
>> Subject: Job 662: > ' > Done
>>
>> Job  was 
>> submitted from host  by user .
>> Job was executed on host(s) <8*compute-10>, in queue , as user .
>>                            <8*compute-11>
>>  was used as the home directory.
>>  was used as the working directory.
>> Started at Thu Dec 17 17:37

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min,

Sorry for the mixup but it's been a while since I've actually used LSF.
I've had a look at my notes and to use mpirun with LSF you should give
something like this:

bsub -a openmpi -n 16 "mpirun.lsf  -x PATH -x LD_LIBRARY_PATH -x
MPI_BUFFER_SIZE  \" ulimit -s unlimited ; ./wrf.exe \" "

(at least I think you should :) )

The mpirun.lsf is a wrapper provided by LSF and the -a openmpi tells
it to set the necessary openmpi environment variables etc.

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 5:32 PM, Min Zhu  wrote:
> Hi,
>
> This time the OUT file is
> 
> Sender: LSF System 
> Subject: Job 667:  ./wrf.exe ' " > Exited
>
> Job  was 
> submitted from host  by user .
> Job was executed on host(s) <8*compute-01>, in queue , as user .
>                            <8*compute-12>
>  was used as the home directory.
>  was used as the working directory.
> Started at Thu Dec 17 18:27:58 2009
> Results reported at Thu Dec 17 18:28:04 2009
>
> Your job looked like:
>
> 
> # LSBATCH: User input
> openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' "
> 
>
> Exited with exit code 2.
>
> Resource usage summary:
>
>    CPU time   :      0.07 sec.
>
> The output (if any) follows:
>
>
>
> PS:
>
> Read file  for stderr output of this job.
> -
>
> ERR file is,
> 
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> -s: -c: line 0: unexpected EOF while looking for matching `''
> -s: -c: line 1: syntax error: unexpected end of file
> --
>
> Cheers,
>
> Min Zhu
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 16:24
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> Hi Min,
>
> Seems like the command ulimit was executed 16 times but after that
> (the "; ./wrf.exe") was ignored.
> Could you give the following a try:
>
> bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s
> unlimited ; ./wrf.exe ' \" "
>
> Somewhere, quoting goes wrong and I'm trying to figure out where
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu  wrote:
>> Hi, Jeroen,
>>
>> Here is the OUT file, ERR file is empty.
>>
>> --
>> Sender: LSF System 
>> Subject: Job 662: > ' > Done
>>
>> Job  was 
>> submitted from host  by user .
>> Job was executed on host(s) <8*compute-10>, in queue , as user .
>>                            <8*compute-11>
>>  was used as the home directory.
>>  was used as the working directory.
>> Started at Thu Dec 17 17:37:09 2009
>> Results reported at Thu Dec 17 17:37:15 2009
>>
>> Your job looked like:
>>
>> 
>> # LSBATCH: User input
>> openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe '
>> 
>>
>> Successfully completed.
>>
>> Resource usage summary:
>>
>>    CPU time   :      0.0

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Ashley,

I changed the content of the openmpi-mpirun script as you mentioned, but
wrf.exe is still not executed.

Cheers,

Min Zhu

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ashley Pittman
Sent: 17 December 2009 16:39
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

On Thu, 2009-12-17 at 14:40 +, Min Zhu wrote:

> Here is the content of openmpi-mpirun file, so maybe something needs
to
> be changed?
> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
> usage
> exit -1
> fi
> 
> MYARGS=$*

Shouldn't this be MYARGS=$@ instead?  It'll change the way quoted args are
forwarded to the parallel job.

Ashley,


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
information that may be confidential, and is protected by copyright. It is 
directed to the intended recipient(s) only.  If you have received this e-mail 
in error please e-mail the sender by replying to this message, and then delete 
the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail 
is prohibited.  Any communication of a personal nature in this e-mail is not 
made by or on behalf of any RES group company. E-mails sent or received may be 
monitored to ensure compliance with the law, regulation and/or our policies.



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi,

This time the OUT file is

Sender: LSF System 
Subject: Job 667:  Exited

Job  was 
submitted from host  by user .
Job was executed on host(s) <8*compute-01>, in queue , as user .
<8*compute-12>
 was used as the home directory.
 was used as the working directory.
Started at Thu Dec 17 18:27:58 2009
Results reported at Thu Dec 17 18:28:04 2009

Your job looked like:


# LSBATCH: User input
openmpi-mpirun "/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' "


Exited with exit code 2.

Resource usage summary:

CPU time   :  0.07 sec.

The output (if any) follows:



PS:

Read file  for stderr output of this job.
-

ERR file is,

-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
-s: -c: line 0: unexpected EOF while looking for matching `''
-s: -c: line 1: syntax error: unexpected end of file
--

Cheers,

Min Zhu

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeroen Kleijer
Sent: 17 December 2009 16:24
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

Hi Min,

Seems like the command ulimit was executed 16 times but after that
(the "; ./wrf.exe") was ignored.
Could you give the following a try:

bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s
unlimited ; ./wrf.exe ' \" "

Somewhere, quoting goes wrong and I'm trying to figure out where

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu  wrote:
> Hi, Jeroen,
>
> Here is the OUT file, ERR file is empty.
>
> --
> Sender: LSF System 
> Subject: Job 662:  ' > Done
>
> Job  was 
> submitted from host  by user .
> Job was executed on host(s) <8*compute-10>, in queue , as user .
>                            <8*compute-11>
>  was used as the home directory.
>  was used as the working directory.
> Started at Thu Dec 17 17:37:09 2009
> Results reported at Thu Dec 17 17:37:15 2009
>
> Your job looked like:
>
> 
> # LSBATCH: User input
> openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe '
> 
>
> Successfully completed.
>
> Resource usage summary:
>
>    CPU time   :      0.05 sec.
>
> The output (if any) follows:
>
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
>
>
> PS:
>
> Read file  for stderr output of this job.
>
> --
>
> Thanks,
>
> Min Zhu
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 16:05
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> Hi Min
>
> Did you get any type of error message in the ERR or OUT files?
> I don't have mpirun installed in the environment at the moment but
> giving the

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Ashley Pittman
On Thu, 2009-12-17 at 14:40 +, Min Zhu wrote:

> Here is the content of openmpi-mpirun file, so maybe something needs to
> be changed?
> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
> usage
> exit -1
> fi
> 
> MYARGS=$*

Shouldn't this be MYARGS=$@ instead?  It'll change the way quoted args are
forwarded to the parallel job.
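
A tiny illustration of the difference (just a throwaway sketch, not the
wrapper itself):

#!/bin/sh
# args.sh - print each argument, first via $*, then via "$@"
echo 'via $*:'
for a in $*;   do echo "  [$a]"; done
echo 'via "$@":'
for a in "$@"; do echo "  [$a]"; done

Called as ./args.sh /bin/sh -c 'ulimit -s unlimited; ./wrf.exe', the $*
loop splits the quoted command into separate words, whereas "$@" keeps it
as the single argument /bin/sh was meant to receive.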

Ashley,


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min,

Seems like the command ulimit was executed 16 times but after that
(the "; ./wrf.exe") was ignored.
Could you give the following a try:

bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s
unlimited ; ./wrf.exe ' \" "

Somewhere, quoting goes wrong and I'm trying to figure out where

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 5:09 PM, Min Zhu  wrote:
> Hi, Jeroen,
>
> Here is the OUT file, ERR file is empty.
>
> --
> Sender: LSF System 
> Subject: Job 662:  ' > Done
>
> Job  was 
> submitted from host  by user .
> Job was executed on host(s) <8*compute-10>, in queue , as user .
>                            <8*compute-11>
>  was used as the home directory.
>  was used as the working directory.
> Started at Thu Dec 17 17:37:09 2009
> Results reported at Thu Dec 17 17:37:15 2009
>
> Your job looked like:
>
> 
> # LSBATCH: User input
> openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe '
> 
>
> Successfully completed.
>
> Resource usage summary:
>
>    CPU time   :      0.05 sec.
>
> The output (if any) follows:
>
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
> unlimited
>
>
> PS:
>
> Read file  for stderr output of this job.
>
> --
>
> Thanks,
>
> Min Zhu
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 16:05
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> Hi Min
>
> Did you get any type of error message in the ERR or OUT files?
> I don't have mpirun installed in the environment at the moment but
> giving the following:
>
> bsub -q interq -I "ssh  /bin/sh -c 'ulimit -s unlimited ;
> /bin/hostname ' "
>
> seems to work for me, so I'm kind of curious what the error message is
> you're seeing.
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 4:41 PM, Min Zhu  wrote:
>> Hi, Jeroen,
>>
>> Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 
>> "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe 
>> not executed.
>>
>> Cheers,
>>
>> Min Zhu
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>> Behalf Of Jeroen Kleijer
>> Sent: 17 December 2009 15:34
>> To: Open MPI Users
>> Subject: Re: [OMPI users] About openmpi-mpirun
>>
>> It's just that the "'s on the command line get parsed by LSF / bash
>> (or whatever shell you use)
>>
>> If you wish to use it without the script you can give this a try:
>> bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s
>> unlimited; ./wrf.exe ' "
>>
>> This causes to pass the whole string "openmpi-mpirun " to be
>> passed as a single string / command to LSF.
>> The second line between the single quotes is then passed as a single
>> argument to /bin/sh which is run by openmpi-mpirun.
>>
>> Kind regards,
>>
>> Jeroen Kleijer
>>
>> On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu  wrote:
>>> Hi, Jeff,
>>>
>>> Your script method works for me. Thank you very much,
>>>
>>> Cheers,
>>>
>>> Min Zhu
>>>
>>>
>>> -Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Jeff Squyres
>>> Sent: 17 December 2009 14:56
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] About openmpi-mpirun
>>>
>>> This might be something you need to talk to Platform about...?
>>>
>>> Another option would be to openmpi-mpirun a script that is just a few
>>> lines long:
>>>
>>> #!/bin/sh
>>> ulimit -s unlimited
>>> ./wrf.exe
>>>
>>>
>>>
>>> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:
>>>
 Hi, Jeff,

 Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
 -s unlimited; ./wrf.exe", I tried and wrf.exe doesn't executed.

 Here is the content of openmpi-mpirun file, so maybe something needs
>>> to
 be changed?

 --
 #!/bin/sh
 #
 #  Copyright (c) 2007 Platform Computing
 #
 # This script is a wrapper for openmpi mpirun
 # it generates the machine file based on the hosts
 # given to it by Lava.
 #

 usage() {
         cat <<USEEOF
         USAGE:  $0
         This command is a wrapper for mpirun (openmpi).  It can
         only be run within Lava using bsub e.g.
                 bsub -n # "$0 -np # {my mpi command and args}"

         The wrapper will automatically generate the
         machinefile used by mpirun.

         NOTE:  The list of hosts cannot exceed 4KBytes.
 USEEOF
 }

 if [ x"${LSB_JOBFILENAME}" = x -o 

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeroen,

Here is the OUT file, ERR file is empty.

--
Sender: LSF System 
Subject: Job 662:  Done

Job  was 
submitted from host  by user .
Job was executed on host(s) <8*compute-10>, in queue , as user .
<8*compute-11>
 was used as the home directory.
 was used as the working directory.
Started at Thu Dec 17 17:37:09 2009
Results reported at Thu Dec 17 17:37:15 2009

Your job looked like:


# LSBATCH: User input
openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe '


Successfully completed.

Resource usage summary:

CPU time   :  0.05 sec.

The output (if any) follows:

unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited


PS:

Read file  for stderr output of this job.

--

Thanks,

Min Zhu


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeroen Kleijer
Sent: 17 December 2009 16:05
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

Hi Min

Did you get any type of error message in the ERR or OUT files?
I don't have mpirun installed in the environment at the moment but
giving the following:

bsub -q interq -I "ssh  /bin/sh -c 'ulimit -s unlimited ;
/bin/hostname ' "

seems to work for me, so I'm kind of curious what the error message is
you're seeing.

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 4:41 PM, Min Zhu  wrote:
> Hi, Jeroen,
>
> Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 
> "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe 
> not executed.
>
> Cheers,
>
> Min Zhu
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 15:34
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> It's just that the "'s on the command line get parsed by LSF / bash
> (or whatever shell you use)
>
> If you wish to use it without the script you can give this a try:
> bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s
> unlimited; ./wrf.exe ' "
>
> This causes to pass the whole string "openmpi-mpirun " to be
> passed as a single string / command to LSF.
> The second line between the single quotes is then passed as a single
> argument to /bin/sh which is run by openmpi-mpirun.
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu  wrote:
>> Hi, Jeff,
>>
>> Your script method works for me. Thank you very much,
>>
>> Cheers,
>>
>> Min Zhu
>>
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Jeff Squyres
>> Sent: 17 December 2009 14:56
>> To: Open MPI Users
>> Subject: Re: [OMPI users] About openmpi-mpirun
>>
>> This might be something you need to talk to Platform about...?
>>
>> Another option would be to openmpi-mpirun a script that is just a few
>> lines long:
>>
>> #!/bin/sh
>> ulimit -s unlimited
>> ./wrf.exe
>>
>>
>>
>> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:
>>
>>> Hi, Jeff,
>>>
>>> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
>>> -s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.
>>>
>>> Here is the content of openmpi-mpirun file, so maybe something needs
>> to
>>> be changed?
>>>
>>> --
>>> #!/bin/sh
>>> #
>>> #  Copyright (c) 2007 Platform Computing
>>> #
>>> # This script is a wrapper for openmpi mpirun
>>> # it generates the machine file based on the hosts
>>> # given to it by Lava.
>>> #
>>>
>>> usage() {
>>>         cat <<USEEOF
>>> USAGE:  $0
>>>         This command is a wrapper for mpirun (openmpi).  It can
>>>         only be run within Lava using bsub e.g.
>>>                 bsub -n # "$0 -np # {my mpi command and args}"
>>>
>>>         The wrapper will automatically generate the
>>>         machinefile used by mpirun.
>>>
>>>         NOTE:  The list of hosts cannot exceed 4KBytes.
>>> USEEOF
>>> }
>>>
>>> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
>>>     usage
>>>     exit -1
>>> fi
>>>
>>> MYARGS=$*
>>> WORKDIR=`dirname ${LSB_JOBFILENAME}`
>>> MACHFILE=${WORKDIR}/mpi_machines
>>> ARGLIST=${WORKDIR}/mpi_args
>>>
>>> # Check if mpirun is in the PATH
>>> T=`which mpirun`
>>> if [ $? -ne 0 ]; then
>>>     echo "Error:  mpirun is not in your PATH."
>>>     exit -2
>>> fi
>>>
>>> echo "${MYARGS}" > ${ARGLIST}
>>> T=`grep -- -machinefile ${ARGLIST} |wc -l`
>>> if [ $T -gt 0 ]; then
>>>     echo "Error:  Do not provide the machinefile for mpirun."
>>>     echo "        It is generated automatically for you."
>>>     exit -3
>>> fi
>>>
>>> # Make the

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min

Did you get any type of error message in the ERR or OUT files?
I don't have mpirun installed in the environment at the moment but
giving the following:

bsub -q interq -I "ssh  /bin/sh -c 'ulimit -s unlimited ;
/bin/hostname ' "

seems to work for me, so I'm kind of curious what the error message is
you're seeing.

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 4:41 PM, Min Zhu  wrote:
> Hi, Jeroen,
>
> Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 
> "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe 
> was not executed.
>
> Cheers,
>
> Min Zhu
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Jeroen Kleijer
> Sent: 17 December 2009 15:34
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> It's just that the "'s on the command line get parsed by LSF / bash
> (or whatever shell you use)
>
> If you wish to use it without the script you can give this a try:
> bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s
> unlimited; ./wrf.exe ' "
>
> This causes the whole string "openmpi-mpirun " to be
> passed as a single string / command to LSF.
> The second line between the single quotes is then passed as a single
> argument to /bin/sh which is run by openmpi-mpirun.
>
> Kind regards,
>
> Jeroen Kleijer
>
> On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu  wrote:
>> Hi, Jeff,
>>
>> Your script method works for me. Thank you very much,
>>
>> Cheers,
>>
>> Min Zhu
>>
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Jeff Squyres
>> Sent: 17 December 2009 14:56
>> To: Open MPI Users
>> Subject: Re: [OMPI users] About openmpi-mpirun
>>
>> This might be something you need to talk to Platform about...?
>>
>> Another option would be to openmpi-mpirun a script that is just a few
>> lines long:
>>
>> #!/bin/sh
>> ulimit -s unlimited
>> ./wrf.exe
>>
>>
>>
>> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:
>>
>>> Hi, Jeff,
>>>
>>> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
>>> -s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.
>>>
>>> Here is the content of openmpi-mpirun file, so maybe something needs
>> to
>>> be changed?
>>>
>>> --
>>> #!/bin/sh
>>> #
>>> #  Copyright (c) 2007 Platform Computing
>>> #
>>> # This script is a wrapper for openmpi mpirun
>>> # it generates the machine file based on the hosts
>>> # given to it by Lava.
>>> #
>>>
>>> usage() {
>>>         cat <<USEEOF
>>> USAGE:  $0
>>>         This command is a wrapper for mpirun (openmpi).  It can
>>>         only be run within Lava using bsub e.g.
>>>                 bsub -n # "$0 -np # {my mpi command and args}"
>>>
>>>         The wrapper will automatically generate the
>>>         machinefile used by mpirun.
>>>
>>>         NOTE:  The list of hosts cannot exceed 4KBytes.
>>> USEEOF
>>> }
>>>
>>> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
>>>     usage
>>>     exit -1
>>> fi
>>>
>>> MYARGS=$*
>>> WORKDIR=`dirname ${LSB_JOBFILENAME}`
>>> MACHFILE=${WORKDIR}/mpi_machines
>>> ARGLIST=${WORKDIR}/mpi_args
>>>
>>> # Check if mpirun is in the PATH
>>> T=`which mpirun`
>>> if [ $? -ne 0 ]; then
>>>     echo "Error:  mpirun is not in your PATH."
>>>     exit -2
>>> fi
>>>
>>> echo "${MYARGS}" > ${ARGLIST}
>>> T=`grep -- -machinefile ${ARGLIST} |wc -l`
>>> if [ $T -gt 0 ]; then
>>>     echo "Error:  Do not provide the machinefile for mpirun."
>>>     echo "        It is generated automatically for you."
>>>     exit -3
>>> fi
>>>
>>> # Make the open-mpi machine file
>>> echo "${LSB_HOSTS}" > ${MACHFILE}.lst
>>> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}
>>>
>>> MPIRUN=`which --skip-alias mpirun`
>>> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}
>>>
>>> exit $?
>>>
>>> --
>>>
>>>
>>> Cheers,
>>>
>>> Min Zhu
>>>
>>> -Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>> On
>>> Behalf Of Jeff Squyres
>>> Sent: 17 December 2009 14:29
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] About openmpi-mpirun
>>>
>>> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:
>>>
>>> > Thanks for your reply. Yes, your mpirun command works for me. But I
>>> need to use bsub job scheduler. I wonder why
>>> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s
>>> unlimited; ./wrf.exe" doesn't work.
>>>
>>> Try with different quoting...?  I don't know the details of the
>>> openmpi-mpirun script, but perhaps it's trying to exec the whole
>> quoted
>>> string as a single executable (which doesn't exist).  Perhaps:
>>>
>>> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s
>> unlimited;
>>> ./wrf.exe"
>>>
>>> That's a (somewhat educated) guess...
>>>
>>> --
>>>
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>>
>>>
>>> __

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeroen,

Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 
"openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe not 
executed. 

Cheers,

Min Zhu

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeroen Kleijer
Sent: 17 December 2009 15:34
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

It's just that the "'s on the command line get parsed by LSF / bash
(or whatever shell you use)

If you wish to use it without the script you can give this a try:
bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s
unlimited; ./wrf.exe ' "

This causes the whole string "openmpi-mpirun " to be
passed as a single string / command to LSF.
The second line between the single quotes is then passed as a single
argument to /bin/sh which is run by openmpi-mpirun.

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu  wrote:
> Hi, Jeff,
>
> Your script method works for me. Thank you very much,
>
> Cheers,
>
> Min Zhu
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: 17 December 2009 14:56
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> This might be something you need to talk to Platform about...?
>
> Another option would be to openmpi-mpirun a script that is just a few
> lines long:
>
> #!/bin/sh
> ulimit -s unlimited
> ./wrf.exe
>
>
>
> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:
>
>> Hi, Jeff,
>>
>> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
>> -s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.
>>
>> Here is the content of openmpi-mpirun file, so maybe something needs
> to
>> be changed?
>>
>> --
>> #!/bin/sh
>> #
>> #  Copyright (c) 2007 Platform Computing
>> #
>> # This script is a wrapper for openmpi mpirun
>> # it generates the machine file based on the hosts
>> # given to it by Lava.
>> #
>>
>> usage() {
>>         cat <<USEEOF
>> USAGE:  $0
>>         This command is a wrapper for mpirun (openmpi).  It can
>>         only be run within Lava using bsub e.g.
>>                 bsub -n # "$0 -np # {my mpi command and args}"
>>
>>         The wrapper will automatically generate the
>>         machinefile used by mpirun.
>>
>>         NOTE:  The list of hosts cannot exceed 4KBytes.
>> USEEOF
>> }
>>
>> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
>>     usage
>>     exit -1
>> fi
>>
>> MYARGS=$*
>> WORKDIR=`dirname ${LSB_JOBFILENAME}`
>> MACHFILE=${WORKDIR}/mpi_machines
>> ARGLIST=${WORKDIR}/mpi_args
>>
>> # Check if mpirun is in the PATH
>> T=`which mpirun`
>> if [ $? -ne 0 ]; then
>>     echo "Error:  mpirun is not in your PATH."
>>     exit -2
>> fi
>>
>> echo "${MYARGS}" > ${ARGLIST}
>> T=`grep -- -machinefile ${ARGLIST} |wc -l`
>> if [ $T -gt 0 ]; then
>>     echo "Error:  Do not provide the machinefile for mpirun."
>>     echo "        It is generated automatically for you."
>>     exit -3
>> fi
>>
>> # Make the open-mpi machine file
>> echo "${LSB_HOSTS}" > ${MACHFILE}.lst
>> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}
>>
>> MPIRUN=`which --skip-alias mpirun`
>> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}
>>
>> exit $?
>>
>> --
>>
>>
>> Cheers,
>>
>> Min Zhu
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
>> Behalf Of Jeff Squyres
>> Sent: 17 December 2009 14:29
>> To: Open MPI Users
>> Subject: Re: [OMPI users] About openmpi-mpirun
>>
>> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:
>>
>> > Thanks for your reply. Yes, your mpirun command works for me. But I
>> need to use bsub job scheduler. I wonder why
>> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s
>> unlimited; ./wrf.exe" doesn't work.
>>
>> Try with different quoting...?  I don't know the details of the
>> openmpi-mpirun script, but perhaps it's trying to exec the whole
> quoted
>> string as a single executable (which doesn't exist).  Perhaps:
>>
>> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s
> unlimited;
>> ./wrf.exe"
>>
>> That's a (somewhat educated) guess...
>>
>> --
>>
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> CONFIDENTIALITY NOTICE: This e-mail, including any attachments,
> contains information that may be confidential, and is protected by
> copyright. It is directed to the intended recipient(s) only.  If you
> have received this e-mail in error please e-mail the sender by replying
> to this message, and then delete the e-mail. Unauthorised disclosure,
> publication, copying or use of this e-mail is prohibited.  Any
> communication of a personal nature in this e-mail is not ma

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
It's just that the "'s on the command line get parsed by LSF / bash
(or whatever shell you use)

If you wish to use it without the script you can give this a try:
bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s
unlimited; ./wrf.exe ' "

This causes the whole string "openmpi-mpirun " to be
passed as a single string / command to LSF.
The second line between the single quotes is then passed as a single
argument to /bin/sh which is run by openmpi-mpirun.

Kind regards,

Jeroen Kleijer

On Thu, Dec 17, 2009 at 4:03 PM, Min Zhu  wrote:
> Hi, Jeff,
>
> Your script method works for me. Thank you very much,
>
> Cheers,
>
> Min Zhu
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: 17 December 2009 14:56
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
>
> This might be something you need to talk to Platform about...?
>
> Another option would be to openmpi-mpirun a script that is just a few
> lines long:
>
> #!/bin/sh
> ulimit -s unlimited
> ./wrf.exe
>
>
>
> On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:
>
>> Hi, Jeff,
>>
>> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
>> -s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.
>>
>> Here is the content of openmpi-mpirun file, so maybe something needs
> to
>> be changed?
>>
>> --
>> #!/bin/sh
>> #
>> #  Copyright (c) 2007 Platform Computing
>> #
>> # This script is a wrapper for openmpi mpirun
>> # it generates the machine file based on the hosts
>> # given to it by Lava.
>> #
>>
>> usage() {
>>         cat <<USEEOF
>> USAGE:  $0
>>         This command is a wrapper for mpirun (openmpi).  It can
>>         only be run within Lava using bsub e.g.
>>                 bsub -n # "$0 -np # {my mpi command and args}"
>>
>>         The wrapper will automatically generate the
>>         machinefile used by mpirun.
>>
>>         NOTE:  The list of hosts cannot exceed 4KBytes.
>> USEEOF
>> }
>>
>> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
>>     usage
>>     exit -1
>> fi
>>
>> MYARGS=$*
>> WORKDIR=`dirname ${LSB_JOBFILENAME}`
>> MACHFILE=${WORKDIR}/mpi_machines
>> ARGLIST=${WORKDIR}/mpi_args
>>
>> # Check if mpirun is in the PATH
>> T=`which mpirun`
>> if [ $? -ne 0 ]; then
>>     echo "Error:  mpirun is not in your PATH."
>>     exit -2
>> fi
>>
>> echo "${MYARGS}" > ${ARGLIST}
>> T=`grep -- -machinefile ${ARGLIST} |wc -l`
>> if [ $T -gt 0 ]; then
>>     echo "Error:  Do not provide the machinefile for mpirun."
>>     echo "        It is generated automatically for you."
>>     exit -3
>> fi
>>
>> # Make the open-mpi machine file
>> echo "${LSB_HOSTS}" > ${MACHFILE}.lst
>> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}
>>
>> MPIRUN=`which --skip-alias mpirun`
>> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}
>>
>> exit $?
>>
>> --
>>
>>
>> Cheers,
>>
>> Min Zhu
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
>> Behalf Of Jeff Squyres
>> Sent: 17 December 2009 14:29
>> To: Open MPI Users
>> Subject: Re: [OMPI users] About openmpi-mpirun
>>
>> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:
>>
>> > Thanks for your reply. Yes, your mpirun command works for me. But I
>> need to use bsub job scheduler. I wonder why
>> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s
>> unlimited; ./wrf.exe" doesn't work.
>>
>> Try with different quoting...?  I don't know the details of the
>> openmpi-mpirun script, but perhaps it's trying to exec the whole
> quoted
>> string as a single executable (which doesn't exist).  Perhaps:
>>
>> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s
> unlimited;
>> ./wrf.exe"
>>
>> That's a (somewhat educated) guess...
>>
>> --
>>
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> CONFIDENTIALITY NOTICE: This e-mail, including any attachments,
> contains information that may be confidential, and is protected by
> copyright. It is directed to the intended recipient(s) only.  If you
> have received this e-mail in error please e-mail the sender by replying
> to this message, and then delete the e-mail. Unauthorised disclosure,
> publication, copying or use of this e-mail is prohibited.  Any
> communication of a personal nature in this e-mail is not made by or on
> behalf of any RES group company. E-mails sent or received may be
> monitored to ensure compliance with the law, regulation and/or our
> policies.
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
>
> Jeff Squyres
> jsquy...@cisco.com
>
>
> _

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeff,

Your script method works for me. Thank you very much,

Cheers,

Min Zhu


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: 17 December 2009 14:56
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

This might be something you need to talk to Platform about...?

Another option would be to openmpi-mpirun a script that is just a few
lines long:

#!/bin/sh
ulimit -s unlimited
./wrf.exe



On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:

> Hi, Jeff,
> 
> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
> -s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.
> 
> Here is the content of openmpi-mpirun file, so maybe something needs
to
> be changed?
> 
> --
> #!/bin/sh
> #
> #  Copyright (c) 2007 Platform Computing
> #
> # This script is a wrapper for openmpi mpirun
> # it generates the machine file based on the hosts
> # given to it by Lava.
> #
> 
> usage() {
> cat <<USEEOF
> USAGE:  $0
> This command is a wrapper for mpirun (openmpi).  It can
> only be run within Lava using bsub e.g.
> bsub -n # "$0 -np # {my mpi command and args}"
> 
> The wrapper will automatically generate the
> machinefile used by mpirun.
> 
> NOTE:  The list of hosts cannot exceed 4KBytes.
> USEEOF
> }
> 
> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
> usage
> exit -1
> fi
> 
> MYARGS=$*
> WORKDIR=`dirname ${LSB_JOBFILENAME}`
> MACHFILE=${WORKDIR}/mpi_machines
> ARGLIST=${WORKDIR}/mpi_args
> 
> # Check if mpirun is in the PATH
> T=`which mpirun`
> if [ $? -ne 0 ]; then
> echo "Error:  mpirun is not in your PATH."
> exit -2
> fi
> 
> echo "${MYARGS}" > ${ARGLIST}
> T=`grep -- -machinefile ${ARGLIST} |wc -l`
> if [ $T -gt 0 ]; then
> echo "Error:  Do not provide the machinefile for mpirun."
> echo "It is generated automatically for you."
> exit -3
> fi
> 
> # Make the open-mpi machine file
> echo "${LSB_HOSTS}" > ${MACHFILE}.lst
> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}
> 
> MPIRUN=`which --skip-alias mpirun`
> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}
> 
> exit $?
> 
> --
> 
> 
> Cheers,
> 
> Min Zhu
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On
> Behalf Of Jeff Squyres
> Sent: 17 December 2009 14:29
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
> 
> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:
> 
> > Thanks for your reply. Yes, your mpirun command works for me. But I
> need to use bsub job scheduler. I wonder why
> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s
> unlimited; ./wrf.exe" doesn't work.
> 
> Try with different quoting...?  I don't know the details of the
> openmpi-mpirun script, but perhaps it's trying to exec the whole
quoted
> string as a single executable (which doesn't exist).  Perhaps:
> 
> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s
unlimited;
> ./wrf.exe"
> 
> That's a (somewhat educated) guess...
> 
> --
> 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> CONFIDENTIALITY NOTICE: This e-mail, including any attachments,
contains information that may be confidential, and is protected by
copyright. It is directed to the intended recipient(s) only.  If you
have received this e-mail in error please e-mail the sender by replying
to this message, and then delete the e-mail. Unauthorised disclosure,
publication, copying or use of this e-mail is prohibited.  Any
communication of a personal nature in this e-mail is not made by or on
behalf of any RES group company. E-mails sent or received may be
monitored to ensure compliance with the law, regulation and/or our
policies.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 

Jeff Squyres
jsquy...@cisco.com


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
information that may be confidential, and is protected by copyright. It is 
directed to the intended recipient(s) only.  If you have received this e-mail 
in error please e-mail the sender by replying to this message, and then delete 
the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail 
is prohibited.  Any communication of a personal nature in this e-mail is not 
made by or on behalf of any RES group company. E-mails sent or received may be 
monitored to ensure compliance with the law, regulation and/or our policies.



Re: [OMPI users] Debugging spawned processes

2009-12-17 Thread Ralph Castain

On Dec 17, 2009, at 3:54 AM, jody wrote:

> yeah, now that you mention it, i remember (old brain here, as well)
> But IIRC you created an OMPI version which was called 1.4a1r or something,
> where i indeed could use this xterm. When i updated to 1.3.2, i sort
> of forgot about it again...

The 1.4a1rxxx version was just our development trunk at the time - it was then 
moved to the official release branch (1.3).

> 
> Another question though:
> You said "If it includes the -xterm option, then that option gets
> applied to the dynamically spawned procs too"
> Does this passing on also apply to the -x options?

Certainly should, though I can't claim to have personally tested it.

> 
> Thanks
>  Jody
> 
> On Wed, Dec 16, 2009 at 3:42 PM, Ralph Castain  wrote:
>> It is in a later version - pretty sure it made 1.3.3. IIRC, I added it at 
>> your request :-)
>> 
>> On Dec 16, 2009, at 7:20 AM, jody wrote:
>> 
>>> Thanks for your reply
>>> 
>>> That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not
>>> to recognize the --xterm option.
>>> [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf
>>> --
>>> mpirun was unable to launch the specified application as it could not
>>> find an executable:
>>> 
>>> Executable: 1
>>> Node: aim-plankton.uzh.ch
>>> 
>>> while attempting to start process rank 0.
>>> --
>>> (if i reverse the --xterm and -np 1, it complains about not finding
>>> executable '9')
>>> Do i need to install a higher version, or is this something i'd have
>>> to set as option in configure?
>>> 
>>> Thank You
>>>  Jody
>>> 
>>> On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain  wrote:
 Depends on the version you are working with. If it includes the -xterm 
 option, then that option gets applied to the dynamically spawned procs 
 too, so this should be automatically taken care of...but in that case, you 
 wouldn't need your script to open an xterm anyway. You would just do:
 
 mpirun --xterm -np 5 gdb ./my_app
 
 or the equivalent. You would then comm_spawn an argv[0] of "gdb", with 
 argv[1] being your target app.
 
 I don't know how to avoid including that "gdb" in the comm_spawn argv's - 
 I once added an mpirun cmd line option to automatically add it, but got 
 loudly told to remove it.  Of course, it should be easy to pass an option 
 to your app itself that tells it whether or not to do so!
 
 HTH
 Ralph
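
For reference, a minimal sketch of the comm_spawn-of-gdb idea described above (an editor-added illustration, not code from the thread; "./my_app" is a placeholder and the xterm handling is left to mpirun's --xterm option as discussed):

/* Hedged sketch: spawn one process running "gdb ./my_app" with
 * MPI_Comm_spawn.  The spawned command is "gdb" and its argument
 * vector carries the real application. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    char *spawn_argv[] = { "./my_app", NULL };   /* arguments handed to gdb */

    MPI_Init(&argc, &argv);

    MPI_Comm_spawn("gdb", spawn_argv, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    /* ... communicate with the spawned (debugged) process over "child" ... */

    MPI_Finalize();
    return 0;
}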
 
 
 On Dec 16, 2009, at 4:06 AM, jody wrote:
 
> Hi
> Until now i always wrote applications for which the number of processes
> was given on the command line with -np.
> To debug these applications i wrote a script, run_gdb.sh, which basically
> opens an xterm and starts gdb in it for my application.
> This allowed me to have a window for each of the processes being debugged.
> 
> Now, however, i write my first application in which additional processes 
> are
> being spawned. My question is now: how can i open xterm windows in which
> gdb runs for the spawned processes?
> 
> The only way i can think of is to pass my script run_gdb.sh into the argv
> parameters of MPI_Spawn.
> Would this be correct?
> If yes, what about other parameters passed to the spawning process, such 
> as
> environment variables passed via -x? Are they being passed to the spawned
> processes as well? In my case this would be necessary so that processes
> on other machine will get the $DISPLAY environment variable in order to
> display their xterms with gdb on my workstation.
> 
> Another negative point would be the need to change the argv parameters
> every time one switches between debugging and normal running.
> 
> Has anybody got some hints on how to debug spawned processes?
> 
> Thank You
>  Jody
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeff Squyres
This might be something you need to talk to Platform about...?

Another option would be to openmpi-mpirun a script that is just a few lines 
long:

#!/bin/sh
ulimit -s unlimited
./wrf.exe
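
A related sketch (an editor-added assumption, not something posted in the thread): a similar effect can sometimes be obtained from inside the program itself with setrlimit(), subject to the hard limit that LSF or the system has already set; an unprivileged process can only raise its soft stack limit up to that hard limit.

/* Hedged sketch: raise the soft stack limit from within the process,
 * as an alternative to wrapping the launch in a shell script. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    rl.rlim_cur = rl.rlim_max;          /* soft limit up to the hard limit */
    if (setrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    printf("stack soft limit raised to the hard limit\n");
    return 0;
}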



On Dec 17, 2009, at 9:40 AM, Min Zhu wrote:

> Hi, Jeff,
> 
> Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
> -s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.
> 
> Here is the content of openmpi-mpirun file, so maybe something needs to
> be changed?
> 
> --
> #!/bin/sh
> #
> #  Copyright (c) 2007 Platform Computing
> #
> # This script is a wrapper for openmpi mpirun
> # it generates the machine file based on the hosts
> # given to it by Lava.
> #
> 
> usage() {
> cat <<USEEOF
> USAGE:  $0
> This command is a wrapper for mpirun (openmpi).  It can
> only be run within Lava using bsub e.g.
> bsub -n # "$0 -np # {my mpi command and args}"
> 
> The wrapper will automatically generate the
> machinefile used by mpirun.
> 
> NOTE:  The list of hosts cannot exceed 4KBytes.
> USEEOF
> }
> 
> if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
> usage
> exit -1
> fi
> 
> MYARGS=$*
> WORKDIR=`dirname ${LSB_JOBFILENAME}`
> MACHFILE=${WORKDIR}/mpi_machines
> ARGLIST=${WORKDIR}/mpi_args
> 
> # Check if mpirun is in the PATH
> T=`which mpirun`
> if [ $? -ne 0 ]; then
> echo "Error:  mpirun is not in your PATH."
> exit -2
> fi
> 
> echo "${MYARGS}" > ${ARGLIST}
> T=`grep -- -machinefile ${ARGLIST} |wc -l`
> if [ $T -gt 0 ]; then
> echo "Error:  Do not provide the machinefile for mpirun."
> echo "It is generated automatically for you."
> exit -3
> fi
> 
> # Make the open-mpi machine file
> echo "${LSB_HOSTS}" > ${MACHFILE}.lst
> tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}
> 
> MPIRUN=`which --skip-alias mpirun`
> ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}
> 
> exit $?
> 
> --
> 
> 
> Cheers,
> 
> Min Zhu
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: 17 December 2009 14:29
> To: Open MPI Users
> Subject: Re: [OMPI users] About openmpi-mpirun
> 
> On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:
> 
> > Thanks for your reply. Yes, your mpirun command works for me. But I
> need to use bsub job scheduler. I wonder why
> > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s
> unlimited; ./wrf.exe" doesn't work.
> 
> Try with different quoting...?  I don't know the details of the
> openmpi-mpirun script, but perhaps it's trying to exec the whole quoted
> string as a single executable (which doesn't exist).  Perhaps:
> 
> bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited;
> ./wrf.exe"
> 
> That's a (somewhat educated) guess...
> 
> --
> 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
> information that may be confidential, and is protected by copyright. It is 
> directed to the intended recipient(s) only.  If you have received this e-mail 
> in error please e-mail the sender by replying to this message, and then 
> delete the e-mail. Unauthorised disclosure, publication, copying or use of 
> this e-mail is prohibited.  Any communication of a personal nature in this 
> e-mail is not made by or on behalf of any RES group company. E-mails sent or 
> received may be monitored to ensure compliance with the law, regulation 
> and/or our policies.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeff,

Thanks. For bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit
-s unlimited; ./wrf.exe", I tried it and wrf.exe didn't execute.

Here is the content of openmpi-mpirun file, so maybe something needs to
be changed?

--
#!/bin/sh
#
#  Copyright (c) 2007 Platform Computing
#
# This script is a wrapper for openmpi mpirun
# it generates the machine file based on the hosts
# given to it by Lava.
#

usage() {
cat <<USEEOF
USAGE:  $0
This command is a wrapper for mpirun (openmpi).  It can
only be run within Lava using bsub e.g.
bsub -n # "$0 -np # {my mpi command and args}"

The wrapper will automatically generate the
machinefile used by mpirun.

NOTE:  The list of hosts cannot exceed 4KBytes.
USEEOF
}

if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
usage
exit -1
fi

MYARGS=$*
WORKDIR=`dirname ${LSB_JOBFILENAME}`
MACHFILE=${WORKDIR}/mpi_machines
ARGLIST=${WORKDIR}/mpi_args

# Check if mpirun is in the PATH
T=`which mpirun`
if [ $? -ne 0 ]; then
echo "Error:  mpirun is not in your PATH."
exit -2
fi

echo "${MYARGS}" > ${ARGLIST}
T=`grep -- -machinefile ${ARGLIST} |wc -l`
if [ $T -gt 0 ]; then
echo "Error:  Do not provide the machinefile for mpirun."
echo "It is generated automatically for you."
exit -3
fi

# Make the open-mpi machine file
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

MPIRUN=`which --skip-alias mpirun`
${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}

exit $?

--


Cheers,

Min Zhu

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: 17 December 2009 14:29
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:

> Thanks for your reply. Yes, your mpirun command works for me. But I
need to use bsub job scheduler. I wonder why
> bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s
unlimited; ./wrf.exe" doesn't work.

Try with different quoting...?  I don't know the details of the
openmpi-mpirun script, but perhaps it's trying to exec the whole quoted
string as a single executable (which doesn't exist).  Perhaps:

bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited;
./wrf.exe"

That's a (somewhat educated) guess...

-- 

Jeff Squyres
jsquy...@cisco.com


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
information that may be confidential, and is protected by copyright. It is 
directed to the intended recipient(s) only.  If you have received this e-mail 
in error please e-mail the sender by replying to this message, and then delete 
the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail 
is prohibited.  Any communication of a personal nature in this e-mail is not 
made by or on behalf of any RES group company. E-mails sent or received may be 
monitored to ensure compliance with the law, regulation and/or our policies.



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeff Squyres
On Dec 17, 2009, at 9:15 AM, Min Zhu wrote:

> Thanks for your reply. Yes, your mpirun command works for me. But I need to 
> use bsub job scheduler. I wonder why
> bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; 
> ./wrf.exe" doesn't work.

Try with different quoting...?  I don't know the details of the openmpi-mpirun 
script, but perhaps it's trying to exec the whole quoted string as a single 
executable (which doesn't exist).  Perhaps:

bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; 
./wrf.exe"

That's a (somewhat educated) guess...

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi,

Thanks for your reply. Yes, your mpirun command works for me. But I need to use 
the bsub job scheduler. I wonder why 
bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; 
./wrf.exe" doesn't work.

Cheers,

Min Zhu


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Romaric David
Sent: 17 December 2009 13:58
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun

Min Zhu wrote:
> Hi, 
> 
> I executed 
> bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; 
> ./wrf.exe"
> 
Hello,

Here we use :

  mpirun  /bin/tcsh -c " limit stacksize unlimited ; /full/path/to/wrf.exe"


Regards,
R. David
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
information that may be confidential, and is protected by copyright. It is 
directed to the intended recipient(s) only.  If you have received this e-mail 
in error please e-mail the sender by replying to this message, and then delete 
the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail 
is prohibited.  Any communication of a personal nature in this e-mail is not 
made by or on behalf of any RES group company. E-mails sent or received may be 
monitored to ensure compliance with the law, regulation and/or our policies.



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Romaric David

Min Zhu wrote:
Hi, 

I executed 
bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe"



Hello,

Here we use :

 mpirun  /bin/tcsh -c " limit stacksize unlimited ; /full/path/to/wrf.exe"


Regards,
R. David


Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, 

I executed 
bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; 
./wrf.exe"

Here are the results; it seems that wrf.exe didn't run. The ERR file is empty.



Sender: LSF System 
Subject: Job 647:  
Done

Job  was submitted 
from host  by user .
Job was executed on host(s) <8*compute-10>, in queue , as user .
<8*compute-11>
 was used as the home directory.
 was used as the working directory.
Started at Thu Dec 17 15:27:18 2009
Results reported at Thu Dec 17 15:27:27 2009

Your job looked like:


# LSBATCH: User input
openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe"


Successfully completed.

Resource usage summary:

CPU time   :  0.05 sec.

The output (if any) follows:

unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited


PS:

Read file  for stderr output of this job.
-

Thanks,

Min Zhu



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Romaric David
Sent: 17 December 2009 13:17
To: Open MPI Users
Subject: Re: [OMPI users] About openmpi-mpirun



Min Zhu wrote:
> Dear all,
> 
> I have a question to ask you. If I issue a command "bsub -n 16
> openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes)cluster, the job
> failed to run due to stacksize problem. Openmpi-mpirun is a wrapper
> script written by Platform. If I want to add '/bin/sh -c ulimit -s
> unlimited' to the above bsub command, how I can do it? Thank you very
> much in advance,
> 
Hello,

you should add it before wrf.exe:

openmpi-mpirun "sh -c ulimit -s unlimited ; ./wrf.exe"

Regards,
R. David



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
information that may be confidential, and is protected by copyright. It is 
directed to the intended recipient(s) only.  If you have received this e-mail 
in error please e-mail the sender by replying to this message, and then delete 
the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail 
is prohibited.  Any communication of a personal nature in this e-mail is not 
made by or on behalf of any RES group company. E-mails sent or received may be 
monitored to ensure compliance with the law, regulation and/or our policies.



Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Romaric David



Min Zhu wrote:

Dear all,

I have a question to ask you. If I issue a command "bsub -n 16
openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes)cluster, the job
failed to run due to stacksize problem. Openmpi-mpirun is a wrapper
script written by Platform. If I want to add '/bin/sh -c ulimit -s
unlimited' to the above bsub command, how I can do it? Thank you very
much in advance,


Hello,

you should add it before wrf.exe:

openmpi-mpirun "sh -c ulimit -s unlimited ; ./wrf.exe"

Regards,
R. David





[OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Dear all,

I have a question to ask you. If I issue a command "bsub -n 16
openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes)cluster, the job
failed to run due to stacksize problem. Openmpi-mpirun is a wrapper
script written by Platform. If I want to add '/bin/sh -c ulimit -s
unlimited' to the above bsub command, how I can do it? Thank you very
much in advance,

Cheers,

Min Zhu

CONFIDENTIALITY NOTICE: This e-mail, including any attachments, contains 
information that may be confidential, and is protected by copyright. It is 
directed to the intended recipient(s) only.  If you have received this e-mail 
in error please e-mail the sender by replying to this message, and then delete 
the e-mail. Unauthorised disclosure, publication, copying or use of this e-mail 
is prohibited.  Any communication of a personal nature in this e-mail is not 
made by or on behalf of any RES group company. E-mails sent or received may be 
monitored to ensure compliance with the law, regulation and/or our policies.



[OMPI users] MPI-IO, providing buffers

2009-12-17 Thread Ricardo Reis


 Hi all

 I have a doubt. I'm starting to use MPI-IO and was wondering if I can use 
the MPI_BUFFER_ATTACH to provide the necessary IO buffer (or will it use 
the array I'm passing to the MPI_Write...??)
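
For what it is worth, MPI_Buffer_attach only supplies the buffer used by buffered-mode sends (MPI_Bsend); the MPI-IO write calls read directly from the buffer argument you pass them, so no attached buffer is involved. A minimal hedged sketch (an editor-added illustration, not from the thread; the file name and sizes are placeholders):

/* Each rank writes its own block of doubles; the data comes straight
 * from the "data" array passed to MPI_File_write_at. */
#include <mpi.h>

int main(int argc, char **argv)
{
    double data[1000];
    MPI_File fh;
    MPI_Status st;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1000; i++)
        data[i] = rank + 0.001 * i;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(data), data, 1000,
                      MPI_DOUBLE, &st);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}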


 many thanks,

 Ricardo Reis

 'Non Serviam'

 PhD candidate @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 http://www.lasef.ist.utl.pt

 Cultural Instigator @ Rádio Zero
 http://www.radiozero.pt

 Keep them Flying! Ajude a/help Aero Fénix!

 http://www.aeronauta.com/aero.fenix

 http://www.flickr.com/photos/rreis/

Re: [OMPI users] error performing MPI_Comm_spawn

2009-12-17 Thread Marcia Cristina Cera
very good news
I will wait carefully for the release :)

Thanks, Ralph
márcia.

On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain  wrote:

> Ah crumb - I found the problem. Sigh.
>
> I actually fixed this in the trunk over 5 months ago when the problem first
> surfaced in my own testing, but it never came across to the stable release
> branch. The problem is that we weren't serializing the comm_spawn requests,
> and so the launch system gets confused over what has and hasn't completed
> launch. That's why it works fine on the trunk.
>
> I'm creating the 1.4 patch right now. Thanks for catching this. Old brain
> completely forgot until I started tracking the commit history and found my
> own footprints!
>
> Ralph
>
> On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote:
>
> Hi Ralph,
>
> I am afraid I have been a little hasty!
> I remade my tests with more care and I got the same error also with
> 1.3.3 :-/
> but in that version the error happens only after some successful executions...
> because of that I did not notice it before!
> Furthermore, I increased the number of levels of the tree (which means having
> more concurrent dynamic process creations in the lower levels) and I never
> managed to execute without error, unless I add the delay.
> Perhaps the problem might even be a race condition :(
>
> I tested with LAM/MPI 7.1.4 and at first it worked fine. I have worked
> with LAM for years, but I migrated to Open MPI last year since LAM will be
> discontinued...
>
> I think that I can continue the development of my application by adding the
> delay while I wait for a release... and I will leave the performance tests to be
> done in the future :)
>
> Thank you again Ralph,
> márcia.
>
>
> On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain  wrote:
>
>> Okay, I can replicate this.
>>
>> FWIW: your  test program works fine with the OMPI trunk and 1.3.3. It only
>> has a problem with 1.4. Since I can replicate it on multiple machines every
>> single time, I don't think it is actually a race condition.
>>
>> I think someone made a change to the 1.4 branch that created a failure
>> mode :-/
>>
>> Will have to get back to you on this - may take awhile, and won't be in
>> the 1.4.1 release.
>>
>> Thanks for the replicator!
>>
>> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
>>
>> Thank you, Ralph
>>
>> I will use the 1.3.3 for now...
>> while waiting for a future fix release that break this race condiction.
>>
>> márcia
>>
>> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain  wrote:
>>
>>> Looks to me like it is a race condition, and the timing between 1.3.3 and
>>> 1.4 is just enough to trip it. I can break the race, but it will have to be
>>> in a future fix release.
>>>
>>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>>>
>>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>>>
>>> Hi,
>>>
>>> I intend to develop an application using MPI_Comm_spawn to dynamically
>>> create new MPI tasks (or processes).
>>> The structure of the program is like a tree: each node creates 2 new ones
>>> until it reaches a predefined number of levels.
>>>
>>> I developed a small program to explain my problem as can be seen in
>>> attachment.
>>> -- start.c: launches (through MPI_Comm_spawn, in which the argv has the
>>> level value) the root of the tree (a ch_rec program). After the spawn, a
>>> message is sent to the child and the process blocks in an MPI_Recv.
>>> -- ch_rec.c: gets its level value and receives the parent message, then
>>> if its level is less than a predefined limit, it will create 2 children:
>>> - set the level value;
>>> - spawn 1 child;
>>> - send a message;
>>> - call an MPI_Irecv;
>>> - repeat the 4 previous steps for the second child;
>>> - call an MPI_Waitany waiting for the children's returns.
>>> When the children's messages are received, the process sends a message to its
>>> parent and calls MPI_Finalize.
>>>
>>> Using the openmpi-1.3.3 version the program runs as expected but with
>>> openmpi-1.4 I get the following error:
>>>
>>> $ mpirun -np 1 start
>>> level 0
>>> level = 1
>>> Parent sent: level 0 (pid:4279)
>>> level = 2
>>> Parent sent: level 1 (pid:4281)
>>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0]
>>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
>>>
>>> The error happens when my program tries to launch the second child
>>> immediately after the first spawn call.
>>> In my tests I tried putting a sleep of 2 seconds between the first and the
>>> second spawn, and then the program runs as expected.
>>>
>>> Can someone help me with this version 1.4 bug?
>>>
>>> thanks,
>>> márcia.
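
A minimal sketch of the spawn pattern described above (an editor-added illustration, not the start.c/ch_rec.c attachments, which are not included in the archive; "./ch_rec" and the depth limit are placeholders, and the child-side MPI_Comm_get_parent handling is omitted):

/* Each process reads its level from the command line, spawns two children
 * one after the other with the next level in their argv, sends each a
 * message, posts a non-blocking receive for each, and waits for both. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LIMIT 2   /* assumed depth limit */

int main(int argc, char **argv)
{
    int level = (argc > 1) ? atoi(argv[1]) : 0;
    MPI_Comm child[2];
    MPI_Request req[2];
    int reply[2], idx, i;

    MPI_Init(&argc, &argv);

    if (level < LIMIT) {
        for (i = 0; i < 2; i++) {
            char lvl[16];
            char *spawn_argv[] = { lvl, NULL };

            snprintf(lvl, sizeof(lvl), "%d", level + 1);
            MPI_Comm_spawn("./ch_rec", spawn_argv, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &child[i], MPI_ERRCODES_IGNORE);
            MPI_Send(&level, 1, MPI_INT, 0, 0, child[i]);
            MPI_Irecv(&reply[i], 1, MPI_INT, 0, 0, child[i], &req[i]);
        }
        for (i = 0; i < 2; i++)
            MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}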
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listi

Re: [OMPI users] Debugging spawned processes

2009-12-17 Thread jody
yeah, now that you mention it, i remember (old brain here, as well)
But IIRC you created an OMPI version which was called 1.4a1r or something,
where i indeed could use this xterm. When i updated to 1.3.2, i sort
of forgot about it again...

Another question though:
You said "If it includes the -xterm option, then that option gets
applied to the dynamically spawned procs too"
Does this passing on also apply to the -x options?

Thanks
  Jody

On Wed, Dec 16, 2009 at 3:42 PM, Ralph Castain  wrote:
> It is in a later version - pretty sure it made 1.3.3. IIRC, I added it at 
> your request :-)
>
> On Dec 16, 2009, at 7:20 AM, jody wrote:
>
>> Thanks for your reply
>>
>> That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not
>> to recognize the --xterm option.
>> [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf
>> --
>> mpirun was unable to launch the specified application as it could not
>> find an executable:
>>
>> Executable: 1
>> Node: aim-plankton.uzh.ch
>>
>> while attempting to start process rank 0.
>> --
>> (if i reverse the --xterm and -np 1, it complains about not finding
>> executable '9')
>> Do i need to install a higher version, or is this something i'd have
>> to set as option in configure?
>>
>> Thank You
>>  Jody
>>
>> On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain  wrote:
>>> Depends on the version you are working with. If it includes the -xterm 
>>> option, then that option gets applied to the dynamically spawned procs too, 
>>> so this should be automatically taken care of...but in that case, you 
>>> wouldn't need your script to open an xterm anyway. You would just do:
>>>
>>> mpirun --xterm -np 5 gdb ./my_app
>>>
>>> or the equivalent. You would then comm_spawn an argv[0] of "gdb", with 
>>> argv[1] being your target app.
>>>
>>> I don't know how to avoid including that "gdb" in the comm_spawn argv's - I 
>>> once added an mpirun cmd line option to automatically add it, but got 
>>> loudly told to remove it.  Of course, it should be easy to pass an option 
>>> to your app itself that tells it whether or not to do so!
>>>
>>> HTH
>>> Ralph
>>>
>>>
>>> On Dec 16, 2009, at 4:06 AM, jody wrote:
>>>
 Hi
 Until now i always wrote applications for which the number of processes
 was given on the command line with -np.
 To debug these applications i wrote a script, run_gdb.sh, which basically
 opens an xterm and starts gdb in it for my application.
 This allowed me to have a window for each of the processes being debugged.

 Now, however, i write my first application in which additional processes 
 are
 being spawned. My question is now: how can i open xterm windows in which
 gdb runs for the spawned processes?

 The only way i can think of is to pass my script run_gdb.sh into the argv
 parameters of MPI_Spawn.
 Would this be correct?
 If yes, what about other parameters passed to the spawning process, such as
 environment variables passed via -x? Are they being passed to the spawned
 processes as well? In my case this would be necessary so that processes
 on other machine will get the $DISPLAY environment variable in order to
 display their xterms with gdb on my workstation.

 Another negative point would be the need to change the argv parameters
 every time one switches between debugging and normal running.

 Has anybody got some hints on how to debug spawned processes?

 Thank You
  Jody
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] Pointers for understanding failure messages on NetBSD

2009-12-17 Thread Kevin . Buckley

> You could confirm that it is the IPv6 loop by simply disabling IPv6
> support - configure with --disable-ipv6 and see if you still get the error
> messages
>
> Thanks for continuing to pursue this!
> Ralph
>

Yeah, but if you disable the IPv6 stuff then there's a completely
different path taken through the routine in question.

It just does a simple:

inet_ntoa(((struct sockaddr_in*) addr)->sin_addr);

as opposed to the

error = getnameinfo(addr, addrlen,
name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);


that gets invoked, for EVERY addr, if you do enable IPv6, which
seems a little odd.

Actually, looking at the logic of how execution happens in that
routine, it's not clear that an IPv6 addr on NetBSD should get
into the code where the getnameinfo() is called anyway!

A comment just above there even says:

/* hotfix for netbsd: on my netbsd machine, getnameinfo
   returns an unkown error code. */


Turning off IPv6 therefore doesn't actually tickle the previous error
in a different way for IPv4; it simply routes around it altogether.

I still reckon that the "sa_len = 0" is the key here and that must
be being set elsewhere when the addr that is blowing things up is
allowed through.
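
A small diagnostic sketch along these lines (an editor-added illustration, not the Open MPI code; it assumes a BSD-style sockaddr that carries an sa_len field, which Linux does not have) walks the interface list the same way and flags any entry whose sa_len is 0 before it ever reaches getnameinfo():

/* Print every interface address; entries with sa_len == 0 are reported
 * and skipped instead of being handed to getnameinfo(). */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <netdb.h>

int main(void)
{
    struct ifaddrs *ifap, *ifa;
    char host[NI_MAXHOST];

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL)
            continue;
        if (ifa->ifa_addr->sa_len == 0) {
            printf("%s: sa_len == 0, skipping (the suspect entry)\n",
                   ifa->ifa_name);
            continue;
        }
        if (getnameinfo(ifa->ifa_addr, ifa->ifa_addr->sa_len,
                        host, sizeof(host), NULL, 0, NI_NUMERICHOST) != 0) {
            printf("%s: getnameinfo failed\n", ifa->ifa_name);
            continue;
        }
        printf("%s: %s\n", ifa->ifa_name, host);
    }
    freeifaddrs(ifap);
    return 0;
}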

I'll try recompiling the PETSc/PISM stuff on top of an --disable-ipv6
OpenMPI and see what happens. Certainly, when running the SkaMPI tests
the error messages go away, but then they aren't in the code that is
being compiled anymore, so one might expect that !!!

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand