[OMPI users] OpenMPI exits when subsequent tail -f in script is interrupted
Hi,

I'm having a bit of a problem with wrapping mpirun in a script. The script needs to run an MPI job in the background and tail -f the output. Pressing Ctrl+C should stop tail -f, and the MPI job should continue. However, mpirun seems to detect the SIGINT that was meant for tail, and kills the job immediately. I've tried workarounds involving nohup, disown, trap, subshells (including calling the script from within itself), etc., to no avail. Notably, this does not happen if I run the same command directly, without mpirun.

Attached is a script that reproduces the problem. It runs a simple counting script in the background (which takes 10 seconds to complete) and tails the output. If called with "nompi" as the first argument, it simply runs bash -c "$SCRIPT" >& "$out" &; with "mpi" it does the same with 'mpirun -np 1' prepended. The output I get is:

$ ./ompi_bug.sh mpi
mpi:
1
2
3
4
^C
$ ./ompi_bug.sh nompi
nompi:
1
2
3
4
^C
$ cat output.*
mpi:
1
2
3
4
mpirun: killing job...
--
mpirun noticed that process rank 0 with PID 1222 on node pablomme exited on signal 0 (Unknown signal 0).
--
mpirun: clean termination accomplished
nompi:
1
2
3
4
5
6
7
8
9
10
Done

This convinces me that something strange is going on in OpenMPI, since I would expect no difference in signal handling when running a simple command with or without mpirun in the middle. I've looked for options to change this behaviour, but I can't find any. Is there one, preferably in the form of an environment variable? Or is this a bug?

I'm using OpenMPI v1.4.3 as distributed with Ubuntu 11.04, and also v1.2.8 as distributed with OpenSUSE 11.3.

Thanks,
Pablo

ompi_bug.sh.gz
Description: GNU Zip compressed data
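One workaround that may be worth trying (a sketch, untested against mpirun itself; the shell loop below is only a stand-in for the real "mpirun -np 1 ..." job) is to launch the background job in its own session with setsid. Ctrl+C delivers SIGINT to the terminal's foreground process group only, so a job in a separate session never sees it, even if (like mpirun) it installs its own SIGINT handler:

```shell
#!/bin/sh
# The counting loop stands in for: mpirun -np 1 bash -c "$SCRIPT"
out=output.log
setsid sh -c 'for i in 1 2 3 4 5; do echo "$i"; done; echo Done' > "$out" 2>&1 &

# In the real script this would be "tail -f"; interrupting tail with
# Ctrl+C now sends SIGINT only to tail, not to the setsid-detached job.
sleep 1
tail -n +1 "$out"
```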
Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3
On Apr 22, 2011, at 1:42 PM, ya...@adina.com wrote:

> Open MPI 1.4.3 + Intel Compilers V8.1 summary:
> (in case someone likes to refer to it later)
>
> (1) To make all Open MPI executables statically linked and independent of any dynamic libraries, the "--disable-shared" and "--enable-static" options should BOTH be forwarded to configure, and the "-i-static" option should be specified for the Intel compilers too.
>
> (2) It is confirmed that environment variables such as $PATH and $LD_LIBRARY_PATH can be forwarded to slave nodes by specifying options to mpirun. However, mpirun will invoke the orted daemon on master and slave nodes.

This is not correct - mpirun will not invoke an orted daemon on the master node. mpirun itself acts as the local daemon.

> These environment variables passed to slave nodes via mpirun options do not take effect before orted has started.

This is not entirely correct. It depends on the launcher. For rsh/ssh launchers, we do indeed set the environment variables prior to executing the orted daemon. Some launch environments do not support that functionality.

> So if the orted daemon needs these environment variables to run, the only way is to set them in a shared .bashrc or .profile file, visible to both master and slave nodes, say, on a shared NFS partition. There seems to be no other way to resolve this kind of dependence.
>
> Regards,
> Yiguang

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
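For completeness, the mpirun options being discussed look roughly like this (a command sketch, not runnable without a cluster; ./app, the hostfile name, and the install prefix are placeholders):

```shell
# -x exports the named environment variables to the launched processes;
# per the caveat above, whether they are set before orted itself starts
# depends on the launcher (rsh/ssh sets them first, others may not).
mpirun -np 16 --hostfile hosts -x PATH -x LD_LIBRARY_PATH ./app

# --prefix points remote orteds at the Open MPI installation directly,
# which often avoids having to forward PATH/LD_LIBRARY_PATH at all.
mpirun -np 16 --hostfile hosts --prefix /opt/openmpi ./app
```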
Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3
Open MPI 1.4.3 + Intel Compilers V8.1 summary:
(in case someone likes to refer to it later)

(1) To make all Open MPI executables statically linked and independent of any dynamic libraries, the "--disable-shared" and "--enable-static" options should BOTH be forwarded to configure, and the "-i-static" option should be specified for the Intel compilers too.

(2) It is confirmed that environment variables such as $PATH and $LD_LIBRARY_PATH can be forwarded to slave nodes by specifying options to mpirun. However, mpirun will invoke the orted daemon on master and slave nodes. These environment variables passed to slave nodes via mpirun options do not take effect before orted has started. So if the orted daemon needs these environment variables to run, the only way is to set them in a shared .bashrc or .profile file, visible to both master and slave nodes, say, on a shared NFS partition. There seems to be no other way to resolve this kind of dependence.

Regards,
Yiguang
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote:

> On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
>
>> Given that part of our cluster is TCP only, openib wouldn't even start up on those hosts
>
> That is correct - it would have no impact on those hosts
>
>> and this would be ignored on hosts with IB adaptors?
>
> Ummm...not sure I understand this one. The param -will- be used on hosts with IB adaptors because that is what it is controlling.
>
> However, it -won't- have any impact on hosts without IB adaptors, which is what I suspect you meant to ask?

Correct - that was a typo. Thanks. I am going to add the environment variable to our OpenMPI modules so rdmacm is our default for now.

Thanks!

>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>> On Apr 21, 2011, at 6:21 PM, Jeff Squyres wrote:
>>
>>> Over IB, I'm not sure there is much of a drawback. It might be slightly slower to establish QPs, but I don't think that matters much.
>>>
>>> Over iWARP, rdmacm can cause connection storms as you scale to thousands of MPI processes.
>>>
>>> On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
>>>
>>>> We managed to have another user hit the bug that causes collectives (this time MPI_Bcast()) to hang on IB, which was fixed by setting:
>>>>
>>>> btl_openib_cpc_include rdmacm
>>>>
>>>> My question is: if we set this as the default on our system with an environment variable, does it introduce any performance or other issues we should be aware of? Is there a reason we should not use rdmacm?
>>>>
>>>> Thanks!
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> Center for Advanced Computing
>>>> bro...@umich.edu
>>>> (734)936-1985
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
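Since the plan above is to make rdmacm the default via the environment, here is what that typically looks like for Open MPI (a sketch; where the export line lives — modulefile, profile script — is site-specific):

```shell
# Option 1: any MCA parameter <name> can be set as OMPI_MCA_<name> in
# the environment, e.g. from a modulefile or profile script.
export OMPI_MCA_btl_openib_cpc_include=rdmacm

# Option 2: a per-user parameter file (a system-wide one can live under
# $prefix/etc/openmpi-mca-params.conf).
mkdir -p "$HOME/.openmpi"
echo "btl_openib_cpc_include = rdmacm" >> "$HOME/.openmpi/mca-params.conf"
```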
Re: [OMPI users] Bug in MPI_scatterv Fortran-90 implementation
Oops! Missed that; thanks. I've committed the change to the trunk and filed CMRs to bring the fix to v1.4 and v1.5. Thanks for reporting the issue.

On Apr 22, 2011, at 1:03 AM, Stanislav Sazykin wrote:

> Jeff,
>
> No, the patch did not solve the problem. Looking more, there is another place where the interfaces come up, in mpi-f90-interfaces.h.sh in ompi/mpi/f90/scripts
>
> If I manually change the two arguments to arrays from scalars in both scripts after running configure but before "make", then it works.
>
> Stan Sazykin
>
> On 4/21/2011 11:07, Jeff Squyres wrote:
>> I do believe you found a bona-fide bug.
>>
>> Could you try the attached patch? (I think it should only affect f90 "large" builds) You should be able to check it quickly via:
>>
>> cd top_of_ompi_source_tree
>> patch -p0 < scatterv-f90.patch
>> cd ompi/mpi/f90
>> make clean
>> rm mpi_scatterv_f90.f90
>> make all install
>>
>> On Apr 21, 2011, at 10:37 AM, Stanislav Sazykin wrote:
>>
>>> Hello,
>>>
>>> I came across what appears to be an error in the implementation of the MPI_scatterv Fortran-90 version. I am using OpenMPI 1.4.3 on Linux. This comes up when OpenMPI was configured with --with-mpi-f90-size=medium or --with-mpi-f90-size=large.
>>>
>>> The standard specifies that the interface is
>>>
>>> MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF,
>>>              RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
>>>     <type> SENDBUF(*), RECVBUF(*)
>>>     INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE
>>>
>>> so that SENDCOUNTS and DISPLS are integer arrays. However, if I compile a Fortran code with calls to MPI_scatterv and compile with argument checks, two Fortran compilers (Intel and Lahey) produce fatal errors saying there is no matching interface.
>>>
>>> Looking in the source code of OpenMPI, I see that in ompi/mpi/f90/scripts, the script mpi_scatterv_f90.f90.sh that is invoked when running "make" produces Fortran interfaces that list both SENDCOUNTS and DISPLS as
>>>
>>> integer, intent(in) ::
>>>
>>> This appears to be an error, as it would be illegal to pass a scalar variable and receive it as an array in the subroutine. I have not figured out what happens in the code at this invocation (the code is complicated), but it seems like a segfault situation.
>>>
>>> --
>>> Stan Sazykin

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
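The change being described amounts to declaring SENDCOUNTS and DISPLS as assumed-size arrays in the generated interface. A paraphrased sketch of the corrected declarations (the real interface in Open MPI is generated by the *.sh scripts for many buffer types, not hand-written like this):

```fortran
! Sketch only: sendbuf/recvbuf are shown as REAL stand-ins for the many
! generated type variants. The key fix is the "(*)" on sendcounts/displs.
subroutine MPI_Scatterv_sketch(sendbuf, sendcounts, displs, sendtype, &
                               recvbuf, recvcount, recvtype, root, comm, ierror)
  real, intent(in)     :: sendbuf(*)
  integer, intent(in)  :: sendcounts(*)  ! was (buggy): integer, intent(in) :: sendcounts
  integer, intent(in)  :: displs(*)      ! was (buggy): integer, intent(in) :: displs
  integer, intent(in)  :: sendtype
  real, intent(out)    :: recvbuf(*)
  integer, intent(in)  :: recvcount, recvtype, root, comm
  integer, intent(out) :: ierror
end subroutine MPI_Scatterv_sketch
```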
Re: [OMPI users] MPI_Gatherv error
I wonder if this is related to the problem reported in "[OMPI users] Bug in MPI_scatterv Fortran-90 implementation".

On Thu, Apr 21, 2011 at 7:19 PM, Zhangping Wei wrote:

> Dear all,
>
> I am a beginner with MPI. Right now I am trying to use MPI_GATHERV in my code; the test code just gathers the values of array A and stores them in array B, but I get the following error:
>
> 'Fatal error in MPI_Gatherv: Invalid count, error stack:
>
> PMPI_Gatherv<398>: MPI_Gatherv failed rbuf=0049AC0, rcnts=003DCB8, displs=003D4C68, MPI_REAL, root=0, MPI_COMM_WORLD> failed
>
> PMPI_Gatherv<317>: Negative count, value is -842150451'
>
> Here I post my program with the email; I wonder if anyone can help me fix it. I guess my error comes from the sending or receiving buffer and the displacement of the stored values. I tried changing 'B,jlen,idisp' to 'B(1,1),jlen(myid),idisp(myid)' and other things, but I still cannot work it out.
>
> I am looking forward to some help from you.
>
> Zhangping Wei
>
> My code is:
>
> PROGRAM MAIN
> IMPLICIT NONE
> INCLUDE 'mpif.h'
> INTEGER I,J,IWORK,JWORK,I1,I2,J1,J2
> REAL A(16,16),B(16,16)
> INTEGER,ALLOCATABLE ::idisp(:),jlen(:)
> integer myid,numprocs,rc,ierr,istar,iend,jstar,jend
> integer status(MPI_STATUS_SIZE)
> CALL MPI_INIT(ierr)
> CALL MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)
> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,numprocs,ierr)
> ! PRINT *,'process ',myid, 'of',numprocs, 'is alive.'
> allocate(idisp(0:numprocs-1),jlen(0:numprocs-1))
> DO J=1,16
>   DO I=1,16
>     A(I,J)=I+J
>     B(I,J)=0.0
>   ENDDO
> ENDDO
> I1=1;I2=16;J1=1;J2=16
> JWORK=(J2-J1)/numprocs+1
> JSTAR=MIN(myid*JWORK+J1,J2+1)
> JEND=MIN(JSTAR+JWORK-1,J2)
> ISTAR=I1
> IEND=I2
> PRINT *,myid,istar,iend,jstar,jend
> jlen(myid)=16*(jend-jstar+1)
> idisp(myid)=16*(jstar-1)
> print *,myid,jlen(myid),idisp(myid)
> CALL MPI_GATHERV(A(1,jstar),jlen(myid),MPI_REAL,
>      *B,jlen,idisp,MPI_REAL,0,MPI_COMM_WORLD,IERR)
> IF(myid.EQ.0)THEN
>   DO J=1,16
>     DO I=1,16
>       PRINT *,I,J,B(I,J)
>     ENDDO
>   ENDDO
> ENDIF
> CALL MPI_Finalize(rc)
> END PROGRAM

--
David Zhang
University of California, San Diego
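A likely culprit, independent of any library bug: the root rank reads jlen and idisp for every rank, but the code fills only the myid-th entry on each process, so the root's remaining entries are uninitialized (-842150451 is the signed value of 0xCDCDCDCD, a common uninitialized-heap fill pattern). A hedged sketch of the fix, reusing the poster's variable names, is to compute the counts and displacements for all ranks on every process:

```fortran
! Hypothetical fix sketch: fill jlen/idisp for ALL ranks (0..numprocs-1),
! not just myid, since the root dereferences every entry of MPI_GATHERV's
! recvcounts and displs arrays. K, KSTAR, KEND are new helper variables.
INTEGER K,KSTAR,KEND
DO K=0,numprocs-1
  KSTAR=MIN(K*JWORK+J1,J2+1)
  KEND=MIN(KSTAR+JWORK-1,J2)
  jlen(K)=16*(KEND-KSTAR+1)
  idisp(K)=16*(KSTAR-1)
ENDDO
```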
Re: [OMPI users] huge VmRSS on rank 0 after MPI_Init when using "btl_openib_receive_queues" option
It varies with the receive_queues specification *and* with the number of MPI processes:

memory_consumed = nb_mpi_process * nb_buffers * (buffer_size + low_buffer_count_watermark + credit_window_size)

éloi

On 04/22/2011 12:26 AM, Jeff Squyres wrote:

> Does it vary exactly according to your receive_queues specification?
>
> On Apr 19, 2011, at 9:03 AM, Eloi Gaudry wrote:
>
>> hello,
>>
>> i would like to get your input on this: when launching a parallel computation on 128 nodes using openib and the "-mca btl_openib_receive_queues P,65536,256,192,128" option, i observe a rather large resident memory consumption (2GB: 65536*256*128) on the process with rank 0 (and only this process) just after a call to MPI_Init.
>>
>> i'd like to know why the other processes don't behave the same:
>> - other processes located on the same node don't use that amount of memory
>> - all other processes (i.e. located on any other node) don't either
>>
>> i'm using OpenMPI-1.4.2, built with gcc-4.3.4 and '--enable-cxx-exceptions --with-pic --with-threads=posix' options.
>>
>> thanks for your help,
>> éloi

--
Eloi Gaudry
Senior Product Development Engineer
Free Field Technologies
Company Website: http://www.fft.be
Direct Phone Number: +32 10 495 147
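Plugging in the numbers from the original report, the dominant term of the formula above reproduces the observed 2 GB exactly (a back-of-the-envelope check; the watermark and credit-window terms are ignored here):

```shell
# P,65536,256,...: 256 pre-posted buffers of 65536 bytes per peer,
# times the 128 MPI processes connecting to rank 0.
buffer_size=65536
nb_buffers=256
nb_procs=128
bytes=$(( nb_procs * nb_buffers * buffer_size ))
echo "$bytes bytes"            # 2147483648 bytes
echo "$(( bytes >> 30 )) GB"   # 2 GB
```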
Re: [OMPI users] Bug in MPI_scatterv Fortran-90 implementation
Jeff,

No, the patch did not solve the problem. Looking more, there is another place where the interfaces come up, in mpi-f90-interfaces.h.sh in ompi/mpi/f90/scripts

If I manually change the two arguments to arrays from scalars in both scripts after running configure but before "make", then it works.

Stan Sazykin

On 4/21/2011 11:07, Jeff Squyres wrote:

> I do believe you found a bona-fide bug.
>
> Could you try the attached patch? (I think it should only affect f90 "large" builds) You should be able to check it quickly via:
>
> cd top_of_ompi_source_tree
> patch -p0 < scatterv-f90.patch
> cd ompi/mpi/f90
> make clean
> rm mpi_scatterv_f90.f90
> make all install
>
> On Apr 21, 2011, at 10:37 AM, Stanislav Sazykin wrote:
>
>> Hello,
>>
>> I came across what appears to be an error in the implementation of the MPI_scatterv Fortran-90 version. I am using OpenMPI 1.4.3 on Linux. This comes up when OpenMPI was configured with --with-mpi-f90-size=medium or --with-mpi-f90-size=large.
>>
>> The standard specifies that the interface is
>>
>> MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF,
>>              RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
>>     <type> SENDBUF(*), RECVBUF(*)
>>     INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE
>>
>> so that SENDCOUNTS and DISPLS are integer arrays. However, if I compile a Fortran code with calls to MPI_scatterv and compile with argument checks, two Fortran compilers (Intel and Lahey) produce fatal errors saying there is no matching interface.
>>
>> Looking in the source code of OpenMPI, I see that in ompi/mpi/f90/scripts, the script mpi_scatterv_f90.f90.sh that is invoked when running "make" produces Fortran interfaces that list both SENDCOUNTS and DISPLS as
>>
>> integer, intent(in) ::
>>
>> This appears to be an error, as it would be illegal to pass a scalar variable and receive it as an array in the subroutine. I have not figured out what happens in the code at this invocation (the code is complicated), but it seems like a segfault situation.
>>
>> --
>> Stan Sazykin