[OMPI users] multi-threaded MPI
Hi All -

I am working on a networked cache for an out-of-core application. Currently I have it set up with several worker threads and one "request" thread per node. The worker threads check the cache on their own node first; on a miss, they make a request to the other nodes in the cluster to see who has the data. The request thread answers requests, and if a node is chosen to deliver data, the request thread spawns another thread to handle that particular request.

Currently my application dies in MPI_Barrier before any computation begins (but after my request threads are spawned). After looking into this a bit, it seems that Open MPI has to have thread support to handle a model like this (i.e. multiple Sends and Recvs happening at once per process). According to

  > ompi_info | grep Thread
    Thread support: posix (mpi: no, progress: no)

I don't have this thread support. I am running Open MPI v1.1.2 (the latest openmpi package in Gentoo). Can anyone recommend which version to try?

Thanks,
Brian
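For a design like this, with sends and receives issued concurrently from several threads of one process, MPI must be initialized with MPI_Init_thread and must actually grant MPI_THREAD_MULTIPLE (the "mpi: no" in the ompi_info output above means the build cannot grant it). A minimal sketch of the check, assuming an Open MPI build configured with thread support:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multi-threaded support instead of plain MPI_Init(). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* The library fell back to a weaker level; concurrent
         * Send/Recv from several threads is not safe here. */
        fprintf(stderr, "thread level %d provided, need %d\n",
                provided, MPI_THREAD_MULTIPLE);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... worker and request threads may now call MPI concurrently ... */

    MPI_Finalize();
    return 0;
}
```

Checking `provided` at startup turns the silent MPI_Barrier hang into an immediate, explicit failure on builds without thread support.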
[OMPI users] MPI Spawn terminates application
Greetings,

when MPI_Comm_spawn cannot launch an application for whatever reason, the entire job is cancelled with a message like the following. Is there a way to handle this nicely, e.g. by throwing an exception? I understand this cannot work when the job is first started with mpirun, as there is no application yet to fall back on, but for a running application it should be possible to simply inform it that the spawn request failed. The application could then handle the error and terminate gracefully. I did enable C++ exceptions, btw, so I guess this is not implemented. Is there a technical (e.g. architectural) reason behind this, or is it simply a yet-to-be-added feature?

All the best,
Murat
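For reference, the standard MPI-2 mechanism for getting an error code back instead of an abort is to replace the default MPI_ERRORS_ARE_FATAL handler on the communicator before spawning. Whether this Open MPI version actually returns from a failed spawn is exactly the open question above, so this is a sketch of the standard mechanism, not a confirmed workaround (the "./worker" child command is made up):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    int err, errcodes[4];

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes rather than aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    err = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                         0, MPI_COMM_WORLD, &child, errcodes);
    if (err != MPI_SUCCESS) {
        /* Reached only if the implementation honors MPI_ERRORS_RETURN
         * for spawn failures. */
        fprintf(stderr, "spawn failed, shutting down gracefully\n");
        MPI_Finalize();
        return 1;
    }

    /* ... talk to the children via the 'child' intercommunicator ... */
    MPI_Finalize();
    return 0;
}
```

(With the C++ bindings, enabling exceptions maps the same error path onto a thrown MPI::Exception, again only if the implementation returns from the failure.)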
Re: [OMPI users] openib errors as user, but not root
Ah! It WAS the torque startup script they provide! It pays to get into the weeds.

Brian Andrus
perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Wednesday, November 07, 2007 4:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib errors as user, but not root

Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openib errors as user, but not root
I have checked those out. I am trying to test limits. If I ssh directly to a node and check, everything is ok:

  [andrus@login1 ~]$ ssh n01 ulimit -l
  unlimited

The settings in /etc/security/limits.conf are right too.

Brian Andrus
perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866
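The subtlety here (which the resolution at the top of this thread confirms) is that `ssh n01 ulimit -l` reports the limit of a shell spawned by sshd, while a job's processes inherit the limits of the pbs_mom daemon, which may have been started by its init script before limits.conf applied. A quick way to see what a job really gets is to run `ulimit -l` from inside a submitted job; a minimal sketch, with illustrative PBS directives to adjust for your site:

```shell
# What an interactive shell reports:
ulimit -l

# What a batch job reports: write a minimal job script and submit it.
cat > check_memlock.sh <<'EOF'
#!/bin/bash
#PBS -j oe
#PBS -o memlock.txt
ulimit -l
EOF

# On the cluster: qsub check_memlock.sh && cat memlock.txt
# If memlock.txt shows 32768 while the ssh check shows "unlimited",
# the daemon's startup environment is the culprit.
```

If the two values differ, restarting pbs_mom after raising its limits (or fixing its startup script) is the usual remedy.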
Re: [OMPI users] openib errors as user, but not root
Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.

On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

> Ok, I am having some difficulty troubleshooting this.
> [...]

--
Jeff Squyres
Cisco Systems
[OMPI users] openib errors as user, but not root
Ok, I am having some difficulty troubleshooting this.

If I run my hello program without torque, it works fine:

  [root@login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
  Hello from process 0 of 5 on node n01
  Hello from process 1 of 5 on node n02
  Hello from process 2 of 5 on node n03
  Hello from process 3 of 5 on node n04
  Hello from process 4 of 5 on node n05

If I submit it as root, it seems happy:

  [root@login1 root]# qsub
  #!/bin/bash
  #PBS -j oe
  #PBS -l nodes=5:ppn=1
  #PBS -W x=NACCESSPOLICY:SINGLEJOB
  #PBS -N TestJob
  #PBS -q long
  #PBS -o output.txt
  #PBS -V
  cd $PBS_O_WORKDIR
  rm -f output.txt
  date
  mpirun --mca btl openib,self /data/root/hello
  103.cluster.default.domain
  [root@login1 root]# cat output.txt
  Wed Nov 7 16:20:33 PST 2007
  Hello from process 0 of 5 on node n05
  Hello from process 1 of 5 on node n04
  Hello from process 2 of 5 on node n03
  Hello from process 3 of 5 on node n02
  Hello from process 4 of 5 on node n01

If I do it as me, not so good:

  [andrus@login1 data]$ qsub
  #!/bin/bash
  #PBS -j oe
  #PBS -l nodes=1:ppn=1
  #PBS -W x=NACCESSPOLICY:SINGLEJOB
  #PBS -N TestJob
  #PBS -q long
  #PBS -o output.txt
  #PBS -V
  cd $PBS_O_WORKDIR
  rm -f output.txt
  date
  mpirun --mca btl openib,self /data/root/hello
  105.littlemac.default.domain
  [andrus@login1 data]$ cat output.txt
  Wed Nov 7 16:23:00 PST 2007
  --------------------------------------------------------------------------
  The OpenIB BTL failed to initialize while trying to allocate some
  locked memory. This typically can indicate that the memlock limits
  are set too low. For most HPC installations, the memlock limits
  should be set to "unlimited". The failure occured here:

    Host:          n01
    OMPI source:   btl_openib.c:828
    Function:      ibv_create_cq()
    Device:        mthca0
    Memlock limit: 32768

  You may need to consult with your system administrator to get this
  problem fixed. This FAQ entry on the Open MPI web site may also be
  helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
  --------------------------------------------------------------------------
  It looks like MPI_INIT failed for some reason; your parallel process is
  likely to abort. There are many reasons that a parallel process can fail
  during MPI_INIT; some of which are due to configuration or environment
  problems. This failure appears to be an internal failure; here's some
  additional information (which may only be relevant to an Open MPI
  developer):

    PML add procs failed
    --> Returned "Error" (-1) instead of "Success" (0)
  --------------------------------------------------------------------------
  *** An error occurred in MPI_Init
  *** before MPI was initialized
  *** MPI_ERRORS_ARE_FATAL (goodbye)

I have checked that ulimit is unlimited. I cannot seem to figure this. Any help?

Brian Andrus
perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866
Re: [OMPI users] Double Standard Output for Non-MPI on Itanium Running Red Hat Enterprise Linux 4.0
Please understand that I'm decent at the engineering side of it. As a system administrator, I'm a decent engineer.

On the previous configurations, this program seems to run with any number of processors. I believe these successful users have been using LAM/MPI. While I was waiting for a reply, I installed LAM/MPI. The results were similar to those from Open MPI. While I can choose LAM/MPI, I'd prefer to port it to Open MPI since that is where all the development and most of the support are. I cannot choose the Portland compiler. I must use either GNU or Intel compilers on the Itanium2.

Ted (more responses below)

On November 7, 2007 at 8:39 AM, Jeff Squyres wrote:

> On Nov 5, 2007, at 4:12 PM, Benjamin, Ted G. wrote:
>
>> I have a code that runs with both Portland and Intel compilers on X86,
>> AMD64 and Intel EM64T running various flavors of Linux on clusters. I am
>> trying to port it to a 2-CPU Itanium2 (ia64) running Red Hat Enterprise
>> Linux 4.0; it has gcc 3.4.6-8 and the Intel Fortran compiler 10.0.026
>> installed. I have built Open MPI 1.2.4 using these compilers.
>> [...]
>> (1) With the source compiled at optimization -O0 and -np 1, the job runs
>> very slowly (6 days on the wall clock) to the correct answer on the
>> benchmark;
>> (2) With the source compiled at optimization -O0 and -np 2, the
>> benchmark job fails with a segmentation violation;
>
> Have you tried running your code through a memory-checking debugger,
> and/or examining any corefiles that were generated to see if there is a
> problem in your code?
>
> I will certainly not guarantee that Open MPI is bug free, but problems
> like this are *usually* application-level issues. One place I always
> start is running the application in a debugger to see if you can catch
> exactly where the Badness happens. This can be most helpful.

I have tried to run a debugger, but I am not an expert at it. I could not get Intel's idb debugger to give me a prompt, but I could get a prompt from gdb. I've looked over the manual, but I'm not sure how to put in the breakpoints et al. that you geniuses use to evaluate a program at critical junctures. I actually used an "mpirun -np 2 gdb" command to run it on 2 CPUs. I attached the file at the prompt. When I did a run, it ran fine with no optimization and one processor. With 2 processors, it didn't seem to do anything. All I will say here is that I have a lot to learn. I'm calling on my friends for help on this.

> (3) With the source compiled at all other optimization (-O1, -O2, -O3)
> and processor combinations (-np 1 and -np 2), it fails in what I would
> call a "quiescent" manner. What I mean by this is that it does not
> produce any error messages. When I submit the job, it produces a little
> standard output and it quits after 2-3 seconds.
>
> That's fun. Can you tell if it runs the app at all, or if it dies before
> main() starts? This is probably more of an issue for your intel support
> guy than us...

It's a Fortran program. It starts in the main program. I inserted some PRINT* statements of the "PRINT*,'Read the input at line 213'" variety into the main program to see what would print. It printed the first four statements, but it didn't reach the last three. The calls that were reached were in the set-up section of the program. The section that wasn't reached had a lot of matrix-setting and solving subroutine calls. I'm going to point my Intel support person to this post and see where it takes us.

> In an attempt to find the problem, the technical support agent at Intel
> has had me run some simple "Hello" problems. The first one is an MPI
> hello code that is the attached hello_mpi.f. This ran as expected, and
> it echoed one "Hello" for each of the two processors. The second one is
> a non-MPI hello that is the attached hello.f90. Since it is a non-MPI
> source, I was told that running it on a workstation with a properly
> configured MPI should only echo one "Hello".
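A note on the debugging attempt described above: `mpirun -np 2 gdb` tends to misbehave because both ranks fight over one terminal's stdin. A common recipe (described in the Open MPI FAQ) is to give each rank its own terminal and debugger; a sketch, assuming an X display is available and the code was built with -g (the executable name is illustrative):

```shell
# One xterm+gdb per rank; in each gdb window:
#   (gdb) break MAIN__    # MAIN__ is the usual entry symbol for a Fortran main
#   (gdb) run
# then step until the rank that segfaults stops with a backtrace ('bt').
mpirun -np 2 xterm -e gdb ./benchmark.exe
```

This way each rank stops independently at the breakpoint, which makes it much easier to see which rank dies and where.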
Re: [OMPI users] Segmentation fault
On Wed, Nov 07, 2007 at 07:00:31 -0800, Francesco Pietra wrote:

> I was lucky, given my modest skill with systems. In a couple of hours the
> system is OK again. DOCK, configured for MPICH and compiled gcc, is now
> running parallel with pointing to OpenMPI 1.2.3 compiled ifort/icc. top -i
> shows all processors doing their job and I waited to post until the
> procedure ended correctly. Thanks francesco
>
> , according to benchmarks carried out by a number of guys) (intels are free
> as gnu for my private use). And pointing MPICH for a program compiled gnu C
> (like DOCK) to OpenMPI compiled intel was OK. ifort does not work if no icc
> is present, so you may understand why.

that is not true - ifort lives happily without icc - also in OpenMPI context (and MPICH).

Karsten
Re: [OMPI users] Segmentation fault
On Wed, Nov 07, 2007 at 07:00:31 -0800, Francesco Pietra wrote:

> , according to benchmarks carried out by a number of guys) (intels are free
> as gnu for my private use). And pointing MPICH for a program compiled gnu C
> (like DOCK) to OpenMPI compiled intel was OK. ifort does not work if no icc
> is present, so you may understand why.

that is not true - ifort lives happily without icc - also in OpenMPI context (and MPICH).

Karsten

--
Karsten Bolding        Bolding & Burchard Hydrodynamics
Strandgyden 25         Phone: +45 64422058
DK-5466 Asperup        Fax:   +45 64422068
Denmark                Email: kars...@bolding-burchard.com
http://www.findvej.dk/Strandgyden25,5466,11,3
Re: [OMPI users] Segmentation fault
--- Adrian Knoth wrote:

> On Wed, Nov 07, 2007 at 08:09:14AM -0500, Jeff Squyres wrote:
>
>> I'm not familiar with DOCK or Debian, but you will definitely have
>
> And last but not least, I'd like to point to the official Debian package
> for OMPI:
>
>   http://packages.debian.org/openmpi
>
> --
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany
> private: http://adi.thur.de

Surely not last. My OpenMPI was intel compiled. Simple reason: Amber9, as a Fortran program, runs faster on intels than on gnu (or any other known compiler, according to benchmarks carried out by a number of guys) (intels are free as gnu for my private use). And pointing MPICH for a program compiled gnu C (like DOCK) to OpenMPI compiled intel was OK. ifort does not work if no icc is present, so you may understand why.

In summary, when posing a question, either there is a suggestion how to possibly come out, or side comments are junk.

Thanks
francesco
Re: [OMPI users] problems compiling svn-version
works fine now. In earth sciences - at least oceanography and meteorology - Fortran is still the language of choice.

kb

On Wed, Nov 07, 2007 at 12:25:06 +0100, Adrian Knoth wrote:

> On Wed, Nov 07, 2007 at 10:41:55AM +, Karsten Bolding wrote:
>
>> there is no support for Fortran - even though F77 and F90 are set as
>
> Fortran? Who needs Fortran? ;)
>
> Check line 151 in the Makefile. We've disabled Fortran for our developer
> builds, as we're interested in OMPI, not in Fortran.
>
> You can simply remove the two "--disable-mpi-*" switches.
>
> HTH

--
Karsten Bolding        Bolding & Burchard Hydrodynamics
Strandgyden 25         Phone: +45 64422058
DK-5466 Asperup        Fax:   +45 64422068
Denmark                Email: kars...@bolding-burchard.com
http://www.findvej.dk/Strandgyden25,5466,11,3
Re: [OMPI users] Job does not quit even when the simulation dies
As Jeff indicated, the degree of capability has improved over time - I'm not sure which version this represents.

The type of failure also plays a major role in our ability to respond. If a process actually segfaults or dies, we usually pick that up pretty well and abort the rest of the job (certainly, that seems to be working pretty well in the 1.2 series and beyond). If an MPI communication fails, I'm not sure what the MPI layer does - I believe it may retry for awhile, but I don't know how robust the error handling is in that layer. Perhaps someone else could address that question.

If an actual node fails, then we don't handle that very well at all, even in today's development version. The problem is that we need to rely on the daemon on that node to tell us that the local procs died - if the node dies, then the daemon can't do that, so we never know it happened. We are working on solutions to that problem. Hopefully, we will have at least a preliminary version in the next release.

Ralph

On 11/7/07 6:44 AM, "Jeff Squyres" wrote:

> Support for failure scenarios is something that is getting better over
> time in Open MPI.
>
> It looks like the version you are using either didn't properly catch
> that there was a failure and/or then cleanly exit all MPI processes.
>
> On Nov 6, 2007, at 9:01 PM, Teng Lin wrote:
>
>> Hi,
>>
>> Just realize I have a job run for a long time, while some of the nodes
>> already die. Is there any way to ask other nodes to quit?
>>
>> [kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
>> [kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with errno=104
>>
>> The FAQ does mention it is related to:
>>
>>   Connection reset by peer: These types of errors usually occur after
>>   MPI_INIT has completed, and typically indicate that an MPI process has
>>   died unexpectedly (e.g., due to a seg fault). The specific error
>>   message indicates that a peer MPI process tried to write to the now-
>>   dead MPI process and failed.
>>
>> Thanks,
>> Teng
Re: [OMPI users] Job does not quit even when the simulation dies
Support for failure scenarios is something that is getting better over time in Open MPI.

It looks like the version you are using either didn't properly catch that there was a failure and/or then cleanly exit all MPI processes.

On Nov 6, 2007, at 9:01 PM, Teng Lin wrote:

> Hi,
>
> Just realize I have a job run for a long time, while some of the nodes
> already die. Is there any way to ask other nodes to quit?
>
> [kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
> [kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with errno=104
>
> The FAQ does mention it is related to:
>
>   Connection reset by peer: These types of errors usually occur after
>   MPI_INIT has completed, and typically indicate that an MPI process has
>   died unexpectedly (e.g., due to a seg fault). The specific error
>   message indicates that a peer MPI process tried to write to the now-
>   dead MPI process and failed.
>
> Thanks,
> Teng

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Double Standard Output for Non-MPI on Itanium Running Red Hat Enterprise Linux 4.0
On Nov 5, 2007, at 4:12 PM, Benjamin, Ted G. wrote:

> I have a code that runs with both Portland and Intel compilers on X86,
> AMD64 and Intel EM64T running various flavors of Linux on clusters. I am
> trying to port it to a 2-CPU Itanium2 (ia64) running Red Hat Enterprise
> Linux 4.0; it has gcc 3.4.6-8 and the Intel Fortran compiler 10.0.026
> installed. I have built Open MPI 1.2.4 using these compilers.
>
> When I built the Open MPI, I didn't do anything special. I enabled debug,
> but that was really all. Of course, you can see that in the config file
> that is attached. This system is not part of a cluster. The two onboard
> CPUs (an HP zx6000) are the only processors on which the job runs. The
> code must run on MPI because the source calls it. I compiled the target
> software using the Fortran90 compiler (mpif90). I've been running the
> code in the foreground so that I could keep an eye on its behavior.
>
> When I try to run the compiled and linked code [mpirun -np # {executable
> file}], it performs as shown below:
>
> (1) With the source compiled at optimization -O0 and -np 1, the job runs
> very slowly (6 days on the wall clock) to the correct answer on the
> benchmark;
> (2) With the source compiled at optimization -O0 and -np 2, the benchmark
> job fails with a segmentation violation;

Have you tried running your code through a memory-checking debugger, and/or examining any corefiles that were generated to see if there is a problem in your code?

I will certainly not guarantee that Open MPI is bug free, but problems like this are *usually* application-level issues. One place I always start is running the application in a debugger to see if you can catch exactly where the Badness happens. This can be most helpful.

> (3) With the source compiled at all other optimization (-O1, -O2, -O3)
> and processor combinations (-np 1 and -np 2), it fails in what I would
> call a "quiescent" manner. What I mean by this is that it does not
> produce any error messages. When I submit the job, it produces a little
> standard output and it quits after 2-3 seconds.

That's fun. Can you tell if it runs the app at all, or if it dies before main() starts? This is probably more of an issue for your intel support guy than us...

> In an attempt to find the problem, the technical support agent at Intel
> has had me run some simple "Hello" problems. The first one is an MPI
> hello code that is the attached hello_mpi.f. This ran as expected, and it
> echoed one "Hello" for each of the two processors. The second one is a
> non-MPI hello that is the attached hello.f90. Since it is a non-MPI
> source, I was told that running it on a workstation with a properly
> configured MPI should only echo one "Hello"; the Intel agent told me that
> two such echoes indicate a problem with Open MPI. It echoed twice, so now
> I have come to you for help.

I'm not sure what you mean by that. If you:

  mpirun -np 4 hostname

where "hostname" is a non-MPI program (e.g., /bin/hostname), you'll still see the output 4 times because you told MPI to run 4 copies of "hostname". In this way, Open MPI is just being used as a job launcher.

So if I'm understanding you right,

  mpirun -np 2 my_non_mpi_f90_hello_app

should still print 2 copies of "hello". If it does, then Open MPI is doing exactly what it should do. Specifically: Open MPI's mpirun can be used to launch non-MPI applications (the same is not necessarily true for other MPI implementations).

> The other three attached files are the output requested on the "Getting
> Help" page - (1) the output of /sbin/ifconfig, (2) the output of
> ompi_info --all and (3) the config.log file. The installation of the Open
> MPI itself was as easy as could be. I am really ignorant of how it works
> beyond what I've read from the FAQs and learned in a little digging, so I
> hope it's a simple solution.

FWIW, I see that you're using Open MPI v1.2. Our latest version is v1.2.4; if possible, you might want to try to upgrade (e.g., delete your prior installation, recompile/reinstall Open MPI, and then recompile/relink your application against the new Open MPI installation); it has all of our latest bug fixes, etc.

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Segmentation fault
Hi Jeff: I understand that my question was posed in extremely vague terms. Though, pointing MPICH to the installation of OpenMPI was suggested by the author of DOCK and it performed perfectly for a long while, until yesterday. Perhaps, could you please instruct me how to verify beyond doubt if the "apt-get update" has modified the version of OpenMPI that was originally installed (1.2.3)? On its side, Debian Linux is a perfectly standard Linux. francesco --- Jeff Squyres wrote: > I'm not familiar with DOCK or Debian, but you will definitely have > problems if you mix-n-match MPI implementations. Specifically, the > mpi.h files are not compatible between MPICH and Open MPI. > > Additionally, you may run into problems if you compile your app with > one version of Open MPI and then run it with another. We have not > [yet] done anything in terms of binary compatibility between versions. > > > On Nov 7, 2007, at 8:05 AM, Francesco Pietra wrote: > > > I wonder whether any suggestion can be offered about segmentation > > fault > > occurring on running a docking program (DOCK 6.1, written in C) on > > Debian Linux > > amd64 etch, i.e. dual opterons machine. Running DOCK6.1 parallel was > > OK until > > yesterday. I vaguely remember that before these problems I carried > > out a > > > > apt-get upgrade > > > > and something was done for OpenMPI. 
> > > > DOCK 6.1 was compiled: > > > > ./configure gnu parallel > > MPICH_HOME=/usr/local > > export MPICH_HOME > > make dock > > > > by pointing MPICH (for which DOCK 6.1 is configured) to my > > installation of > > OpenMPI 1.2.3 > > > > In my .bashrc: > > > > DOCK_HOME=/usr/local/dock6 > > PATH=$PATH:$DOCK_HOME/bib; export DOCK_HOME PATH > > > > MPI_HOME=/usr/local > > export MPI_home > > > > > > which mpicxx > > /usr/local/bin/mpicxx > > > > > > > > updatedb > > locate mpi.h > > /usr/include/sc/util/group/memmtmpi.h > > /usr/include/sc/util/group/messmpi.h > > /usr/dock6/src/dock/base_mpi.h > > /usr/local/include/mpi.h > > /usr/local/openmpi-1.2.3/ompi/include/mpi.h > > /usr/local/openmpi-1.2.3/ompi/include/mpi.h.in > > /usr/local/openmpi-1.2.3/ompi/mpi/f77/prototypes_mpi.h > > --- > > > > On this basis, running: > > > > mpirun -np 4 dock6.mpi -i dock.in -o dock.out > > > > the process halted with error message: > > > > Initialing MPI routines > > [deb64:03540] *** Process received signal *** > > [deb64:03540] Signal: Segmentation fault (11) > > [deb64:03540] Signal code: Address not mapped (1) > > [deb64:03540] Failing at address: 0x2b9ef5691000 > > dock6.mpi[3540]: segfault at 2b9ef5691 rip 00447b1b > > rsp > > 7fff43c137b0 error 6 > > [deb64:03540] [0] /lib/libthread.so.0 [0x2b9e681bc410] > > [deb64:03540] [1] dock6.mpi (_ZN60rient12match_ligandER7DOCKMol+0x40b) > > [0x447b1b] > > [deb64:03540] [2] dock6.mpi (main+0xaf5) [0x42cc75] > > [deb64:03540] [3] dock6.mpi /lib/libc.so.6(__libc_start_main+0xda) > > [0x2b9e682e14ca] > > [deb64:03540] [4] dock6.mpi (__gxx_personality_v0+0xc2) [0x41b4ea] > > [deb64:03540] *** End of error message *** > > mpirun noticed that job rank 0 with PID 3537 on node deb64 exited on > > signal 15 > > (Terminated). > > 3 additional processes aborted (not shown) > > > > > > Thanks > > francesco pietra > > > > __ > > Do You Yahoo!? > > Tired of spam? Yahoo! 
Mail has the best spam protection around > > http://mail.yahoo.com > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users

> -- 
> Jeff Squyres
> Cisco Systems
Re: [OMPI users] Segmentation fault
On Wed, Nov 07, 2007 at 08:09:14AM -0500, Jeff Squyres wrote: > I'm not familiar with DOCK or Debian, but you will definitely have And last but not least, I'd like to point to the official Debian package for OMPI: http://packages.debian.org/openmpi -- Cluster and Metacomputing Working Group Friedrich-Schiller-Universität Jena, Germany private: http://adi.thur.de
Re: [OMPI users] Segmentation fault
I'm not familiar with DOCK or Debian, but you will definitely have problems if you mix-n-match MPI implementations. Specifically, the mpi.h files are not compatible between MPICH and Open MPI. Additionally, you may run into problems if you compile your app with one version of Open MPI and then run it with another. We have not [yet] done anything in terms of binary compatibility between versions. On Nov 7, 2007, at 8:05 AM, Francesco Pietra wrote: I wonder whether any suggestion can be offered about segmentation fault occurring on running a docking program (DOCK 6.1, written in C) on Debian Linux amd64 etch, i.e. dual opterons machine. Running DOCK6.1 parallel was OK until yesterday. I vaguely remember that before these problems I carried out a apt-get upgrade and something was done for OpenMPI. DOCK 6.1 was compiled: ./configure gnu parallel MPICH_HOME=/usr/local export MPICH_HOME make dock by pointing MPICH (for which DOCK 6.1 is configured) to my installation of OpenMPI 1.2.3 In my .bashrc: DOCK_HOME=/usr/local/dock6 PATH=$PATH:$DOCK_HOME/bib; export DOCK_HOME PATH MPI_HOME=/usr/local export MPI_home which mpicxx /usr/local/bin/mpicxx updatedb locate mpi.h /usr/include/sc/util/group/memmtmpi.h /usr/include/sc/util/group/messmpi.h /usr/dock6/src/dock/base_mpi.h /usr/local/include/mpi.h /usr/local/openmpi-1.2.3/ompi/include/mpi.h /usr/local/openmpi-1.2.3/ompi/include/mpi.h.in /usr/local/openmpi-1.2.3/ompi/mpi/f77/prototypes_mpi.h --- On this basis, running: mpirun -np 4 dock6.mpi -i dock.in -o dock.out the process halted with error message: Initialing MPI routines [deb64:03540] *** Process received signal *** [deb64:03540] Signal: Segmentation fault (11) [deb64:03540] Signal code: Address not mapped (1) [deb64:03540] Failing at address: 0x2b9ef5691000 dock6.mpi[3540]: segfault at 2b9ef5691 rip 00447b1b rsp 7fff43c137b0 error 6 [deb64:03540] [0] /lib/libthread.so.0 [0x2b9e681bc410] [deb64:03540] [1] dock6.mpi (_ZN60rient12match_ligandER7DOCKMol+0x40b) [0x447b1b] [deb64:03540] [2] dock6.mpi (main+0xaf5) [0x42cc75] [deb64:03540] [3] dock6.mpi /lib/libc.so.6(__libc_start_main+0xda) [0x2b9e682e14ca] [deb64:03540] [4] dock6.mpi (__gxx_personality_v0+0xc2) [0x41b4ea] [deb64:03540] *** End of error message *** mpirun noticed that job rank 0 with PID 3537 on node deb64 exited on signal 15 (Terminated). 3 additional processes aborted (not shown) Thanks francesco pietra

-- 
Jeff Squyres
Cisco Systems
[OMPI users] Segmentation fault
I wonder whether any suggestion can be offered about segmentation fault occurring on running a docking program (DOCK 6.1, written in C) on Debian Linux amd64 etch, i.e. dual opterons machine. Running DOCK6.1 parallel was OK until yesterday. I vaguely remember that before these problems I carried out a apt-get upgrade and something was done for OpenMPI. DOCK 6.1 was compiled: ./configure gnu parallel MPICH_HOME=/usr/local export MPICH_HOME make dock by pointing MPICH (for which DOCK 6.1 is configured) to my installation of OpenMPI 1.2.3 In my .bashrc: DOCK_HOME=/usr/local/dock6 PATH=$PATH:$DOCK_HOME/bib; export DOCK_HOME PATH MPI_HOME=/usr/local export MPI_home which mpicxx /usr/local/bin/mpicxx updatedb locate mpi.h /usr/include/sc/util/group/memmtmpi.h /usr/include/sc/util/group/messmpi.h /usr/dock6/src/dock/base_mpi.h /usr/local/include/mpi.h /usr/local/openmpi-1.2.3/ompi/include/mpi.h /usr/local/openmpi-1.2.3/ompi/include/mpi.h.in /usr/local/openmpi-1.2.3/ompi/mpi/f77/prototypes_mpi.h --- On this basis, running: mpirun -np 4 dock6.mpi -i dock.in -o dock.out the process halted with error message: Initialing MPI routines [deb64:03540] *** Process received signal *** [deb64:03540] Signal: Segmentation fault (11) [deb64:03540] Signal code: Address not mapped (1) [deb64:03540] Failing at address: 0x2b9ef5691000 dock6.mpi[3540]: segfault at 2b9ef5691 rip 00447b1b rsp 7fff43c137b0 error 6 [deb64:03540] [0] /lib/libthread.so.0 [0x2b9e681bc410] [deb64:03540] [1] dock6.mpi (_ZN60rient12match_ligandER7DOCKMol+0x40b) [0x447b1b] [deb64:03540] [2] dock6.mpi (main+0xaf5) [0x42cc75] [deb64:03540] [3] dock6.mpi /lib/libc.so.6(__libc_start_main+0xda) [0x2b9e682e14ca] [deb64:03540] [4] dock6.mpi (__gxx_personality_v0+0xc2) [0x41b4ea] [deb64:03540] *** End of error message *** mpirun noticed that job rank 0 with PID 3537 on node deb64 exited on signal 15 (Terminated). 3 additional processes aborted (not shown) Thanks francesco pietra
Re: [OMPI users] problems compiling svn-version
On Wed, Nov 07, 2007 at 10:41:55AM +, Karsten Bolding wrote: > Hello Hi! > there is no support for Fortran - even though F77 and F90 are set as Fortran? Who needs Fortran? ;) Check line 151 in the Makefile. We've disabled Fortran for our developer builds, as we're interested in OMPI, not in Fortran. You can simply remove the two "--disable-mpi-*" switches. HTH -- Cluster and Metacomputing Working Group Friedrich-Schiller-Universität Jena, Germany private: http://adi.thur.de
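[Editorial sketch] Concretely, "remove the two --disable-mpi-* switches" means editing the configure invocation in the meta-Makefile. The exact flag names below (--disable-mpi-f77, --disable-mpi-f90) are an assumption based on Adrian's description and the Open MPI 1.x configure options; verify them against your own Makefile. This throwaway demo edits a hypothetical copy of the line:

```shell
# Hypothetical configure line mirroring the one in the meta-Makefile
# (your line 151 may differ slightly):
cat > configure-line.txt <<'EOF'
./configure --prefix=$(OMPI_INSTALL_PREFIX) --disable-mpi-f77 --disable-mpi-f90 $(CONFIGURE_FLAGS)
EOF
# Dropping the two --disable-mpi-* switches re-enables the Fortran bindings:
sed -i 's/ --disable-mpi-f77 --disable-mpi-f90//' configure-line.txt
cat configure-line.txt
```

With the switches gone, make sure F77/FC point at ifort before rebuilding.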
Re: [OMPI users] problems compiling svn-version
Hello

On Wed, Nov 07, 2007 at 11:03:56AM +0100, Adrian Knoth wrote:
> On Wed, Nov 07, 2007 at 09:45:24AM +, Karsten Bolding wrote:
>
> Place the attached Makefile as e.g. /tmp/my-ompi/Makefile, get the svn
> snapshot into /tmp/my-ompi/ompi and just run "make" in /tmp/my-ompi/.
>
> Over here, it looks like this:
>
> adi@ipc654:/var/tmp/meta-ompi/trunk$ ls
> Makefile Rakefile cc.build.job cunit ompi test tool unittests
>
> You don't need to care about the other files, just to outline where to
> place the OMPI source.
>
> You might want to change CONFIGURE_FLAGS in the Makefile, you'd probably
> comment out the debug line and go for the second variant.

Half working: I can compile and I get an installed version. However, there is no support for Fortran, even though F77 and F90 are set as environment variables (both set to ifort).

> HTH
>
> -- 
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany

kb

-- 
Karsten Bolding     Bolding & Burchard Hydrodynamics
Strandgyden 25      Phone: +45 64422058
DK-5466 Asperup     Fax:   +45 64422068
Denmark             Email: kars...@bolding-burchard.com
http://www.findvej.dk/Strandgyden25,5466,11,3
-- 
Re: [OMPI users] problems compiling svn-version
On Wed, Nov 07, 2007 at 09:45:24AM +, Karsten Bolding wrote:
> Hello

Hi!

> Are there any known issues with Ubuntu's version of libtool? When I run

Libtool is always an issue ;)

To circumvent this, we have a Makefile fetching the right versions, compiling the whole autotools chain, prepending the new PATH, and then compiling OMPI.

Place the attached Makefile as e.g. /tmp/my-ompi/Makefile, get the svn snapshot into /tmp/my-ompi/ompi and just run "make" in /tmp/my-ompi/. Over here, it looks like this:

adi@ipc654:/var/tmp/meta-ompi/trunk$ ls
Makefile Rakefile cc.build.job cunit ompi test tool unittests

You don't need to care about the other files; this is just to outline where to place the OMPI source.

You might want to change CONFIGURE_FLAGS in the Makefile; you'd probably comment out the debug line and go for the second variant.

HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de

# Meta-Makefile to build OpenMPI.
#
# (c) Christian Kauhaus
# -- $Id: Makefile 3640 2007-10-09 14:23:11Z ckauhaus $

SHELL = bash

#
# Configuration section
#
ARCH := $(shell uname -m -s | tr ' ' '-')
# but compile as 32bit even on amd64
ifeq ($(ARCH),Linux-x86_64)
  OMPI_CFLAGS = -m32
  OMPI_LDFLAGS = -Wl,-melf_i386
endif
# BuildBot does not set PWD on directory changes
PWD := $(shell pwd | tr -d '\n')
OMPI = $(PWD)/ompi
# *_INSTALL_PREFIX are initially set to the same default, but may be overridden
# individually.
OMPI_INSTALL_PREFIX = $(PWD)/$(ARCH)
TOOLS_INSTALL_PREFIX = $(PWD)/$(ARCH)
BUILD_DIR = $(PWD)/build/$(ARCH)
OMPI_BUILD_DIR = $(BUILD_DIR)/ompi
TOOLS_BUILD_DIR = $(BUILD_DIR)/autotools
TOOLS = $(TOOLS_INSTALL_PREFIX)/bin/autoconf \
        $(TOOLS_INSTALL_PREFIX)/bin/automake \
        $(TOOLS_INSTALL_PREFIX)/bin/libtool
# CONFIGURE_FLAGS are appended to OMPI's ./configure (besides ignore-Fortran)
CONFIGURE_FLAGS = --enable-debug --enable-trace --enable-static --disable-dlopen
#CONFIGURE_FLAGS = --with-platform=optimized --enable-static --disable-dlopen
CONFIGURE_FLAGS := $(CONFIGURE_FLAGS) CFLAGS=$(OMPI_CFLAGS) CXXFLAGS=$(OMPI_CFLAGS) LDFLAGS=$(OMPI_LDFLAGS)
# use our own auto* tools
PATH := $(DESTDIR)$(OMPI_INSTALL_PREFIX)/bin:$(TOOLS_INSTALL_PREFIX)/bin:$(PATH)
ifeq ($(findstring curl, $(shell which curl)),curl)
  WGET = curl
else
  WGET = wget -q --output-document=-
endif

#
# Get required versions from distribution script
#
GETVERSION = $(shell grep "^$(1)_" $(OMPI)/contrib/dist/make_dist_tarball | sed -e 's/.*=//g')
LT_VERSION = $(call GETVERSION,LT)
AM_VERSION = $(call GETVERSION,AM)
AC_VERSION = $(call GETVERSION,AC)

#
# main targets
#
all: openmpi

.PHONY: test
test: openmpi
	ulimit -u unlimited; umask 022; rake test

#
# Toolchain
#
.PHONY: tools
tools: $(TOOLS_BUILD_DIR) $(TOOLS)

$(TOOLS_BUILD_DIR):
	mkdir -p $@

# build GNU libtool
$(TOOLS_INSTALL_PREFIX)/bin/libtool: $(TOOLS_BUILD_DIR)/libtool/Makefile \
		$(TOOLS_INSTALL_PREFIX)/bin/automake
	cd $(dir $<) && umask 022 && $(MAKE) && $(MAKE) install

$(TOOLS_BUILD_DIR)/libtool/Makefile: $(TOOLS_BUILD_DIR)/libtool-$(LT_VERSION)/*
	mkdir -p $(dir $@)
	cd $(dir $@) && $(dir $<)configure --prefix=$(TOOLS_INSTALL_PREFIX)

# build GNU automake
$(TOOLS_INSTALL_PREFIX)/bin/automake: $(TOOLS_BUILD_DIR)/automake/Makefile \
		$(TOOLS_INSTALL_PREFIX)/bin/autoconf
	cd $(dir $<) && umask 022 && $(MAKE) && $(MAKE) install

$(TOOLS_BUILD_DIR)/automake/Makefile: $(TOOLS_BUILD_DIR)/automake-$(AM_VERSION)/*
	mkdir -p $(dir $@)
	cd $(dir $@) && $(dir $<)configure --prefix=$(TOOLS_INSTALL_PREFIX)

# build GNU autoconf
$(TOOLS_INSTALL_PREFIX)/bin/autoconf: $(TOOLS_BUILD_DIR)/autoconf/Makefile
	cd $(dir $<) && umask 022 && $(MAKE) && $(MAKE) install

$(TOOLS_BUILD_DIR)/autoconf/Makefile: $(TOOLS_BUILD_DIR)/autoconf-$(AC_VERSION)/*
	mkdir -p $(dir $@)
	cd $(dir $@) && $(dir $<)configure --prefix=$(TOOLS_INSTALL_PREFIX)

# the download magic
$(TOOLS_BUILD_DIR)/autoconf-$(AC_VERSION)/*:
	$(WGET) \
	  ftp://ftp.gnu.org/pub/gnu/autoconf/autoconf-$(AC_VERSION).tar.gz |\
	  (cd $(TOOLS_BUILD_DIR) && gzip -dc | tar xf -)

$(TOOLS_BUILD_DIR)/automake-$(AM_VERSION)/*:
	$(WGET) \
	  ftp://ftp.gnu.org/pub/gnu/automake/automake-$(AM_VERSION).tar.gz |\
	  (cd $(TOOLS_BUILD_DIR) && gzip -dc | tar xf -)

LT_URL=$(if $(findstring 2.1a, $(LT_VERSION)), \
	http://www.open-mpi.org/svn/libtool.tar.gz, \
	ftp://ftp.gnu.org/pub/gnu/libtool/libtool-$(LT_VERSION).tar.gz)

$(TOOLS_BUILD_DIR)/libtool-$(LT_VERSION)/*:
	$(WGET) \
	  $(LT_URL) |\
	  (cd $(TOOLS_BUILD_DIR) && gzip -dc | tar xf -)

#
# build OpenMPI
#
.PHONY: openmpi compile install reinstall delete_install
openmpi: install
compile: $(OMPI_BUILD_DIR)
install: $(DESTDIR)$(OMPI_INSTALL_PREFIX)
reinstall: compile delete_install install

delete_install:
	rm -rf $(DESTDIR)$(OMPI_INST
[OMPI users] problems compiling svn-version
Hello

As it seems I need a feature only present in the svn version of OpenMPI, I'm in the process of installing and compiling this version. I've tried on two different machines.

1) Debian: everything worked OK.
   autoconf 2.61-4
   automake 1:1.10+nogfdl-1
   libtool 1.5.24-1
   ifort Version 10.0

2) Ubuntu (single processor/quad-core):
   autoconf 2.61-4
   automake 1:1.10+nogfdl-1
   libtool 1.5.24-1ubuntu1
   ifort Version 10.0

make[2]: Entering directory `/data/kb/compile/openmpi-svn/orte/tools/orteboot'
/bin/sh ../../../libtool --tag=CC --mode=link gcc -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -pthread -export-dynamic -o orteboot orteboot.o ../../../orte/libopen-rte.la -lnsl -lutil -lm
gcc -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -pthread -o .libs/orteboot orteboot.o -Wl,--export-dynamic ../../../orte/.libs/libopen-rte.so -lnsl -lutil -lm -Wl,--rpath -Wl,/opt/openmpi-svn/lib
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_sys_limits'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_cr_finalize'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_cr_set_enabled'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_path_access'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_crs_base_extract_expected_component'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_crs_base_state_str'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_mutex_check_locks'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_progress_set_yield_when_idle'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_cr_init'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_progress_set_event_flag'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_crs_base_snapshot_t_class'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_cr_reg_coord_callback'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_cr_output'
../../../orte/.libs/libopen-rte.so: undefined reference to `opal_get_num_processors'
collect2: ld returned 1 exit status
make[2]: *** [orteboot] Error 1

If I do:
strings orte/.libs/libopen-rte.so.0.0.0 | grep opal_get_num_processors
I get:
opal_get_num_processors

Are there any known issues with Ubuntu's version of libtool? When I run ./autogen.sh I get this:

[Running] autoheader
** Adjusting libtool for OMPI :-(
 ++ patching for pathscale multi-line output (LT 1.5.x)
[Running] autoconf
[Running] libtoolize --automake --copy --ltdl
 -- Moving libltdl to opal/
** Adjusting libltdl for OMPI :-(
 ++ patching for argz bugfix in libtool 1.5
 -- your libtool doesn't need this! yay!
 ++ patching 64-bit OS X bug in ltmain.sh
 -- your libtool doesn't need this! yay!
 ++ RTLD_GLOBAL in libltdl
 -- your libltdl doesn't need this! yay!

I don't get that on machine 1. I tried to copy orte/.libs/libopen-rte.so from 1 to 2 without luck.

kb
Re: [OMPI users] machinefile and rank
Yes, this feature is currently in the SVN. You can use the syntax in: https://svn.open-mpi.org/trac/ompi/ticket/1023

Currently the process affinity doesn't work, but the ranks run on the machines as specified in the hostfile. Currently Ralph is working on removing the new syntax from the hostfile, and together we will implement it in a new config file.

Sharon.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Karsten Bolding
Sent: Wednesday, November 07, 2007 9:40 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] machinefile and rank

On Tue, Nov 06, 2007 at 09:22:50 -0500, Jeff Squyres wrote:
> Unfortunately, not yet. I believe that this kind of functionality is
> slated for the v1.3 series -- is that right Ralph/Voltaire?

That's a pity, since performance of the setup is horrible if I can't control the order. The svn code will develop into v1.3, right? Is the feature already in svn?

kb
Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler
On 06.11.2007, at 10:42, Åke Sandgren wrote:

> Hi,
>
> On Tue, 2007-11-06 at 10:28 +0100, Michael Schulz wrote:
> > Hi,
> > I've the same problem described by some other users: I can't compile
> > anything if I'm using Open MPI compiled with the Intel compiler.
> >
> > ompi_info --all
> > Segmentation fault
> >
> > OpenSUSE 10.3
> > Kernel: 2.6.22.9-0.4-default
> > Intel P4
> > Configure flags: CC=icc, CXX=icpc, F77=ifort, F90=ifort
> > Intel compiler: both C and Fortran, 10.0.025
> >
> > Is there any known solution?
>
> I had the same problem with pathscale. Try this, I think it is the
> solution I found.
>
> diff -ru site/opal/runtime/opal_init.c amd64_ubuntu606-psc/opal/runtime/opal_init.c
> --- site/opal/runtime/opal_init.c	2007-10-20 03:00:35.0 +0200
> +++ amd64_ubuntu606-psc/opal/runtime/opal_init.c	2007-10-23 16:12:15.0 +0200
> @@ -169,7 +169,7 @@
>      }
>
>      /* register params for opal */
> -    if (OPAL_SUCCESS != opal_register_params()) {
> +    if (OPAL_SUCCESS != (ret = opal_register_params())) {
>          error = "opal_register_params";
>          goto return_error;
>      }

Thanks, but this doesn't solve my segv problem.

Michael
Re: [OMPI users] machinefile and rank
On Tue, Nov 06, 2007 at 09:22:50 -0500, Jeff Squyres wrote:
> Unfortunately, not yet. I believe that this kind of functionality is
> slated for the v1.3 series -- is that right Ralph/Voltaire?

That's a pity, since performance of the setup is horrible if I can't control the order. The svn code will develop into v1.3, right? Is the feature already in svn?

kb
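[Editorial sketch] The semantics Karsten is asking for can be stated precisely: line i of the machinefile should become rank i. The hostnames below are taken from his earlier example; this loop only illustrates the desired mapping, it is not current Open MPI behavior (which is exactly the problem):

```shell
# Desired mapping: the i-th machinefile entry hosts rank i.
printf 'n03\nn04\nn03\nn02\n' > machinefile
rank=0
while read -r host; do
  echo "rank $rank -> $host"
  rank=$((rank + 1))
done < machinefile
```

With the load balancer writing the machinefile in its preferred order, this deterministic mapping is what would make the external placement decisions stick.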
Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler
On Tue, 2007-11-06 at 20:49 -0500, Jeff Squyres wrote:
> On Nov 6, 2007, at 4:42 AM, Åke Sandgren wrote:
>
> > I had the same problem with pathscale.
>
> There is a known outstanding problem with the pathscale compiler. I am
> still waiting for a solution from their engineers (we don't know yet
> whether it's an OMPI issue or a Pathscale issue, but my [biased] money
> is on a Pathscale issue :-) -- it doesn't happen with any other
> compiler).
>
> > Try this, I think it is the solution I found.
> >
> > diff -ru site/opal/runtime/opal_init.c amd64_ubuntu606-psc/opal/runtime/opal_init.c
> > --- site/opal/runtime/opal_init.c	2007-10-20 03:00:35.0 +0200
> > +++ amd64_ubuntu606-psc/opal/runtime/opal_init.c	2007-10-23 16:12:15.0 +0200
> > @@ -169,7 +169,7 @@
> >      }
> >
> >      /* register params for opal */
> > -    if (OPAL_SUCCESS != opal_register_params()) {
> > +    if (OPAL_SUCCESS != (ret = opal_register_params())) {
> >          error = "opal_register_params";
> >          goto return_error;
> >      }
>
> I don't see why this change would make any difference in terms of a
> segv...?
>
> I see that ret is an uninitialized variable in the error case (which
> I'll fix -- thanks for pointing it out :-) ) -- but I don't see how
> that would fix a segv. Am I missing something?

The problem is that I don't really remember what fixed my problem (or whether I got interrupted before I managed to fix it in the first place). I have been busy building other software for a couple of weeks. The above was simply the only patch I had made without knowing exactly what it was doing. But judging from trying to run that version of ompi_info, I still have problems. I've been working with this for a while and can hopefully continue pursuing it next week or so.
Re: [OMPI users] machinefile and rank
On Tue, Nov 06, 2007 at 09:22:50PM -0500, Jeff Squyres wrote:
> Unfortunately, not yet. I believe that this kind of functionality is
> slated for the v1.3 series -- is that right Ralph/Voltaire?

Yes, the file format will be different, but arbitrary mapping will be possible.

> On Nov 5, 2007, at 11:22 AM, Karsten Bolding wrote:
>
> > Hello
> >
> > I'm using a machinefile like:
> > n03
> > n04
> > n03
> > n03
> > n03
> > n02
> > n01
> > ..
> > ..
> > ..
> >
> > the order of the entries is determined by an external program for load
> > balancing reasons. When the job is started, the ranks do not correspond
> > to entries in the machinefile. Is there a way to force that entry one in
> > the machinefile gets rank=0, second entry gets rank=1, etc.
> >
> > Karsten

-- 
Gleb.