Hi Peter

Given how long it takes to hit the problem, have you checked your file and disk quotas? Could be that the file is simply getting too big.
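If you want to check that from inside the program, something like this should work (an untested sketch; it assumes stdout has been redirected to a regular file, which is not the case if mpirun is forwarding it over a pipe):

/* Untested sketch: report the current size of stdout against the
 * process file-size limit (RLIMIT_FSIZE). Only meaningful when
 * stdout is redirected to a regular file. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/stat.h>

static void check_output_size(void)
{
    struct rlimit rl;
    struct stat st;

    if (getrlimit(RLIMIT_FSIZE, &rl) == 0 &&
        fstat(fileno(stdout), &st) == 0) {
        fprintf(stderr, "stdout at %lld bytes; RLIMIT_FSIZE = %lld\n",
                (long long) st.st_size,
                rl.rlim_cur == RLIM_INFINITY ? -1LL : (long long) rl.rlim_cur);
    }
}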

I'm also a tad curious how you got valgrind to work on OS X - I wasn't aware it supported that environment.

If all that looks okay, the next thing would be to put some kind of check in handle_message to see what message you are actually attempting to output when it hangs. See if there is something that would cause fputs to have a heart attack - perhaps you have a message counter that rolls over (e.g., a 16-bit counter that wraps once you have received too many messages).
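Something along these lines, as a rough sketch - the real handle_message signature is whatever you have in migrate_mpi.c, so treat the names here as placeholders:

/* Hypothetical instrumentation: log each message to unbuffered stderr
 * before the possibly-blocking fputs, so the last line on stderr shows
 * what the master was trying to write when it hung. The 16-bit counter
 * also makes a rollover easy to spot. */
#include <stdio.h>
#include <string.h>

static void handle_message_debug(const char *rawmessage, int sender)
{
    static unsigned short counter = 0;   /* wraps to 0 after 65535 messages */

    ++counter;
    fprintf(stderr, "msg #%u from sender %d (%zu bytes): %.40s\n",
            counter, sender, strlen(rawmessage), rawmessage);

    if (fputs(rawmessage, stdout) == EOF)
        fprintf(stderr, "fputs failed on msg #%u\n", counter);
    fflush(stdout);
}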

Ralph


On Nov 5, 2008, at 8:12 PM, Peter Beerli wrote:

On some of my larger problems (50 or more nodes, 'long' runs of >5 hours),
my program stalls and does not continue.

My program is set up as a master-worker scheme, and it seems that the master gets stuck in a write to stdout; see the gdb backtrace below (it took all day to get there on 50 nodes). The function handle_message simply prints to stdout in this case. Of course the workers keep sending stuff to the master, but the master is stuck in a write that never finishes. Any idea where to look next?
[Smaller runs look fine, and valgrind did not find problems in my code (though it complains a lot about Open MPI).]
I also attach the ompi_info output to show the versions (the OS is Mac OS X 10.5.5).
Any idea what is going on? [Any hint is welcome!]

thanks
Peter

(gdb) bt
#0  0x00000037528c0e50 in __write_nocancel () from /lib64/libc.so.6
#1  0x00000037528694b3 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00000037528693c6 in _IO_new_do_write () from /lib64/libc.so.6
#3  0x000000375286a822 in _IO_new_file_xsputn () from /lib64/libc.so.6
#4  0x000000375285f4f8 in fputs () from /lib64/libc.so.6
#5  0x000000000045e9de in handle_message (
rawmessage=0x4bb8830 "M0:[ 12] Swapping between 4 temperatures. \n", ' ' <repeats 11 times>, "Temperature | Accepted | Swaps between temperatures\n", ' ' <repeats 16 times>, "1e+06 | 0.00 | |\n", ' ' <repeats 15 times>, "3.0000 | 0.08 | 1 ||"..., sender=12, world=0x448d8b0)
  at migrate_mpi.c:3663
#6  0x000000000045362a in mpi_runloci_master (loci=1, who=0x4541fc0,
  world=0x448d8b0, options_readsum=0, menu=0) at migrate_mpi.c:228
#7  0x000000000044ed86 in run_sampler (options=0x448dc20, data=0x4465a10,
  universe=0x42b90c0, usize=4, outfilepos=0x7fff0ff98ee0,
  Gmax=0x7fff0ff98ee8) at main.c:885
#8  0x000000000044dff2 in main (argc=3, argv=0x7fff0ff99008) at main.c:422


petal:~>ompi_info
              Open MPI: 1.2.8
 Open MPI SVN revision: r19718
              Open RTE: 1.2.8
 Open RTE SVN revision: r19718
                  OPAL: 1.2.8
     OPAL SVN revision: r19718
                Prefix: /home/beerli/openmpi
Configured architecture: x86_64-unknown-linux-gnu
         Configured by: beerli
         Configured on: Mon Nov  3 15:00:02 EST 2008
        Configure host: petal
              Built by: beerli
              Built on: Mon Nov  3 15:08:02 EST 2008
            Built host: petal
            C bindings: yes
          C++ bindings: yes
    Fortran77 bindings: yes (all)
    Fortran90 bindings: yes
Fortran90 bindings size: small
            C compiler: gcc
   C compiler absolute: /usr/bin/gcc
          C++ compiler: g++
 C++ compiler absolute: /usr/bin/g++
    Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
    Fortran90 compiler: gfortran
Fortran90 compiler abs: /usr/bin/gfortran
           C profiling: yes
         C++ profiling: yes
   Fortran77 profiling: yes
   Fortran90 profiling: yes
        C++ exceptions: no
        Thread support: posix (mpi: no, progress: no)
Internal debug support: no
   MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
       libltdl support: yes
 Heterogeneous support: yes
mpirun default --prefix: no
         MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.8)
            MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.8)
         MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.8)
         MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.8)
             MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.8)
       MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.8)
       MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.8)
         MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
         MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
              MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.8)
              MCA coll: self (MCA v1.0, API v1.0, Component v1.2.8)
              MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.8)
              MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.8)
                MCA io: romio (MCA v1.0, API v1.0, Component v1.2.8)
             MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.8)
             MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.8)
               MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
               MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)
               MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.8)
            MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.8)
               MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.8)
               MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.8)
               MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
              MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.8)
               MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.8)
            MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.8)
            MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.8)
            MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.8)
               MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.8)
               MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.8)
               MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.8)
               MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.8)
               MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.8)
                MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.8)
                MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.8)
               MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
               MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.8)
               MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
               MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.8)
               MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.8)
               MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.8)
               MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.8)
               MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.8)
             MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.8)
              MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.8)
              MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.8)
               MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.8)
               MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
               MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.8)
               MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.8)
               MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.8)
               MCA sds: env (MCA v1.0, API v1.0, Component v1.2.8)
               MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.8)
               MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.8)
               MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.8)
               MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.8)