Can you send your full Fortran test program?
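A self-contained reproducer would be ideal: something along the lines of the
sketch below. Note that the external command, the values, and the loop bounds
here are placeholders of mine, not taken from your code:

program mpi_finalize_min
  use mpi
  implicit none
  integer :: ierr, rank, nproc, n, numat, n_to_add
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     ! placeholder values; send to each slave rank in turn
     numat    = 10
     n_to_add = 5
     do n = 1, nproc - 1
        call MPI_SEND(numat,    1, MPI_INTEGER, n, 0, MPI_COMM_WORLD, ierr)
        call MPI_SEND(n_to_add, 1, MPI_INTEGER, n, 0, MPI_COMM_WORLD, ierr)
     end do
  else
     call MPI_RECV(numat,    1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     call MPI_RECV(n_to_add, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     ! stand-in for the copy/run-serial-code step; "/bin/true" is just a
     ! placeholder (your code may use the non-standard "call system" instead)
     call execute_command_line("/bin/true", wait=.true.)
  end if

  print *, "before barrier ", rank
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  print *, "after barrier  ", rank, ierr
  call MPI_FINALIZE(ierr)
  print *, "after finalize ", rank
end program mpi_finalize_min

If something that stripped-down still hangs with your mpif90/mpirun setup, that
narrows things down considerably.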

> On Apr 22, 2015, at 6:24 PM, Galloway, Jack D <ja...@lanl.gov> wrote:
> 
> I have an MPI program that is fairly straightforward: essentially "initialize, 
> 2 sends from master to slaves, 2 receives on slaves, do a bunch of system 
> calls for copying/pasting, then run a serial code on each MPI task, tidy up, 
> and MPI finalize".
> 
> This seems straightforward, but I'm not getting mpi_finalize to work 
> correctly. Below is a snapshot of the program, without all the system 
> copy/paste/call-external-code pieces, which I've rolled up into "do codish 
> stuff" type statements.
> 
> program mpi_finalize_break
> 
> !<variable declarations>
> 
> call MPI_INIT(ierr)
> icomm = MPI_COMM_WORLD
> call MPI_COMM_SIZE(icomm,nproc,ierr)
> call MPI_COMM_RANK(icomm,rank,ierr)
> 
> !<do codish stuff for a while>
> 
> if (rank == 0) then
>     !<set up some stuff then call MPI_SEND in a loop over number of slaves>
>     call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
>     call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
> else
>     call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
>     call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
>     !<do codish stuff for a while>
> endif
> 
> print*, "got here4", rank
> call MPI_BARRIER(icomm,ierr)
> print*, "got here5", rank, ierr
> call MPI_FINALIZE(ierr)
> 
> print*, "got here6"
> 
> end program mpi_finalize_break
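
(Side note: I'm assuming the rolled-up "<do codish stuff>" sections boil down to
something like the subroutine below. Every path, command, and name here is a
placeholder of mine, not from your program; correct me if the shape is
different.)

! Hypothetical stand-in for the rolled-up sections: stage a per-rank working
! directory, then run the serial executable there via the shell.
subroutine run_serial_case(rank)
  implicit none
  integer, intent(in) :: rank
  character(len=256)  :: cmd
  integer             :: estat, cstat

  ! placeholder staging step (copy inputs into a per-rank directory)
  write(cmd, '(a,i0,a,i0)') "mkdir -p work_", rank, " && cp template/* work_", rank
  call execute_command_line(trim(cmd), exitstat=estat, cmdstat=cstat)

  ! placeholder serial run (execute_command_line is Fortran 2008; older codes
  ! often use the non-standard "call system(cmd)" instead)
  write(cmd, '(a,i0,a)') "cd work_", rank, " && ./serial_code > out.log"
  call execute_command_line(trim(cmd), exitstat=estat, cmdstat=cstat)
end subroutine run_serial_case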
> 
> Now the problem I am seeing occurs around the "got here4", "got here5", and 
> "got here6" statements. I get the appropriate number of print statements (it 
> varies), but the program then hangs forever, never closing, and leaves an 
> orphaned process on one (or more) of the compute nodes.
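
One thing that would help narrow this down: print the node name along with the
rank around the external calls and around the barrier, and capture the external
command's status, so you can tell whether the orphan is a stuck child of the
shell-out or an MPI rank stuck in the barrier. A rough sketch of the
instrumentation I have in mind (it assumes rank, icomm, ierr, and cmd from the
surrounding program):

! report which node each rank runs on
character(len=MPI_MAX_PROCESSOR_NAME) :: host
integer :: hostlen, estat, cstat
call MPI_GET_PROCESSOR_NAME(host, hostlen, ierr)

print *, "rank", rank, " on ", host(1:hostlen), " starting external call"
call execute_command_line(trim(cmd), wait=.true., exitstat=estat, cmdstat=cstat)
print *, "rank", rank, " external done, exitstat=", estat, " cmdstat=", cstat

print *, "rank", rank, " on ", host(1:hostlen), " entering barrier"
call MPI_BARRIER(icomm, ierr)
print *, "rank", rank, " barrier returned, ierr=", ierr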
> 
> I am running this on an InfiniBand-backbone machine, with the NFS server 
> shared over InfiniBand (nfs-rdma). The adapter is running firmware 2.9.1000, 
> which is the Mellanox firmware available through yum with CentOS 6.5, kernel 
> 2.6.32-504.8.1.el6.x86_64.
> 
> ib0       Link encap:InfiniBand  HWaddr 
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
>           inet addr:192.168.6.254  Bcast:192.168.6.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c903:57:e7fd/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:10952 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:9805 errors:0 dropped:625413 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:830040 (810.5 KiB)  TX bytes:643212 (628.1 KiB)
> 
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.9.1000
>         node_guid:                      0002:c903:0057:e7fc
>         sys_image_guid:                 0002:c903:0057:e7ff
>         vendor_id:                      0x02c9
>         vendor_part_id:                 26428
>         hw_ver:                         0xB0
>         board_id:                       MT_0D90110009
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               2
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
> 
> This problem only occurs in this simple implementation, hence my thinking that 
> it is tied to the system calls. I run several other, much larger, much more 
> robust MPI codes on this machine without issue. Thanks for the help.
> 
> --Jack
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26765.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
