You are right, Jeff.

For security reasons, the child is not allowed to share the registered
memory with the parent.
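
To make that concrete, here is a minimal sketch of the behavior being
described -- my own fleshing-out of Jeff's snippet below, with the verbs
setup filled in; treat it as an illustration (error handling omitted), not
a tested program:

----
#include <infiniband/verbs.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* must be called before any verbs resources are created */
    ibv_fork_init();

    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd *pd        = ibv_alloc_pd(ctx);

    int *buffer = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buffer, 4096, IBV_ACCESS_LOCAL_WRITE);

    if (fork() == 0) {
        /* child: the registered page was marked MADV_DONTFORK, so it is
         * not mapped here at all -- this write faults (SIGSEGV) instead
         * of forcing a copy-on-write of a DMA-mapped page */
        *buffer = 3;
        _exit(0);
    }

    /* parent: its mapping and the HCA's DMA translation remain valid */
    *buffer = 3;
    ibv_dereg_mr(mr);
    return 0;
}
----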

On Fri, Apr 24, 2015 at 9:20 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Does the child process end up with valid memory in the buffer in that
> sample?  Back when I paid attention to verbs (which was admittedly a long
> time ago), the sample I pasted would segv...
>
>
> > On Apr 24, 2015, at 9:40 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> >
> > ibv_fork_init() marks registered/locked pages with a special madvise()
> > flag (MADV_DONTFORK/MADV_DOFORK) so that they are not copied into the
> > child (and not COWed) on fork(), and it maintains a refcount for cleanup.
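> >
> > Roughly speaking -- and this is a simplified sketch of the idea, not the
> > actual libibverbs internals -- a registration made after ibv_fork_init()
> > boils down to something like the following (the helper name is made up):
> >
> > ----
> > /* _GNU_SOURCE so glibc exposes madvise() and MADV_DONTFORK */
> > #define _GNU_SOURCE
> > #include <sys/mman.h>
> > #include <unistd.h>
> > #include <stdint.h>
> > #include <stddef.h>
> >
> > /* hypothetical helper: mark a registration's pages so they are not
> >  * copied into the child on fork() */
> > static int mark_dontfork(void *addr, size_t length)
> > {
> >     long page = sysconf(_SC_PAGESIZE);
> >     uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
> >     uintptr_t end   = ((uintptr_t)addr + length + page - 1)
> >                       & ~((uintptr_t)page - 1);
> >
> >     /* MADV_DONTFORK: the parent's DMA-mapped pages stay put; the child
> >      * simply does not inherit this mapping (MADV_DOFORK undoes it) */
> >     return madvise((void *)start, end - start, MADV_DONTFORK);
> > }
> > ----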
> >
> > I think a minimal kernel version (2.6.x) is required to support these
> > flags.
> >
> > I can check internally if you think the behavior is different.
> >
> >
> > On Fri, Apr 24, 2015 at 1:41 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Mike --
> >
> > What happens when you do this?
> >
> > ----
> > ibv_fork_init();
> >
> > int *buffer = malloc(...);
> > ibv_reg_mr(buffer, ...);
> >
> > if (fork() == 0) {
> >     // in the child
> >     *buffer = 3;
> >     // ...
> > }
> > ----
> >
> >
> >
> > > On Apr 24, 2015, at 2:54 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> > >
> > > BTW, OMPI master now calls ibv_fork_init() before initializing the
> > > btl/mtl/oob frameworks, so all fork fears should be addressed.
> > >
> > >
> > > On Fri, Apr 24, 2015 at 4:37 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > > Disable the memory manager / don't use leave pinned.  Then you can
> fork/exec without fear (because only MPI will have registered memory --
> it'll never leave user buffers registered after MPI communications finish).
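> > >
> > > For example, something along these lines (the MCA parameter name is
> > > from memory -- treat it as an assumption and double-check it with
> > > ompi_info for your version; ./your_app is a placeholder):
> > >
> > >     mpirun --mca mpi_leave_pinned 0 ./your_app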
> > >
> > >
> > > > On Apr 23, 2015, at 9:25 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
> > > >
> > > > Jeff
> > > >
> > > > This is kind of a LANL thing; Jack and I are working offline.  Any
> > > > suggestions about openib and fork/exec would be useful, however... and
> > > > don't say no to fork/exec, at least not if you dream of MPI in the data
> > > > center.
> > > >
> > > > On Apr 23, 2015 10:49 AM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > > I am using a “home-cooked” cluster at LANL, ~500 cores.  There are a
> > > > whole bunch of Fortran system calls doing the copying and pasting.  The
> > > > full code is attached here; it is mostly a bunch of if-then statements
> > > > for user options.  Thanks for the help.
> > > >
> > > >
> > > >
> > > > --Jack Galloway
> > > >
> > > >
> > > >
> > > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard
> Pritchard
> > > > Sent: Thursday, April 23, 2015 8:15 AM
> > > > To: Open MPI Users
> > > > Subject: Re: [OMPI users] MPI_Finalize not behaving correctly,
> orphaned processes
> > > >
> > > >
> > > >
> > > > Hi Jack,
> > > >
> > > > Are you using a system at LANL? Maybe I could try to reproduce the
> > > > problem on the system you are using.  The system call stuff adds a
> > > > certain bit of zest to the problem.  Does the app make Fortran system
> > > > calls to do the copying and pasting?
> > > >
> > > > Howard
> > > >
> > > > On Apr 22, 2015 4:24 PM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > >
> > > > I have an MPI program that is fairly straightforward: essentially
> > > > "initialize, 2 sends from master to slaves, 2 receives on slaves, do a
> > > > bunch of system calls for copying/pasting and then run a serial code on
> > > > each MPI task, tidy up and call MPI_Finalize".
> > > >
> > > > This seems straightforward, but I'm not getting MPI_Finalize to work
> > > > correctly. Below is a snapshot of the program, without all the system
> > > > copy/paste/call-external-code parts, which I've rolled up in "do codish
> > > > stuff" type statements.
> > > >
> > > > program mpi_finalize_break
> > > >
> > > >   !<variable declarations>
> > > >
> > > >   call MPI_INIT(ierr)
> > > >   icomm = MPI_COMM_WORLD
> > > >   call MPI_COMM_SIZE(icomm,nproc,ierr)
> > > >   call MPI_COMM_RANK(icomm,rank,ierr)
> > > >
> > > >   !<do codish stuff for a while>
> > > >
> > > >   if (rank == 0) then
> > > >     !<set up some stuff then call MPI_SEND in a loop over number of slaves>
> > > >     call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
> > > >     call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
> > > >   else
> > > >     call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> > > >     call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> > > >     !<do codish stuff for a while>
> > > >   endif
> > > >
> > > >   print*, "got here4", rank
> > > >   call MPI_BARRIER(icomm,ierr)
> > > >   print*, "got here5", rank, ierr
> > > >   call MPI_FINALIZE(ierr)
> > > >
> > > >   print*, "got here6"
> > > >
> > > > end program mpi_finalize_break
> > > >
> > > > Now the problem I am seeing occurs around the "got here4", "got here5"
> > > > and "got here6" statements. I get the appropriate number of print
> > > > statements with corresponding ranks for "got here4" as well as "got
> > > > here5". Meaning, the master and all the slaves (rank 0 and all other
> > > > ranks) got to the barrier call, through the barrier call, and to
> > > > MPI_FINALIZE, reporting 0 for ierr on all of them. However, when it gets
> > > > to "got here6" after the MPI_FINALIZE, I get all kinds of weird behavior.
> > > > Sometimes I get one fewer "got here6" than I expect, sometimes eight
> > > > fewer (it varies); either way the program hangs forever, never closing,
> > > > and leaves an orphaned process on one (or more) of the compute nodes.
> > > >
> > > > I am running this on an InfiniBand backbone machine, with the NFS
> > > > server shared over InfiniBand (nfs-rdma). I'm trying to determine how
> > > > the MPI_BARRIER call works fine, yet MPI_FINALIZE ends up with random
> > > > orphaned runs (not the same node, nor the same number of orphans every
> > > > time). I'm guessing it is related to the various system calls to cp,
> > > > mv, ./run_some_code, cp, mv, but wasn't sure if it may be related to
> > > > the speed of InfiniBand too, as all this happens fairly quickly. I
> > > > could have the wrong intuition as well. Anybody have thoughts? I could
> > > > post the whole code if helpful, but I believe this condensed version
> > > > captures it. I'm running Open MPI 1.8.4 compiled against ifort 15.0.2,
> > > > with Mellanox adapters running firmware 2.9.1000.  This is the Mellanox
> > > > firmware available through yum with CentOS 6.5 (kernel
> > > > 2.6.32-504.8.1.el6.x86_64).
> > > >
> > > > ib0       Link encap:InfiniBand  HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> > > >
> > > >           inet addr:192.168.6.254  Bcast:192.168.6.255
> Mask:255.255.255.0
> > > >
> > > >           inet6 addr: fe80::202:c903:57:e7fd/64 Scope:Link
> > > >
> > > >           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> > > >
> > > >           RX packets:10952 errors:0 dropped:0 overruns:0 frame:0
> > > >
> > > >           TX packets:9805 errors:0 dropped:625413 overruns:0
> carrier:0
> > > >
> > > >           collisions:0 txqueuelen:256
> > > >
> > > >           RX bytes:830040 (810.5 KiB)  TX bytes:643212 (628.1 KiB)
> > > >
> > > >
> > > >
> > > > hca_id: mlx4_0
> > > >
> > > >         transport:                      InfiniBand (0)
> > > >
> > > >         fw_ver:                         2.9.1000
> > > >
> > > >         node_guid:                      0002:c903:0057:e7fc
> > > >
> > > >         sys_image_guid:                 0002:c903:0057:e7ff
> > > >
> > > >         vendor_id:                      0x02c9
> > > >
> > > >         vendor_part_id:                 26428
> > > >
> > > >         hw_ver:                         0xB0
> > > >
> > > >         board_id:                       MT_0D90110009
> > > >
> > > >         phys_port_cnt:                  1
> > > >
> > > >                 port:   1
> > > >
> > > >                         state:                  PORT_ACTIVE (4)
> > > >
> > > >                         max_mtu:                4096 (5)
> > > >
> > > >                         active_mtu:             4096 (5)
> > > >
> > > >                         sm_lid:                 1
> > > >
> > > >                         port_lid:               2
> > > >
> > > >                         port_lmc:               0x00
> > > >
> > > >                         link_layer:             InfiniBand
> > > >
> > > >
> > > >
> > > > This problem only occurs in this simple implementation, hence my
> > > > thinking that it is tied to the system calls.  I run several other,
> > > > much larger, much more robust MPI codes without issue on the machine.
> > > > Thanks for the help.
> > > >
> > > > --Jack
> > > >
> > > >
> > >
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Kind Regards,
> > >
> > > M.
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > --
> >
> > Kind Regards,
> >
> > M.
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>




-- 

Kind Regards,

M.
