The first variable can probably be moved to OPAL pretty easily. It is
used when we need to fully shut down the BTLs and re-init them on continue.
We do not have to do that for TCP (since we leave the sockets open), but we
do have to do that for IB, for example.
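
To make that concrete, here is a rough sketch of the decision, assuming the
flag has been moved down a layer. All names here are hypothetical (they are
not the real Open MPI symbols), and treating restart as always needing a
re-init is my assumption:

```c
#include <stdbool.h>

/* Hypothetical stand-in for orte_cr_continue_like_restart after it has
 * been moved out of ORTE; the real name and location may differ. */
static bool cr_continue_like_restart = false;

/* Hypothetical checkpoint phases, for illustration only. */
typedef enum { FT_CHECKPOINT, FT_CONTINUE, FT_RESTART } ft_state_t;

/* Returns true when a transport must be fully shut down and re-initialized.
 * keeps_connections_open models a transport like TCP that leaves its
 * sockets open across a checkpoint. */
static bool needs_full_reinit(ft_state_t state, bool keeps_connections_open)
{
    if (FT_RESTART == state) {
        /* Assumption: a restarted process always rebuilds its transports. */
        return true;
    }
    if (FT_CONTINUE == state && cr_continue_like_restart) {
        /* Continue is being treated like restart: only transports that
         * kept their connections open can skip the re-init. */
        return !keeps_connections_open;
    }
    return false;
}
```

With the flag set, needs_full_reinit(FT_CONTINUE, true) models the TCP case
(no re-init) and needs_full_reinit(FT_CONTINUE, false) models the IB case.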

The second call is a bit trickier, since it leaves a 'note' about a file
that needs to be created (touch'ed) on restart in order for the sm BTL
component to restart properly. For sm we leave the shared memory file open
and in place when we checkpoint, since on 'continue' we just keep using it.
But on restart the file will no longer be there, which can cause the
process to crash when restarted. So just before restart we touch the file,
then clean up the old reference and the old (newly touch'ed) file during
the restart INC when the process is being rebuilt.

So that is all that call is doing: writing the name of the file into
the metadata for the snapshot. Then opal_restart will touch the file just
before calling the CRS component to restart the process. So we just need to
replace it with a call that sets this data in the metadata file. Take a
look in the CRS components and the CR infrastructure to see how they are
writing to the snapshot metadata (they might do it directly).
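
As a rough illustration of the note-then-touch idea (not the actual sstore
or CRS code), something like the following would do. The "touch: " key, the
one-note-per-line metadata layout, and both function names are made up for
this sketch; the real snapshot metadata format may well differ:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical metadata key; the real snapshot metadata may differ. */
#define TOUCH_KEY "touch: "

/* Checkpoint side: append a "touch: <file>" note to the metadata file.
 * Returns 0 on success, -1 on failure. */
static int metadata_note_touch(const char *metadata_path, const char *file_name)
{
    FILE *meta = fopen(metadata_path, "a");
    if (NULL == meta) {
        return -1;
    }
    int ok = fprintf(meta, "%s%s\n", TOUCH_KEY, file_name) > 0;
    if (0 != fclose(meta)) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Restart side: create (touch) every file noted in the metadata.
 * Returns the number of files touched, or -1 if the metadata file
 * cannot be opened. */
static int metadata_touch_all(const char *metadata_path)
{
    char line[4096];
    const size_t key_len = strlen(TOUCH_KEY);
    int touched = 0;
    FILE *meta = fopen(metadata_path, "r");
    if (NULL == meta) {
        return -1;
    }
    while (NULL != fgets(line, sizeof(line), meta)) {
        if (0 != strncmp(line, TOUCH_KEY, key_len)) {
            continue;                      /* not a touch note */
        }
        line[strcspn(line, "\n")] = '\0';  /* strip trailing newline */
        /* Mode "a" creates the file if missing and never truncates it. */
        FILE *f = fopen(line + key_len, "a");
        if (NULL != f) {
            fclose(f);
            touched++;
        }
    }
    fclose(meta);
    return touched;
}
```

In this sketch the BTL would call metadata_note_touch() at checkpoint time
with the shared memory segment name, and the restart tool would call
metadata_touch_all() just before handing off to the CRS component.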

Unfortunately, I have been away from that code long enough to not easily
remember how to do it. Let me know if that gives you enough to move forward
on.

Thanks,
Josh



On Fri, Oct 17, 2014 at 9:15 AM, Adrian Reber <adr...@lisas.de> wrote:

> Josh,
>
> I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are
> two uses of orte code:
>
> if (orte_cr_continue_like_restart)
>
> and
>
>  /* On restart we need the old file names to exist (not necessarily
>   * contain content) so the CRS component does not fail when  searching
>   * for these old file handles. The restart procedure will make sure
>   * these files get cleaned up appropriately.
>   */
>  orte_sstore.set_attr(orte_sstore_handle_current,
>                       SSTORE_METADATA_LOCAL_TOUCH,
>                       mca_btl_sm_component.sm_seg->shmem_ds.seg_name);
>
>
> Do you have an idea how to fix those two? The first variable
> orte_cr_continue_like_restart could probably be moved but I am not sure
> how to handle the sstore call.
>
>                 Adrian
>
>
> On Sat, Aug 09, 2014 at 08:46:31AM -0500, Josh Hursey wrote:
> > Those calls should be protected with the CR FT #define - If I remember
> > correctly. We were using the sstore to track the shared memory file names
> > so we could clean them up on restart.
> >
> > I'm not sure if the sstore framework is necessary in this location, since
> > we should be able to tell opal_crs and it will do the right thing. I can
> > try to look at it early next week if someone doesn't get to it before
> > then.
> >
> > -- Josh
> >
> >
> >
> > On Sat, Aug 9, 2014 at 7:06 AM, Jeff Squyres (jsquyres)
> > <jsquy...@cisco.com> wrote:
> >
> > > I think you're making a joke, right...?
> > >
> > > I see direct calls to ORTE sstore functionality in all three.
> > >
> > >
> > >
> > >
> > > On Aug 8, 2014, at 5:42 PM, George Bosilca <bosi...@icl.utk.edu>
> > > wrote:
> > >
> > > > These are harmless. They are only used when FT is enabled which
> > > > should rarely be the case.
> > > >
> > > >   George.
> > > >
> > > >
> > > >
> > > > On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres)
> > > > <jsquy...@cisco.com> wrote:
> > > > Here's a few ORTE headers in OPAL source -- can respective owners
> > > > clean these up?  Thanks.
> > > >
> > > > -----
> > > > mca/btl/smcuda/btl_smcuda.c
> > > > 63:#include "orte/mca/sstore/sstore.h"
> > > >
> > > > mca/btl/sm/btl_sm.c
> > > > 62:#include "orte/mca/sstore/sstore.h"
> > > >
> > > > mca/mpool/sm/mpool_sm_module.c
> > > > 34:#include "orte/mca/sstore/sstore.h"
> > > > -----
> > > >
> > > > --
> > > > Jeff Squyres
> > > > jsquy...@cisco.com
> > > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > > >
> > > > _______________________________________________
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > Link to this post:
> > > http://www.open-mpi.org/community/lists/devel/2014/08/15570.php
> > > >
> > > > _______________________________________________
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > Link to this post:
> > > http://www.open-mpi.org/community/lists/devel/2014/08/15571.php
> > >
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php
> > >
> >
> >
> >
> > --
> > Joshua Hursey
> > Assistant Professor of Computer Science
> > University of Wisconsin-La Crosse
> > http://cs.uwlax.edu/~jjhursey
>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15588.php
>
>
>                 Adrian
>
> --
> Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
> ink, n.:
>         A villainous compound of tannogallate of iron, gum-arabic,
>         and water, chiefly used to facilitate the infection of
>         idiocy and promote intellectual crime.
>                 -- H.L. Mencken
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/10/16061.php
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
