How about periodically sending a 'ping' to a socket that is monitored by an auxiliary program running on the same node as the master process?
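
Something along these lines might do on the sending side (an untested sketch, not anything Open MPI provides; the host, port, and one-ping-per-step rate are arbitrary placeholders):

    /* Rank 0 calls heartbeat_init() once and heartbeat_ping() each time step.
     * An auxiliary watchdog on the master node listens on the same UDP port
     * with a receive timeout and assumes the job is hung if the pings stop. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    static int hb_sock = -1;
    static struct sockaddr_in hb_to;

    void heartbeat_init(const char *host, int port)
    {
        hb_sock = socket(AF_INET, SOCK_DGRAM, 0);
        memset(&hb_to, 0, sizeof hb_to);
        hb_to.sin_family = AF_INET;
        hb_to.sin_port   = htons(port);
        inet_pton(AF_INET, host, &hb_to.sin_addr);
    }

    void heartbeat_ping(long step)
    {
        /* fire and forget; a dropped packet only costs one missed ping */
        sendto(hb_sock, &step, sizeof step, 0,
               (struct sockaddr *) &hb_to, sizeof hb_to);
    }

The watchdog itself is then just a UDP socket bound to that port with SO_RCVTIMEO set; if recvfrom() times out for long enough, it assumes the job is hung and kills/resubmits it.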

Also, I know you don't want to delve into the third-party libs, but have you actually tried to get to the bottom of the hang, e.g. by running strace, attaching a debugger, or, if you have the Intel tools available, running their MPI profiling tool or similar? Maybe it's something more fundamental?!

Good luck,
Cihan

On 18/06/16 01:58, Alex Kaiser wrote:
An outside monitor should work. My outline of the monitor script (written
with advice from the sysadmin) leaves room for bugs with environment
variables and such.

I wanted to make sure there was not a simpler solution, or one that is
less error prone. Modifying the main routine that calls the library, or
external scripts, is no problem; I only meant that I did not want to
debug the library internals, which are huge and complicated!

Appreciate the advice. Thank you,
Alex

On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:

    Sadly, no - there was some possibility of using a file monitor we
    had for a while, but that isn’t in the 1.6 series. So I fear your
    best bet is to periodically output some kind of marker and have a
    separate process that monitors whether it is being updated. Either
    way would require modifying code, and that seems to be outside the
    desired scope of the solution.

    Afraid I don’t know how to accomplish what you seek without code
    modification.
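
    A rough, untested sketch of that marker idea (the marker file name,
    check interval, and timeout below are all made up): rank 0 rewrites a
    small file once per step, and a separate monitor process compares the
    file’s mtime against a timeout.

        /* Monitor process, run outside (or alongside) mpirun.  The MPI code
         * itself only needs rank 0 to fopen/fprintf/fclose "progress.marker"
         * once per time step. */
        #include <stdio.h>
        #include <sys/stat.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            const char *marker  = "progress.marker";  /* placeholder path   */
            const int   timeout = 600;                /* seconds, arbitrary */
            struct stat st;

            for (;;) {
                sleep(30);
                if (stat(marker, &st) == 0 &&
                    time(NULL) - st.st_mtime > timeout) {
                    fprintf(stderr, "no update for %d s, job looks hung\n",
                            timeout);
                    /* kill and/or resubmit the job here, e.g. via the queue system */
                    return 1;
                }
            }
        }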

    On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:

    Dear Dr. Castain,

    I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any
    other info which would be helpful? Partial output follows.

    Thanks,
    Alex

    -bash-4.1$ ompi_info

        Package: Open MPI l...@soho.es.its.nyu.edu Distribution
        Open MPI: 1.6.5
        ...
        C compiler family name: GNU
        C compiler version: 4.8.2


    On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa
    <g...@ldeo.columbia.edu> wrote:

        Hi Alex

        You know all this, but just in case ...

        Restartable code goes like this:

        *****************************
        .start

        read the initial/previous configuration from a file
        ...
        final_step = first_step + nsteps
        time_step = first_step
        while ( time_step .le. final_step )
          ... march in time ...
          time_step = time_step +1
        end

        save the final_step configuration (or phase space) to a file
        [depending on the algorithm you may need to save the
        previous config also, or perhaps a few more]

        .end
        ************************************************

        Then restart the job time and again, until the desired
        number of time steps is completed.
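
        In C with MPI the same skeleton might look roughly like this
        (just a sketch; the checkpoint file name, binary format, and
        state size are made up):

        ************************************************
        #include <mpi.h>
        #include <stdio.h>

        enum { NSTATE = 100 };                /* placeholder state size */

        int main(int argc, char **argv)
        {
            int    rank;
            long   first_step = 0, nsteps = 1000;  /* nsteps = granularity per job */
            double state[NSTATE] = { 0 };

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* read the initial/previous configuration on rank 0, then share it */
            if (rank == 0) {
                FILE *f = fopen("checkpoint.bin", "rb");
                if (f) {
                    fread(&first_step, sizeof first_step, 1, f);
                    fread(state, sizeof(double), NSTATE, f);
                    fclose(f);
                }
            }
            MPI_Bcast(&first_step, 1, MPI_LONG, 0, MPI_COMM_WORLD);
            MPI_Bcast(state, NSTATE, MPI_DOUBLE, 0, MPI_COMM_WORLD);

            /* march nsteps, then stop */
            long final_step = first_step + nsteps;
            for (long step = first_step; step < final_step; step++) {
                /* ... march in time ... */
            }

            /* save the final configuration so the next job picks up here */
            if (rank == 0) {
                FILE *f = fopen("checkpoint.bin", "wb");
                fwrite(&final_step, sizeof final_step, 1, f);
                fwrite(state, sizeof(double), NSTATE, f);
                fclose(f);
            }

            MPI_Finalize();
            return 0;
        }
        ************************************************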

        Job queue systems/resource managers allow a job to resubmit
        itself,
        so unless a job fails it feels like a single time integration.

        If a job fails in the middle, you don't lose all the work;
        just restart from the previously saved configuration.
        That is the only situation where you need to "monitor" the code.
        Resource managers/queue systems can also email you if
        the job fails, alerting you that manual intervention is needed.

        The time granularity per job (nsteps) is up to you.
        Normally it is adjusted to the max walltime of job queues
        (in a shared computer/cluster),
        but in your case it can be adjusted to how often the program
        fails.

        All atmosphere/ocean/climate/weather-forecast models work
        this way (that's what we mostly run here).
        I guess most CFD, computational chemistry, etc., programs do as well.

        I hope this helps,
        Gus Correa



        On 06/16/2016 05:25 PM, Alex Kaiser wrote:

            Hello,

            I have an MPI code which sometimes hangs: it simply stops
            running. It is not clear why, and it uses many large third-party
            libraries, so I do not want to try to fix it. The code is easy to
            restart, but then it needs to be monitored closely by me, and I'd
            prefer to do that automatically.

            Is there a way, within an MPI process, to detect the hang
            and abort if so?

            In pseudocode, I would like to do something like

                for (all time steps)
                     step
                     if (nothing has happened for x minutes)

                         call mpi abort to return control to the shell

                     endif

                endfor

            This structure does not work, however, because once the process is
            stuck it can no longer do anything, including check on itself.


            Thank you,
            Alex


