[slurm-dev] Re: more robust logic for tree message forward

Danny Auble Tue, 12 Feb 2013 16:20:07 -0800

This does appear to fix the issues with bug #153. This slightlymodified patch(https://github.com/SchedMD/slurm/commit/3026f7a7b5799b7e0ac7bc1834106183e2c78d29)has been applied and tested with 2.5 and 2.6. It will be in the next2.5 and 2.6 releases.

We will keep both patches related to this going forward but this oneappears to actually fix the problem. I am in favor of leaving theassert in place to guard against this if there is some other issue wehaven't discovered yet. With the addition of this patch the assert nolonger is tripped when this case happens.


Thanks for everyone helping on this,
Danny

On 02/04/13 13:14, Matthieu Hautreux wrote:

Re: [slurm-dev] more robust logic for tree message forward Hi,

we ran into that kind of problem too (with 2.4) but where not able tofind the root cause at that time. As a result, we made the patch thatwas recently added/modified in schedmd 2.6 dev branch replacing anerror message with an assert.

The kind of problem that you identify and reproduce (not enoughreplies from a relay node) was what we were expecting too. As we werenot sure, we just decided to make a global check and return an erroras you noticed. I think your patch is a good addition. I would be infavor of having both your patch and the current one (with an errorinstead of an assert) as a global/final check to validate thateverything was right in the tree msg loop sounds interesting to catchunexpected/unpredictable failures and make the system more resilient.

For the history of our patch, you can find the details on bug #153 onbugs.schedmd.com <http://bugs.schedmd.com>.


Regards,
Matthieu

2013/2/3 Hongjia Cao <[email protected] <mailto:[email protected]>>

    I run into a problem that the node ping/registration agent hangs with
    SLURM-2.5.2.

    After some tracing I find the cause of the problem in that one
    node(cn118) has a bad version of SLURM (2.4.2) installed. On slurmctld
    starting, it send node registration  messages to the compute nodes,
    during which cn118 is requested to forward the message to other nodes:

    [2013-02-02T20:23:06+08:00] debug3: Tree sending to cn118 along with
    cn[119-127]

    On receiving the node registration msg, slurmd on cn118 complains
    about
    the protocol version mismatch:

    slurmd: debug3: in the service_connection
    slurmd: debug:  unsupported RPC 1001
    slurmd: error: Invalid Protocol Version 6400 from uid=0 at
    25.8.6.0:40083 <http://25.8.6.0:40083>
    slurmd: error: slurm_receive_msg_and_forward: Protocol version has
    changed, re-link your code
    slurmd: error: service_connection: slurm_receive_msg: Protocol version
    has changed, re-link your code

    Then in slurmctld, only one ret_data_info is got in the following call
    sequence, since cn118 does not forward the message at all:

    _fwd_tree_thread() => slurm_send_addr_recv_msgs() =>
    _send_and_recv_msgs() => slurm_receive_msgs()

    And errno is set to 0 because slurmctld successfully received a
    message.
    The next node in the tree hostlist will not be tried.

    A thread will wait indefinitely in start_msg_tree() for the count of
    ret_list to match the host count:

            count = list_count(ret_list);
            debug2("Tree head got back %d looking for %d", count,
    host_count);
            while ((count < host_count)) {
                    pthread_cond_wait(&notify, &tree_mutex);
                    count = list_count(ret_list);
                    debug2("Tree head got back %d", count);
            }
            debug2("Tree head got them all");
            slurm_mutex_unlock(&tree_mutex);


    I noticed that in SLURM-2.6 the message forward logic is changed
    that in
    start_msg_tree() the thread will wait for all _fwd_tree_thread()
    threads
    to finish. But the assertion following will cause slurmctld to abort
    under the condition described above:

            slurm_mutex_lock(&tree_mutex);

            count = list_count(ret_list);
            debug2("Tree head got back %d looking for %d", count,
    host_count);
            while (thr_count > 0) {
                    pthread_cond_wait(&notify, &tree_mutex);
                    count = list_count(ret_list);
                    debug2("Tree head got back %d", count);
            }
            xassert(count >= host_count);   /* Tree head did not get all
    responses,
                                             * but no more active fwd
    threads!*/
            slurm_mutex_unlock(&tree_mutex);

    So I think in SLURM-2.6 this will also be a problem.

    I made the attached patch for slurmctld to continue working under the
    condition.

[slurm-dev] Re: more robust logic for tree message forward

Reply via email to