This does appear to fix the issues with bug #153. This slightly modified patch (https://github.com/SchedMD/slurm/commit/3026f7a7b5799b7e0ac7bc1834106183e2c78d29) has been applied and tested with 2.5 and 2.6. It will be in the next 2.5 and 2.6 releases.

We will keep both patches related to this going forward but this one appears to actually fix the problem. I am in favor of leaving the assert in place to guard against this if there is some other issue we haven't discovered yet. With the addition of this patch the assert no longer is tripped when this case happens.

Thanks for everyone helping on this,
Danny

On 02/04/13 13:14, Matthieu Hautreux wrote:
Re: [slurm-dev] more robust logic for tree message forward Hi,

we ran into that kind of problem too (with 2.4) but where not able to find the root cause at that time. As a result, we made the patch that was recently added/modified in schedmd 2.6 dev branch replacing an error message with an assert.

The kind of problem that you identify and reproduce (not enough replies from a relay node) was what we were expecting too. As we were not sure, we just decided to make a global check and return an error as you noticed. I think your patch is a good addition. I would be in favor of having both your patch and the current one (with an error instead of an assert) as a global/final check to validate that everything was right in the tree msg loop sounds interesting to catch unexpected/unpredictable failures and make the system more resilient.

For the history of our patch, you can find the details on bug #153 on bugs.schedmd.com <http://bugs.schedmd.com>.

Regards,
Matthieu

2013/2/3 Hongjia Cao <[email protected] <mailto:[email protected]>>

    I run into a problem that the node ping/registration agent hangs with
    SLURM-2.5.2.

    After some tracing I find the cause of the problem in that one
    node(cn118) has a bad version of SLURM (2.4.2) installed. On slurmctld
    starting, it send node registration  messages to the compute nodes,
    during which cn118 is requested to forward the message to other nodes:

    [2013-02-02T20:23:06+08:00] debug3: Tree sending to cn118 along with
    cn[119-127]

    On receiving the node registration msg, slurmd on cn118 complains
    about
    the protocol version mismatch:

    slurmd: debug3: in the service_connection
    slurmd: debug:  unsupported RPC 1001
    slurmd: error: Invalid Protocol Version 6400 from uid=0 at
    25.8.6.0:40083 <http://25.8.6.0:40083>
    slurmd: error: slurm_receive_msg_and_forward: Protocol version has
    changed, re-link your code
    slurmd: error: service_connection: slurm_receive_msg: Protocol version
    has changed, re-link your code

    Then in slurmctld, only one ret_data_info is got in the following call
    sequence, since cn118 does not forward the message at all:

    _fwd_tree_thread() => slurm_send_addr_recv_msgs() =>
    _send_and_recv_msgs() => slurm_receive_msgs()

    And errno is set to 0 because slurmctld successfully received a
    message.
    The next node in the tree hostlist will not be tried.

    A thread will wait indefinitely in start_msg_tree() for the count of
    ret_list to match the host count:

            count = list_count(ret_list);
            debug2("Tree head got back %d looking for %d", count,
    host_count);
            while ((count < host_count)) {
                    pthread_cond_wait(&notify, &tree_mutex);
                    count = list_count(ret_list);
                    debug2("Tree head got back %d", count);
            }
            debug2("Tree head got them all");
            slurm_mutex_unlock(&tree_mutex);


    I noticed that in SLURM-2.6 the message forward logic is changed
    that in
    start_msg_tree() the thread will wait for all _fwd_tree_thread()
    threads
    to finish. But the assertion following will cause slurmctld to abort
    under the condition described above:

            slurm_mutex_lock(&tree_mutex);

            count = list_count(ret_list);
            debug2("Tree head got back %d looking for %d", count,
    host_count);
            while (thr_count > 0) {
                    pthread_cond_wait(&notify, &tree_mutex);
                    count = list_count(ret_list);
                    debug2("Tree head got back %d", count);
            }
            xassert(count >= host_count);   /* Tree head did not get all
    responses,
                                             * but no more active fwd
    threads!*/
            slurm_mutex_unlock(&tree_mutex);

    So I think in SLURM-2.6 this will also be a problem.

    I made the attached patch for slurmctld to continue working under the
    condition.



Reply via email to