This does appear to fix the issues with bug #153. This slightly
modified patch
(https://github.com/SchedMD/slurm/commit/3026f7a7b5799b7e0ac7bc1834106183e2c78d29)
has been applied and tested with 2.5 and 2.6. It will be in the next
2.5 and 2.6 releases.
We will keep both patches related to this going forward but this one
appears to actually fix the problem. I am in favor of leaving the
assert in place to guard against this if there is some other issue we
haven't discovered yet. With the addition of this patch the assert no
longer is tripped when this case happens.
Thanks for everyone helping on this,
Danny
On 02/04/13 13:14, Matthieu Hautreux wrote:
Re: [slurm-dev] more robust logic for tree message forward Hi,
we ran into that kind of problem too (with 2.4) but where not able to
find the root cause at that time. As a result, we made the patch that
was recently added/modified in schedmd 2.6 dev branch replacing an
error message with an assert.
The kind of problem that you identify and reproduce (not enough
replies from a relay node) was what we were expecting too. As we were
not sure, we just decided to make a global check and return an error
as you noticed. I think your patch is a good addition. I would be in
favor of having both your patch and the current one (with an error
instead of an assert) as a global/final check to validate that
everything was right in the tree msg loop sounds interesting to catch
unexpected/unpredictable failures and make the system more resilient.
For the history of our patch, you can find the details on bug #153 on
bugs.schedmd.com <http://bugs.schedmd.com>.
Regards,
Matthieu
2013/2/3 Hongjia Cao <[email protected] <mailto:[email protected]>>
I run into a problem that the node ping/registration agent hangs with
SLURM-2.5.2.
After some tracing I find the cause of the problem in that one
node(cn118) has a bad version of SLURM (2.4.2) installed. On slurmctld
starting, it send node registration messages to the compute nodes,
during which cn118 is requested to forward the message to other nodes:
[2013-02-02T20:23:06+08:00] debug3: Tree sending to cn118 along with
cn[119-127]
On receiving the node registration msg, slurmd on cn118 complains
about
the protocol version mismatch:
slurmd: debug3: in the service_connection
slurmd: debug: unsupported RPC 1001
slurmd: error: Invalid Protocol Version 6400 from uid=0 at
25.8.6.0:40083 <http://25.8.6.0:40083>
slurmd: error: slurm_receive_msg_and_forward: Protocol version has
changed, re-link your code
slurmd: error: service_connection: slurm_receive_msg: Protocol version
has changed, re-link your code
Then in slurmctld, only one ret_data_info is got in the following call
sequence, since cn118 does not forward the message at all:
_fwd_tree_thread() => slurm_send_addr_recv_msgs() =>
_send_and_recv_msgs() => slurm_receive_msgs()
And errno is set to 0 because slurmctld successfully received a
message.
The next node in the tree hostlist will not be tried.
A thread will wait indefinitely in start_msg_tree() for the count of
ret_list to match the host count:
count = list_count(ret_list);
debug2("Tree head got back %d looking for %d", count,
host_count);
while ((count < host_count)) {
pthread_cond_wait(¬ify, &tree_mutex);
count = list_count(ret_list);
debug2("Tree head got back %d", count);
}
debug2("Tree head got them all");
slurm_mutex_unlock(&tree_mutex);
I noticed that in SLURM-2.6 the message forward logic is changed
that in
start_msg_tree() the thread will wait for all _fwd_tree_thread()
threads
to finish. But the assertion following will cause slurmctld to abort
under the condition described above:
slurm_mutex_lock(&tree_mutex);
count = list_count(ret_list);
debug2("Tree head got back %d looking for %d", count,
host_count);
while (thr_count > 0) {
pthread_cond_wait(¬ify, &tree_mutex);
count = list_count(ret_list);
debug2("Tree head got back %d", count);
}
xassert(count >= host_count); /* Tree head did not get all
responses,
* but no more active fwd
threads!*/
slurm_mutex_unlock(&tree_mutex);
So I think in SLURM-2.6 this will also be a problem.
I made the attached patch for slurmctld to continue working under the
condition.