I ran into a problem where the node ping/registration agent hangs with
SLURM-2.5.2.
After some tracing I found the cause: one node (cn118) has a mismatched
version of SLURM (2.4.2) installed. When slurmctld starts, it sends node
registration messages to the compute nodes, and cn118 is asked to
forward the message to other nodes:
[2013-02-02T20:23:06+08:00] debug3: Tree sending to cn118 along with
cn[119-127]
On receiving the node registration msg, slurmd on cn118 complains about
the protocol version mismatch:
slurmd: debug3: in the service_connection
slurmd: debug: unsupported RPC 1001
slurmd: error: Invalid Protocol Version 6400 from uid=0 at
25.8.6.0:40083
slurmd: error: slurm_receive_msg_and_forward: Protocol version has
changed, re-link your code
slurmd: error: service_connection: slurm_receive_msg: Protocol version
has changed, re-link your code
Then in slurmctld, only one ret_data_info is returned by the following
call sequence, since cn118 does not forward the message at all:
_fwd_tree_thread() => slurm_send_addr_recv_msgs() =>
_send_and_recv_msgs() => slurm_receive_msgs()
errno is set to 0 because slurmctld did successfully receive a message,
so the next node in the tree hostlist is never tried.
A thread will wait indefinitely in start_msg_tree() for the count of
ret_list to match the host count:
count = list_count(ret_list);
debug2("Tree head got back %d looking for %d", count, host_count);
while ((count < host_count)) {
pthread_cond_wait(&notify, &tree_mutex);
count = list_count(ret_list);
debug2("Tree head got back %d", count);
}
debug2("Tree head got them all");
slurm_mutex_unlock(&tree_mutex);
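
To make the shortfall concrete, here is a small standalone model of one
tree branch. It is only a sketch, not SLURM code: the function names are
made up for illustration, and it assumes a head node that does not
forward still returns exactly one response for itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not SLURM code.
 * Responses collected from one branch of node_count hosts when the
 * message is sent to the head node only. A reply from the head counts
 * as success, so no other node in the branch is tried. */
size_t responses_unpatched(size_t node_count, bool head_forwards)
{
    assert(node_count > 0);
    /* A forwarding head returns one response per host in the branch;
     * a non-forwarding head returns only its own. */
    return head_forwards ? node_count : 1;
}

/* Loop condition used by the tree head in the snippet above:
 * true means it goes back to pthread_cond_wait(). */
bool tree_head_keeps_waiting(size_t count, size_t host_count)
{
    return count < host_count;
}
```

For cn118 plus cn[119-127] the branch has 10 hosts. With a
non-forwarding head the model yields 1 response, so the loop condition
stays true and nothing is left to signal the condition variable.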
I noticed that in SLURM-2.6 the message forwarding logic has changed so
that in start_msg_tree() the thread waits for all _fwd_tree_thread()
threads to finish. But the assertion that follows will cause slurmctld
to abort under the condition described above:
slurm_mutex_lock(&tree_mutex);
count = list_count(ret_list);
debug2("Tree head got back %d looking for %d", count,
host_count);
while (thr_count > 0) {
pthread_cond_wait(&notify, &tree_mutex);
count = list_count(ret_list);
debug2("Tree head got back %d", count);
}
xassert(count >= host_count); /* Tree head did not get all responses,
 * but no more active fwd threads!*/
slurm_mutex_unlock(&tree_mutex);
So I think this will also be a problem in SLURM-2.6.
I made the attached patch so that slurmctld can continue working under
this condition.
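
The idea of the patch can be sketched with a small standalone model.
This is not the patch itself and not SLURM code: the function name and
the bitmask parameter are invented for illustration, and it assumes a
node that fails to forward still returns one response for itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not SLURM code.
 * Bit i of bad_mask set means host i replies but does not forward.
 * Returns the total number of responses the tree thread collects when
 * it moves on to the next host after a short reply. */
size_t responses_with_retry(size_t node_count, unsigned long bad_mask)
{
    size_t collected = 0;

    assert(node_count <= 32); /* keep the shift below well defined */
    for (size_t head = 0; head < node_count; head++) {
        /* Hosts the current head is asked to forward to. */
        size_t forward_cnt = node_count - head - 1;
        bool forwards = !((bad_mask >> head) & 1UL);
        size_t ret_cnt = forwards ? forward_cnt + 1 : 1;

        collected += ret_cnt;
        if (ret_cnt > forward_cnt)
            break;  /* every remaining host has answered */
        /* ret_cnt <= forward_cnt: short reply, try the next node */
    }
    return collected;
}
```

With 10 hosts and only the first one misbehaving (bad_mask = 0x1), the
model collects 1 response from the first host and 9 via the second, so
the total matches the host count and the tree head can proceed.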
From 328c8a816b2eab59f4b4322b2ebbd3ca3871fe9f Mon Sep 17 00:00:00 2001
From: Hongjia Cao <[email protected]>
Date: Sun, 3 Feb 2013 12:06:32 +0800
Subject: [PATCH] more robust tree message forward logic.
---
src/common/forward.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/src/common/forward.c b/src/common/forward.c
index 00db8eb..aa0356e 100644
--- a/src/common/forward.c
+++ b/src/common/forward.c
@@ -316,6 +316,9 @@ void *_fwd_tree_thread(void *arg)
{
fwd_tree_t *fwd_tree = (fwd_tree_t *)arg;
List ret_list = NULL;
+ ListIterator itr = NULL;
+ ret_data_info_t *ret_data_info = NULL;
+ int ret_cnt;
char *name = NULL;
char *buf = NULL;
slurm_msg_t send_msg;
@@ -360,11 +363,32 @@ void *_fwd_tree_thread(void *arg)
xfree(send_msg.forward.nodelist);
if (ret_list) {
+ ret_cnt = list_count(ret_list);
+ if (ret_cnt <= send_msg.forward.cnt &&
+ errno != SLURM_COMMUNICATIONS_CONNECTION_ERROR) {
+ error("fwd_tree_thread: %s failed to forward "
+ "the message, expecting %d ret got only"
+ " %d",
+ name, send_msg.forward.cnt + 1, ret_cnt);
+ if (ret_cnt > 1) { /* not likely */
+ itr = list_iterator_create(ret_list);
+ while ((ret_data_info = list_next(itr))) {
+ if (strcmp(ret_data_info->node_name, name))
+ hostlist_delete_host(fwd_tree->tree_hl, ret_data_info->node_name);
+ }
+ list_iterator_destroy(itr);
+ }
+ }
slurm_mutex_lock(fwd_tree->tree_mutex);
list_transfer(fwd_tree->ret_list, ret_list);
pthread_cond_signal(fwd_tree->notify);
slurm_mutex_unlock(fwd_tree->tree_mutex);
list_destroy(ret_list);
+ /* try next node */
+ if (ret_cnt <= send_msg.forward.cnt) {
+ free(name);
+ continue;
+ }
} else {
/* This should never happen (when this was
* written slurm_send_addr_recv_msgs always
--
1.7.10.4