I ran into a problem where the node ping/registration agent hangs with
SLURM-2.5.2.

After some tracing, I found the cause: one node (cn118) has a
mismatched version of SLURM (2.4.2) installed. When slurmctld starts, it
sends node registration messages to the compute nodes, and cn118 is
asked to forward the message to other nodes:

[2013-02-02T20:23:06+08:00] debug3: Tree sending to cn118 along with cn[119-127]

On receiving the node registration msg, slurmd on cn118 complains about
the protocol version mismatch:

slurmd: debug3: in the service_connection
slurmd: debug:  unsupported RPC 1001
slurmd: error: Invalid Protocol Version 6400 from uid=0 at 25.8.6.0:40083
slurmd: error: slurm_receive_msg_and_forward: Protocol version has changed, re-link your code
slurmd: error: service_connection: slurm_receive_msg: Protocol version has changed, re-link your code
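
The value 6400 in the log is consistent with a version number that keeps
the release in the high byte: (25 << 8) | 0 = 6400 for 2.5, against
(24 << 8) | 0 = 6144 for 2.4. The sketch below only illustrates that
arithmetic. The encoding is my assumption, the helper names are
hypothetical and not SLURM's API, and the "reject anything newer" rule is
a simplification of what the log suggests.

```c
#include <stdint.h>

/* Hypothetical helper (not a SLURM function): pack a release such as
 * 2.5 into a 16-bit protocol version, ASSUMING the layout
 * ((major * 10 + minor) << 8) | 0. */
uint16_t pack_version(unsigned major, unsigned minor)
{
	return (uint16_t)(((major * 10u + minor) << 8) | 0u);
}

/* Hypothetical check: ASSUME a daemon rejects any message whose
 * protocol version is newer than the one it was built with. */
int version_accepted(uint16_t local, uint16_t remote)
{
	return remote <= local;
}
```

Under these assumptions a 2.4 slurmd (6144) rejects a message stamped
6400, which matches the error cn118 prints.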

Then in slurmctld, only one ret_data_info is returned by the following
call sequence, since cn118 does not forward the message at all:

_fwd_tree_thread() => slurm_send_addr_recv_msgs() =>
_send_and_recv_msgs() => slurm_receive_msgs()

errno is set to 0 because slurmctld did successfully receive a message
(the reply from cn118 itself), so the next node in the tree hostlist is
never tried.
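
To make the failure mode concrete: a tree head that is asked to forward
to N nodes should yield N + 1 replies, its own plus one per forwarded
node. The sketch below is a simplified model, not SLURM code; send_fn,
send_tree and stub_send are hypothetical names. It moves on to the next
host as head whenever the reply count falls short, which is the
behaviour the patch aims for. It also assumes a failing head returns at
most its own reply, so partial forwarding is not modelled.

```c
#include <string.h>

/* Hypothetical transport: send to `head`, asking it to forward to
 * `fwd_cnt` further nodes; returns the number of replies collected. */
typedef int (*send_fn)(const char *head, int fwd_cnt);

/* Walk the host list. Each head is asked to forward to every host
 * after it. If it returns fewer than fwd_cnt + 1 replies, only its own
 * reply (if any) is counted and the next host becomes the head.
 * Returns the total number of replies gathered. */
int send_tree(const char *const *hosts, int nhosts, send_fn send)
{
	int total = 0;

	for (int i = 0; i < nhosts; i++) {
		int fwd_cnt = nhosts - i - 1;
		int got = send(hosts[i], fwd_cnt);

		if (got >= fwd_cnt + 1) {
			total += got;	/* head forwarded to all the rest */
			break;
		}
		total += (got > 0) ? 1 : 0;	/* count the head only, try next */
	}
	return total;
}

/* Stub for illustration: cn118 answers but never forwards; every
 * other head forwards successfully. */
int stub_send(const char *head, int fwd_cnt)
{
	if (strcmp(head, "cn118") == 0)
		return 1;
	return fwd_cnt + 1;
}
```

With the hosts cn118, cn119, cn120, cn121 and stub_send, send_tree()
collects one reply from cn118, retries with cn119 as head, and ends up
with all four replies.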

A thread will wait indefinitely in start_msg_tree() for the count of
ret_list to match the host count:

        count = list_count(ret_list);
        debug2("Tree head got back %d looking for %d", count, host_count);
        while ((count < host_count)) {
                pthread_cond_wait(&notify, &tree_mutex);
                count = list_count(ret_list);
                debug2("Tree head got back %d", count);
        }
        debug2("Tree head got them all");
        slurm_mutex_unlock(&tree_mutex);


I noticed that in SLURM-2.6 the message forwarding logic has changed: in
start_msg_tree() the thread now waits for all _fwd_tree_thread() threads
to finish. But the assertion that follows will cause slurmctld to abort
under the condition described above:

        slurm_mutex_lock(&tree_mutex);

        count = list_count(ret_list);
        debug2("Tree head got back %d looking for %d", count, host_count);
        while (thr_count > 0) {
                pthread_cond_wait(&notify, &tree_mutex);
                count = list_count(ret_list);
                debug2("Tree head got back %d", count);
        }
        xassert(count >= host_count);   /* Tree head did not get all responses,
                                         * but no more active fwd threads!*/
        slurm_mutex_unlock(&tree_mutex);

So I think in SLURM-2.6 this will also be a problem.
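
As an illustration of the 2.6-style wait without the abort, here is a
minimal self-contained sketch using plain pthreads. It is not SLURM code
and not what the patch does (the patch retries the next node): the
names tree_state, fwd_arg, fwd_thread and collect_replies are
hypothetical, and an integer counter stands in for list_count(ret_list).
The head waits until no forwarding thread is active, then reports the
shortfall instead of asserting.

```c
#include <pthread.h>
#include <stdio.h>

#define MAX_THREADS 16

/* Shared state of the hypothetical tree head. */
struct tree_state {
	pthread_mutex_t mutex;
	pthread_cond_t notify;
	int replies;	/* stands in for list_count(ret_list) */
	int thr_count;	/* forwarding threads still running */
};

/* Per-thread argument: how many replies this thread delivers. */
struct fwd_arg {
	struct tree_state *st;
	int replies;
};

/* Stand-in for _fwd_tree_thread(): deliver replies, then sign off. */
void *fwd_thread(void *p)
{
	struct fwd_arg *arg = p;

	pthread_mutex_lock(&arg->st->mutex);
	arg->st->replies += arg->replies;
	arg->st->thr_count--;
	pthread_cond_signal(&arg->st->notify);
	pthread_mutex_unlock(&arg->st->mutex);
	return NULL;
}

/* Start one thread per entry of per_thread[], wait until none is
 * active, and return the number of missing replies (0 if complete). */
int collect_replies(const int *per_thread, int nthreads, int host_count)
{
	struct tree_state st;
	struct fwd_arg args[MAX_THREADS];
	pthread_t tid[MAX_THREADS];
	int created[MAX_THREADS];
	int missing;

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;
	pthread_mutex_init(&st.mutex, NULL);
	pthread_cond_init(&st.notify, NULL);
	st.replies = 0;
	st.thr_count = nthreads;

	for (int i = 0; i < nthreads; i++) {
		args[i].st = &st;
		args[i].replies = per_thread[i];
		created[i] = (pthread_create(&tid[i], NULL, fwd_thread,
					     &args[i]) == 0);
		if (!created[i]) {	/* thread never ran: do not wait for it */
			pthread_mutex_lock(&st.mutex);
			st.thr_count--;
			pthread_mutex_unlock(&st.mutex);
		}
	}

	pthread_mutex_lock(&st.mutex);
	while (st.thr_count > 0)
		pthread_cond_wait(&st.notify, &st.mutex);
	missing = host_count - st.replies;
	pthread_mutex_unlock(&st.mutex);

	for (int i = 0; i < nthreads; i++)
		if (created[i])
			pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&st.mutex);
	pthread_cond_destroy(&st.notify);

	if (missing > 0)	/* log instead of aborting the daemon */
		fprintf(stderr, "tree head: %d of %d replies missing\n",
			missing, host_count);
	return missing > 0 ? missing : 0;
}
```

Because the loop condition is the thread count rather than the reply
count, the head cannot hang when a node fails to forward, and whether a
shortfall is fatal becomes the caller's decision.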

I made the attached patch so that slurmctld keeps working under this
condition.
From 328c8a816b2eab59f4b4322b2ebbd3ca3871fe9f Mon Sep 17 00:00:00 2001
From: Hongjia Cao <[email protected]>
Date: Sun, 3 Feb 2013 12:06:32 +0800
Subject: [PATCH] more robust tree message forward logic.

---
 src/common/forward.c |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/src/common/forward.c b/src/common/forward.c
index 00db8eb..aa0356e 100644
--- a/src/common/forward.c
+++ b/src/common/forward.c
@@ -316,6 +316,9 @@ void *_fwd_tree_thread(void *arg)
 {
 	fwd_tree_t *fwd_tree = (fwd_tree_t *)arg;
 	List ret_list = NULL;
+	ListIterator itr = NULL;
+	ret_data_info_t *ret_data_info = NULL;
+	int ret_cnt;
 	char *name = NULL;
 	char *buf = NULL;
 	slurm_msg_t send_msg;
@@ -360,11 +363,32 @@ void *_fwd_tree_thread(void *arg)
 		xfree(send_msg.forward.nodelist);
 
 		if (ret_list) {
+			ret_cnt = list_count(ret_list); 
+			if (ret_cnt <= send_msg.forward.cnt &&
+			    errno != SLURM_COMMUNICATIONS_CONNECTION_ERROR) {
+				error("fwd_tree_thread: %s failed to forward "
+				      "the message, expecting %d ret got only"
+				      " %d",
+				      name, send_msg.forward.cnt + 1, ret_cnt);
+				if (ret_cnt > 1) { /* not likely */
+					itr = list_iterator_create(ret_list);
+	                        	while ((ret_data_info = list_next(itr))) {
+						if (strcmp(ret_data_info->node_name, name)) 
+							hostlist_delete_host(fwd_tree->tree_hl, ret_data_info->node_name);
+					}
+			        	list_iterator_destroy(itr);
+				}
+			}
 			slurm_mutex_lock(fwd_tree->tree_mutex);
 			list_transfer(fwd_tree->ret_list, ret_list);
 			pthread_cond_signal(fwd_tree->notify);
 			slurm_mutex_unlock(fwd_tree->tree_mutex);
 			list_destroy(ret_list);
+			/* try next node */
+			if (ret_cnt <= send_msg.forward.cnt) {
+				free(name);
+				continue;
+			}
 		} else {
 			/* This should never happen (when this was
 			 * written slurm_send_addr_recv_msgs always
-- 
1.7.10.4
