I have a support request from a customer asking for a way to preserve 
dynamically updated partition state information across an "scontrol 
reconfig" command.  Currently, the state of nodes is preserved across 
an "scontrol reconfig", but partitions are completely rebuilt from the 
slurm.conf information.

The dynamic partition state does persist across a SIGHUP to slurmctld, but 
sending a SIGHUP via a "kill -s SIGHUP <slurmctld pid>" is rather awkward 
and outside of SLURM.  In addition, if a "reconfig" is requested for some 
other reason such as changing the log file location or updating node 
information, then the partitions are reconstructed and the current state 
is lost.

I propose adding a new option to the "slurm.conf" file, for example 
"ReconfKeepPartState", to satisfy this request.  The option would 
default to "NO" to preserve the old behavior, but could be set to 
"YES" to request the new functionality of keeping/merging the partition 
information on an "scontrol reconfig".

Here is the proposed addition to the "slurm.conf" man page to describe 
this option:

  ReconfKeepPartState
    If set to YES, an "scontrol reconfig" command will maintain the
    in-memory state of partitions that may have been dynamically
    updated by "scontrol update".  Partition information in the
    slurm.conf file will be merged with in-memory data.  The default
    is NO, which will rebuild the partition information using only the
    definitions in the slurm.conf file, whenever an "scontrol
    reconfig" is done.
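
As a rough illustration of the YES/NO semantics described above (this is 
not the SLURM config parser), here is a small stand-alone sketch.  The 
helper name toy_keep_part_state() and the TOY_ macro are hypothetical, and 
the case-insensitive matching and the -1 return for unrecognized values 
are assumptions made for the example only.

```c
#include <stddef.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/* Hypothetical default, mirroring the proposed default of NO. */
#define TOY_DEFAULT_KEEP_PART_STATE 0

/*
 * Map the text of a ReconfKeepPartState setting to a flag.
 * value == NULL models "option not present in slurm.conf".
 * Returns 1 for YES, 0 for NO or absent, -1 for anything else.
 */
int toy_keep_part_state(const char *value)
{
	if (value == NULL)
		return TOY_DEFAULT_KEEP_PART_STATE;
	if (strcasecmp(value, "YES") == 0)
		return 1;
	if (strcasecmp(value, "NO") == 0)
		return 0;
	return -1;	/* unrecognized value */
}
```

With this sketch, an absent option and an explicit NO both yield 0, so 
existing configurations keep the current rebuild behavior.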

Essentially, the new option causes slurmctld to treat the partition update 
as if SIGHUP had been requested instead of completely rebuilding the 
partition information.  Since this is an enhancement, the proposed patch 
set below that implements this change is against the SLURM 2.4.0 version.
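
To make the intended behavior concrete, here is a small stand-alone model 
of the decision that the last hunk of the patch changes.  It is only a 
sketch and not slurmctld code: toy_part_t and toy_reconfig() are 
hypothetical names, and the "merge" is reduced to copying a single 
dynamically changed field (the partition state) from the old in-memory 
record onto the record freshly built from slurm.conf.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified partition record (not SLURM's part_record). */
typedef struct {
	const char *name;
	bool state_up;	/* may change at run time, e.g. via "scontrol update" */
} toy_part_t;

/*
 * Model of a reconfigure.  new_parts[] has just been built from slurm.conf.
 * If recover > 1 or keep_part_state is set, the state of every partition
 * that also exists in old_parts[] is copied over.  Otherwise new_parts[] is
 * left exactly as defined in the configuration file.
 * Returns the number of partitions whose state was restored.
 */
int toy_reconfig(toy_part_t *new_parts, size_t new_cnt,
		 const toy_part_t *old_parts, size_t old_cnt,
		 int recover, bool keep_part_state)
{
	int restored = 0;
	size_t i, j;

	if (!old_parts || !((recover > 1) || keep_part_state))
		return 0;	/* old behavior: slurm.conf wins */

	for (i = 0; i < new_cnt; i++) {
		for (j = 0; j < old_cnt; j++) {
			if (strcmp(new_parts[i].name, old_parts[j].name) == 0) {
				new_parts[i].state_up = old_parts[j].state_up;
				restored++;
				break;
			}
		}
	}
	return restored;
}
```

In this model a partition that appears only in slurm.conf keeps its 
configured state, and one that was changed at run time keeps the run-time 
state whenever the flag is set.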

  -Don Albert-


Index: s240rc3/slurm/contribs/perlapi/libslurm/perl/conf.c
===================================================================
RCS file: /cvsroot/slurm/slurm/contribs/perlapi/libslurm/perl/conf.c,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 conf.c
--- s240rc3/slurm/contribs/perlapi/libslurm/perl/conf.c 12 Nov 2010 17:18:19 -0000      1.1.1.6
+++ s240rc3/slurm/contribs/perlapi/libslurm/perl/conf.c 16 Nov 2011 23:36:44 -0000
@@ -149,6 +149,7 @@
                STORE_FIELD(hv, conf, propagate_rlimits, charp);
        if(conf->propagate_rlimits_except)
                STORE_FIELD(hv, conf, propagate_rlimits_except, charp);
+       STORE_FIELD(hv, conf, reconf_keep_part_state, uint16_t);
        if(conf->resume_program)
                STORE_FIELD(hv, conf, resume_program, charp);
        STORE_FIELD(hv, conf, resume_rate, uint16_t);
@@ -340,6 +341,7 @@
        FETCH_FIELD(hv, conf, propagate_prio_process, uint16_t, TRUE);
        FETCH_FIELD(hv, conf, propagate_rlimits, charp, FALSE);
        FETCH_FIELD(hv, conf, propagate_rlimits_except, charp, FALSE);
+       FETCH_FIELD(hv, conf, reconf_keep_part_state, uint16_t, TRUE);
        FETCH_FIELD(hv, conf, resume_program, charp, FALSE);
        FETCH_FIELD(hv, conf, resume_rate, uint16_t, TRUE);
        FETCH_FIELD(hv, conf, resume_timeout, uint16_t, TRUE);
Index: s240rc3/slurm/doc/man/man5/slurm.conf.5
===================================================================
RCS file: /cvsroot/slurm/slurm/doc/man/man5/slurm.conf.5,v
retrieving revision 1.1.1.59.6.1
diff -u -r1.1.1.59.6.1 slurm.conf.5
--- s240rc3/slurm/doc/man/man5/slurm.conf.5     25 Oct 2011 17:57:25 -0000      1.1.1.59.6.1
+++ s240rc3/slurm/doc/man/man5/slurm.conf.5     16 Nov 2011 23:37:41 -0000
@@ -1242,6 +1242,15 @@
 an authorized user. After being rebooting, the node is returned to normal use.

 .TP
+\fBReconfKeepPartState\fR
+If set to YES, an "scontrol reconfig" command will maintain the
+in-memory state of partitions that may have been dynamically updated
+by "scontrol update".  Partition information in the slurm.conf file
+will be merged with in-memory data.  The default is NO, which will
+rebuild the partition information using only the definitions in the
+slurm.conf file, whenever an "scontrol reconfig" is done.
+
+.TP
 \fBResumeProgram\fR
 SLURM supports a mechanism to reduce power consumption on nodes that
 remain idle for an extended period of time.
Index: s240rc3/slurm/slurm/slurm.h.in
===================================================================
RCS file: /cvsroot/slurm/slurm/slurm/slurm.h.in,v
retrieving revision 1.1.1.53.6.2
diff -u -r1.1.1.53.6.2 slurm.h.in
--- s240rc3/slurm/slurm/slurm.h.in      27 Oct 2011 18:52:22 -0000      1.1.1.53.6.2
+++ s240rc3/slurm/slurm/slurm.h.in      16 Nov 2011 23:38:08 -0000
@@ -1887,6 +1887,7 @@
        char *propagate_rlimits;/* Propagate (all/specific) resource limits */
        char *propagate_rlimits_except;/* Propagate all rlimits except these */
        char *reboot_program;   /* program to reboot the node */
+       uint16_t reconf_keep_part_state; /* keep partition state on scontrol reconfig */
        char *resume_program;   /* program to make nodes full power */
        uint16_t resume_rate;   /* nodes to make full power, per minute */
        uint16_t resume_timeout;/* time required in order to perform a node
Index: s240rc3/slurm/src/api/config_info.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/api/config_info.c,v
retrieving revision 1.1.1.40
diff -u -r1.1.1.40 config_info.c
--- s240rc3/slurm/src/api/config_info.c 18 Oct 2011 16:09:22 -0000      1.1.1.40
+++ s240rc3/slurm/src/api/config_info.c 16 Nov 2011 23:38:32 -0000
@@ -742,6 +742,14 @@
        key_pair->value = xstrdup(slurm_ctl_conf_ptr->reboot_program);
        list_append(ret_list, key_pair);

+       key_pair = xmalloc(sizeof(config_key_pair_t));
+       key_pair->name = xstrdup("ReconfKeepPartState");
+       if(slurm_ctl_conf_ptr->reconf_keep_part_state)
+               key_pair->value = xstrdup("YES");
+       else
+               key_pair->value = xstrdup("NO");
+       list_append(ret_list, key_pair);
+
        key_pair = xmalloc(sizeof(config_key_pair_t));
        key_pair->name = xstrdup("ResumeProgram");
        key_pair->value = xstrdup(slurm_ctl_conf_ptr->resume_program);
Index: s240rc3/slurm/src/common/read_config.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/common/read_config.c,v
retrieving revision 1.1.1.55
diff -u -r1.1.1.55 read_config.c
--- s240rc3/slurm/src/common/read_config.c      18 Oct 2011 16:08:40 -0000      1.1.1.55
+++ s240rc3/slurm/src/common/read_config.c      16 Nov 2011 23:38:53 -0000
@@ -240,6 +240,7 @@
        {"PropagateResourceLimitsExcept", S_P_STRING},
        {"PropagateResourceLimits", S_P_STRING},
        {"RebootProgram", S_P_STRING},
+       {"ReconfKeepPartState", S_P_BOOLEAN},
        {"ResumeProgram", S_P_STRING},
        {"ResumeRate", S_P_UINT16},
        {"ResumeTimeout", S_P_UINT16},
@@ -1961,6 +1962,7 @@
        xfree (ctl_conf_ptr->propagate_rlimits);
        xfree (ctl_conf_ptr->propagate_rlimits_except);
        xfree (ctl_conf_ptr->reboot_program);
+       ctl_conf_ptr->reconf_keep_part_state    = 0;
        ctl_conf_ptr->resume_timeout            = 0;
        xfree (ctl_conf_ptr->resume_program);
        ctl_conf_ptr->resume_rate               = (uint16_t) NO_VAL;
@@ -2933,6 +2935,10 @@
                              conf->propagate_rlimits);
        }

+       if (!s_p_get_boolean((bool *) &conf->reconf_keep_part_state,
+                            "ReconfKeepPartState", hashtbl))
+               conf->reconf_keep_part_state = DEFAULT_RECONF_KEEP_PART_STATE;
+
        if (!s_p_get_uint16(&conf->ret2service, "ReturnToService", hashtbl))
                conf->ret2service = DEFAULT_RETURN_TO_SERVICE;
 #ifdef HAVE_CRAY
Index: s240rc3/slurm/src/common/read_config.h
===================================================================
RCS file: /cvsroot/slurm/slurm/src/common/read_config.h,v
retrieving revision 1.1.1.43
diff -u -r1.1.1.43 read_config.h
--- s240rc3/slurm/src/common/read_config.h      18 Oct 2011 16:08:36 -0000      1.1.1.43
+++ s240rc3/slurm/src/common/read_config.h      16 Nov 2011 23:41:07 -0000
@@ -109,6 +109,7 @@
 #define DEFAULT_PRIORITY_DECAY      604800 /* 7 days */
 #define DEFAULT_PRIORITY_CALC_PERIOD 300 /* in seconds */
 #define DEFAULT_PRIORITY_TYPE       "priority/basic"
+#define DEFAULT_RECONF_KEEP_PART_STATE 0
 #define DEFAULT_RETURN_TO_SERVICE   0
 #define DEFAULT_RESUME_RATE         300
 #define DEFAULT_RESUME_TIMEOUT      60
Index: s240rc3/slurm/src/common/slurm_protocol_pack.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/common/slurm_protocol_pack.c,v
retrieving revision 1.1.1.50
diff -u -r1.1.1.50 slurm_protocol_pack.c
--- s240rc3/slurm/src/common/slurm_protocol_pack.c      18 Oct 2011 16:08:30 -0000      1.1.1.50
+++ s240rc3/slurm/src/common/slurm_protocol_pack.c      16 Nov 2011 23:41:27 -0000
@@ -4539,6 +4539,7 @@
                packstr(build_ptr->propagate_rlimits_except, buffer);

                packstr(build_ptr->reboot_program, buffer);
+               pack16(build_ptr->reconf_keep_part_state, buffer);
                packstr(build_ptr->resume_program, buffer);
                pack16(build_ptr->resume_rate, buffer);
                pack16(build_ptr->resume_timeout, buffer);
@@ -5385,6 +5386,7 @@

                safe_unpackstr_xmalloc(&build_ptr->reboot_program, &uint32_tmp,
                                       buffer);
+               safe_unpack16(&build_ptr->reconf_keep_part_state, buffer);
                safe_unpackstr_xmalloc(&build_ptr->resume_program,
                                       &uint32_tmp, buffer);
                safe_unpack16(&build_ptr->resume_rate, buffer);
Index: s240rc3/slurm/src/slurmctld/proc_req.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/slurmctld/proc_req.c,v
retrieving revision 1.1.1.62
diff -u -r1.1.1.62 proc_req.c
--- s240rc3/slurm/src/slurmctld/proc_req.c      25 Oct 2011 14:35:51 -0000      1.1.1.62
+++ s240rc3/slurm/src/slurmctld/proc_req.c      16 Nov 2011 23:41:51 -0000
@@ -572,6 +572,7 @@
                                                     propagate_rlimits_except);

        conf_ptr->reboot_program      = xstrdup(conf->reboot_program);
+       conf_ptr->reconf_keep_part_state = conf->reconf_keep_part_state;
        conf_ptr->resume_program      = xstrdup(conf->resume_program);
        conf_ptr->resume_rate         = conf->resume_rate;
        conf_ptr->resume_timeout      = conf->resume_timeout;
Index: s240rc3/slurm/src/slurmctld/read_config.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/slurmctld/read_config.c,v
retrieving revision 1.1.1.58
diff -u -r1.1.1.58 read_config.c
--- s240rc3/slurm/src/slurmctld/read_config.c   18 Oct 2011 16:09:34 -0000      1.1.1.58
+++ s240rc3/slurm/src/slurmctld/read_config.c   16 Nov 2011 23:42:14 -0000
@@ -778,7 +778,7 @@
                                                 old_node_record_count);
                        error_code = MAX(error_code, rc);  /* not fatal */
                }
-               if (old_part_list && (recover > 1)) {
+               if (old_part_list && ((recover > 1) || slurmctld_conf.reconf_keep_part_state)) {
                        info("restoring original partition state");
                        rc = _restore_part_state(old_part_list,
                                                 old_def_part_name);
