I have a bug report that points out a problem with the current algorithm
for deciding where the slurmctld core file should go: in some
configurations no core file is produced when slurmctld aborts. The man
page for slurmctld describes the core file location algorithm as follows:

CORE FILE LOCATION
       If slurmctld is started with the -D option then the core file will be
       written to the current working directory. Otherwise if SlurmctldLogFile
       is a fully qualified path name (starting with a slash), the core file
       will be written to the same directory as the log file. Otherwise the
       core file will be written to the StateSaveLocation. The command
       "scontrol abort" can be used to abort the slurmctld daemon and generate
       a core file.
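
As a rough illustration of the rule quoted above, here is a minimal C
sketch of the directory selection. It is not the code from
"src/slurmctld/controller.c"; the function name and its parameters are
hypothetical and only mirror the three cases in the man page text.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the SLURM source): pick the directory in
 * which a core file would be written, following the man page wording.
 *   foreground     - true if slurmctld was started with -D
 *   log_file       - value of SlurmctldLogFile (may be NULL)
 *   state_save_loc - value of StateSaveLocation
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
int core_dir_per_man_page(bool foreground, const char *log_file,
                          const char *state_save_loc,
                          char *buf, size_t len)
{
    int r;

    if (foreground) {
        /* -D: stay in the current working directory. */
        r = snprintf(buf, len, ".");
        return (r >= 0 && (size_t)r < len) ? 0 : -1;
    }

    if (log_file && log_file[0] == '/') {
        /* Absolute log path: use its directory part. */
        const char *slash = strrchr(log_file, '/');
        size_t n = (size_t)(slash - log_file);

        if (n == 0)
            n = 1;          /* log file sits directly under "/" */
        if (n >= len)
            return -1;
        memcpy(buf, log_file, n);
        buf[n] = '\0';
        return 0;
    }

    /* Relative or unset log path: use StateSaveLocation. */
    r = snprintf(buf, len, "%s", state_save_loc);
    return (r >= 0 && (size_t)r < len) ? 0 : -1;
}
```

With SlurmctldLogFile set to "/var/log/slurmctld.log" and no -D option,
this sketch selects "/var/log", which is the case discussed next.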

In the case described in the bug report,  the user configured the 
SlurmctldLogFile parameter to place the log file at 
"/var/log/slurmctld.log".  The /var/log directory contains log files from 
various other applications and daemons,  and seems like a reasonable place 
to put the "slurmctld.log" file.   The log file was created successfully, 
but the user was unable to obtain a slurmctld core file dump under this 
directory.

The problem seems to be that the code in "src/slurmctld/controller.c",
just after making slurmctld a daemon process, checks whether the slurmctld
log file pathname is an absolute pathname (i.e., begins with a "/"), and
if it is, does a "chdir" to the directory containing the log file. At
this point the log file can be created and the "chdir" succeeds, because
the slurmctld process is still running as "root" and has the necessary
permissions on the "/var/log" directory. However, soon after this, the
process does a "setuid" to "SlurmUser" and gives up root privileges. The
process then no longer has write permission on "/var/log", even though
its working directory is still set there, so it cannot create a core file
on abort.

There are several possible solutions:

1.  Update the documentation to say that the "slurmctld.log" file should 
be placed under a subdirectory (e.g., /var/log/slurm/slurmctld.log) and 
that the subdirectory should have permissions for "SlurmUser".   Also add 
some words to explain that this is necessary so that the core file can be 
written there if slurmctld aborts.

2.  Eliminate the log file directory path as a location for the core file, 
and just do a "chdir" to the StateSaveLocation directory instead,  since 
this directory is already documented as requiring create file permissions 
for SlurmUser.   In fact, the latest man page for "slurm.conf" already
implies that the core files for the SLURM daemons are written there (see
the last sentence of the excerpt below):

      StateSaveLocation
             Fully qualified pathname of a directory into which the SLURM
             controller, slurmctld, saves its state (e.g.
             "/usr/local/slurm/checkpoint"). SLURM state will saved here to
             recover from system failures. SlurmUser must be able to create
             files in this directory. If you have a BackupController
             configured, this location should be readable and writable by
             both systems. Since all running and pending job information is
             stored here, the use of a reliable file system (e.g. RAID) is
             recommended. The default value is "/var/spool". If any slurm
             daemons terminate abnormally, their core files will also be
             written into this directory.

3.  Add some additional code after the "chdir" to the slurmctld.log
directory to check whether "SlurmUser" can create and write files in that
directory, and if not, fall back to using StateSaveLocation.   This would
retain the existing algorithm for core file placement,  but be a little
more robust.
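
To make alternative 3 more concrete, here is a minimal C sketch of the
fallback check. It is not existing SLURM code: the function name is
hypothetical, and it assumes the check runs after the "setuid" to
SlurmUser, so that access(), which tests against the real user ID, reflects
SlurmUser's permissions rather than root's.

```c
#include <unistd.h>

/*
 * Hypothetical helper (not from the SLURM source). Intended to be called
 * after the daemon has switched to SlurmUser, so that access() reports
 * what SlurmUser is allowed to do.
 * Returns the directory that was entered, or NULL if neither is usable.
 */
const char *enter_core_dir(const char *log_dir, const char *state_save_loc)
{
    /* W_OK to create the core file, X_OK to search the directory. */
    if (log_dir && access(log_dir, W_OK | X_OK) == 0 &&
        chdir(log_dir) == 0)
        return log_dir;

    /* Fall back to StateSaveLocation, which is already documented as
     * requiring create-file permission for SlurmUser. */
    if (state_save_loc && access(state_save_loc, W_OK | X_OK) == 0 &&
        chdir(state_save_loc) == 0)
        return state_save_loc;

    return NULL;
}
```

In the case from the bug report, a call such as
enter_core_dir("/var/log", state_save_location) made as SlurmUser would
be expected to skip "/var/log" and end up in StateSaveLocation, provided
that directory is set up as the documentation requires.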

I am willing to provide patches for the documentation and/or code for any
of the above alternatives,  but I would like to know which solution is
most acceptable.  I favor number 2,  unless there is some strong reason
for wanting the core file to be written to the log directory when
possible.

        -Don Albert-
