I have a bug report that points out a problem with the current algorithm
for deciding where the slurmctld core file should go: in the reported
configuration, no core file is produced when slurmctld aborts. The "man"
page for slurmctld describes the core file location algorithm as follows:

CORE FILE LOCATION
    If slurmctld is started with the -D option then the core file will
    be written to the current working directory. Otherwise if
    SlurmctldLogFile is a fully qualified path name (starting with a
    slash), the core file will be written to the same directory as the
    log file. Otherwise the core file will be written to the
    StateSaveLocation. The command "scontrol abort" can be used to abort
    the slurmctld daemon and generate a core file.
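
To make the rules concrete, the documented selection order can be
sketched in C roughly as follows. This is only an illustration of the
man page text, not the code in controller.c; the function name
pick_core_dir and its parameters are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a slurmctld function): choose the directory
 * the core file would land in, following the man page rules:
 *   1. -D given           -> current working directory
 *   2. absolute log path  -> directory containing the log file
 *   3. otherwise          -> StateSaveLocation
 * Copies the result into buf. Returns 0 on success, -1 on error.
 */
static int pick_core_dir(bool foreground, const char *cwd,
                         const char *log_file, const char *state_save_dir,
                         char *buf, size_t buflen)
{
    const char *src;
    size_t len;

    if (foreground) {
        src = cwd;
        len = cwd ? strlen(cwd) : 0;
    } else if (log_file && log_file[0] == '/') {
        /* Keep everything before the last slash. */
        src = log_file;
        len = (size_t)(strrchr(log_file, '/') - log_file);
        if (len == 0)
            len = 1;    /* log file sits directly under "/" */
    } else {
        src = state_save_dir;
        len = state_save_dir ? strlen(state_save_dir) : 0;
    }

    if (!src || len == 0 || len + 1 > buflen)
        return -1;
    memcpy(buf, src, len);
    buf[len] = '\0';
    return 0;
}
```

With SlurmctldLogFile set to "/var/log/slurmctld.log" and no -D option,
this sketch selects "/var/log", which is the case discussed below.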

In the case described in the bug report, the user configured the
SlurmctldLogFile parameter to place the log file at
"/var/log/slurmctld.log". The /var/log directory contains log files
from various other applications and daemons, and seems like a reasonable
place to put the "slurmctld.log" file. The log file was created
successfully, but the user was unable to obtain a slurmctld core file
dump under this directory.

The problem seems to be that the code in "src/slurmctld/controller.c",
just after making slurmctld a daemon process, checks whether the
slurmctld log file pathname is absolute (i.e., begins with a "/"), and
if it is, does a "chdir" to the directory containing the log file. At
this point the log file can be created and the "chdir" succeeds, because
the slurmctld process is still running as "root" and has the necessary
permissions on the "/var/log" directory. However, soon after this, the
process does a "setuid" to "SlurmUser" and gives up root privileges.
Since the process no longer has write permission on "/var/log", even
though its working directory is set there, it cannot create a core file
on abort.
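
As a rough illustration of why the core file cannot be written, here is
a small C sketch of the ordinary owner/group/other permission check for
creating a file in a directory. The helper names are made up for this
example, the uid/gid values in the note below are placeholders, and the
check deliberately ignores supplementary groups, ACLs, and root's
override, so treat it as an approximation only.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: would a non-root process running as uid/gid be
 * allowed to create files in a directory with this owner, group, and
 * mode? Creating a file needs both write and search (execute)
 * permission on the directory.
 */
static bool mode_allows_create(uid_t uid, gid_t gid,
                               uid_t dir_uid, gid_t dir_gid,
                               mode_t dir_mode)
{
    mode_t need_w, need_x;

    if (uid == dir_uid) {           /* owner bits apply */
        need_w = S_IWUSR;
        need_x = S_IXUSR;
    } else if (gid == dir_gid) {    /* group bits apply */
        need_w = S_IWGRP;
        need_x = S_IXGRP;
    } else {                        /* "other" bits apply */
        need_w = S_IWOTH;
        need_x = S_IXOTH;
    }
    return (dir_mode & need_w) != 0 && (dir_mode & need_x) != 0;
}

/* Hypothetical wrapper: apply the check to a directory on disk. */
static bool user_can_create_in(const char *dir, uid_t uid, gid_t gid)
{
    struct stat st;

    if (stat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return false;
    return mode_allows_create(uid, gid, st.st_uid, st.st_gid, st.st_mode);
}
```

Assuming /var/log is owned by root:root with mode 0755 (common, but not
universal), mode_allows_create(1000, 1000, 0, 0, 0755) is false for an
unprivileged SlurmUser, which matches the behavior in the bug report.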

There are several possible solutions:

1. Update the documentation to say that the "slurmctld.log" file should
be placed under a subdirectory (e.g., /var/log/slurm/slurmctld.log) and
that the subdirectory should have permissions for "SlurmUser". Also add
some words to explain that this is necessary so that the core file can
be written there if slurmctld aborts.

2. Eliminate the log file directory path as a location for the core
file, and just do a "chdir" to the StateSaveLocation directory instead,
since this directory is already documented as requiring create-file
permissions for SlurmUser. In fact, the latest "man" page for
"slurm.conf" already implies that the core files for the SLURM daemons
are written there (see the last sentence of the entry below):

StateSaveLocation
    Fully qualified pathname of a directory into which the SLURM
    controller, slurmctld, saves its state (e.g.
    "/usr/local/slurm/checkpoint"). SLURM state will saved here to
    recover from system failures. SlurmUser must be able to create
    files in this directory. If you have a BackupController configured,
    this location should be readable and writable by both systems.
    Since all running and pending job information is stored here, the
    use of a reliable file system (e.g. RAID) is recommended. The
    default value is "/var/spool". If any slurm daemons terminate
    abnormally, their core files will also be written into this
    directory.

3. Add some additional code after the "chdir" to the slurmctld.log
directory to check whether "SlurmUser" can create and write files in
that directory, and if not, fall back to using StateSaveLocation. This
would retain the existing algorithm for log file placement, but be a
little more robust.
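
For alternative 3, a minimal sketch of the fallback might look like the
following. It is not a patch against controller.c: the function name
chdir_for_core is made up, and the sketch assumes the check runs after
the "setuid" to SlurmUser, because access() called as root would succeed
regardless of the directory permissions.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: make "preferred" the working directory if this
 * process can create files there, otherwise use "fallback".
 * Intended to run after privileges have been dropped.
 * Returns 0 if preferred was used, 1 if fallback was used, and -1 if
 * neither directory could be entered.
 */
static int chdir_for_core(const char *preferred, const char *fallback)
{
    /* W_OK | X_OK: need to create entries in and search the directory. */
    if (preferred != NULL &&
        access(preferred, W_OK | X_OK) == 0 &&
        chdir(preferred) == 0)
        return 0;

    if (fallback != NULL && chdir(fallback) == 0)
        return 1;

    return -1;
}
```

Here "preferred" would be the log file directory and "fallback" the
StateSaveLocation. Doing the check while still root would instead need
a comparison of the directory's owner and mode against SlurmUser.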

I am willing to provide patches for the documentation and/or code for
any of the above alternatives, but I would like to know which solution
is most acceptable. I favor number 2, unless there is some strong
reason for wanting the core file to be written to the log directory if
possible.

-Don Albert-