Option 2 seems best to me also.

Quoting [email protected]:

I have a bug report which points out a problem with the current algorithm
for deciding where the slurmctld core file should go,  which results in no
core file being produced on slurmctld abort.   The "man" file for
slurmctld describes the core file location algorithm as follows:

CORE FILE LOCATION
       If slurmctld is started with the -D option then the core file will be
       written to the current working directory. Otherwise if SlurmctldLogFile
       is a fully qualified path name (starting with a slash), the core file
       will be written to the same directory as the log file. Otherwise the
       core file will be written to the StateSaveLocation. The command
       "scontrol abort" can be used to abort the slurmctld daemon and generate
       a core file.
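
For reference, the selection rules quoted above can be sketched as a small C function. This is only an illustration of the documented algorithm, not the code in "src/slurmctld/controller.c"; the function name and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the actual slurmctld code): choose the core
 * file directory according to the rules quoted from the man page.
 *   foreground - nonzero if slurmctld was started with -D
 *   log_file   - value of SlurmctldLogFile (may be NULL)
 *   state_dir  - value of StateSaveLocation
 * Writes the chosen directory into out; returns 0 on success, -1 on error.
 */
int core_dir_per_man_page(int foreground, const char *log_file,
                          const char *state_dir, char *out, size_t out_len)
{
    int n;

    if (foreground) {
        /* -D: the core file goes to the current working directory */
        return getcwd(out, out_len) ? 0 : -1;
    }

    if (log_file && log_file[0] == '/') {
        /* Absolute log path: use the directory part of the path */
        const char *slash = strrchr(log_file, '/');
        size_t len = (size_t)(slash - log_file);

        if (len == 0)
            len = 1;            /* log file sits directly under "/" */
        if (len >= out_len)
            return -1;
        memcpy(out, log_file, len);
        out[len] = '\0';
        return 0;
    }

    /* Otherwise: fall through to StateSaveLocation */
    n = snprintf(out, out_len, "%s", state_dir);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

With SlurmctldLogFile set to "/var/log/slurmctld.log" and no -D option, this sketch selects "/var/log", which is the case discussed in the bug report.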

In the case described in the bug report,  the user configured the
SlurmctldLogFile parameter to place the log file at
"/var/log/slurmctld.log".  The /var/log directory contains log files from
various other applications and daemons,  and seems like a reasonable place
to put the "slurmctld.log" file.   The log file was created successfully,
but the user was unable to obtain a slurmctld core file dump under this
directory.

The problem seems to be that the code in "src/slurmctld/controller.c",
just after making the slurmctld a daemon process, checks for the slurmctld
log file pathname being an absolute pathname (i.e., beginning with a "/"),
 and if it is, does a "chdir" to the directory containing the log file. At
this point the log file can be created and the "chdir" succeeds, because
the slurmctld process is still running as "root" and has the necessary
permissions on the "/var/log" directory.   However, soon after this, the
process does a "setuid" to "SlurmUser" and gives up root privileges. Since
the process no longer has write permission on "/var/log",  even though its
working directory is set there,  it cannot create a core file on abort.
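
A quick way to confirm this diagnosis would be to check, after the "setuid", whether the working directory is still writable. The following is a minimal sketch with a hypothetical function name; it is not part of the slurmctld source.

```c
#include <unistd.h>

/*
 * Hypothetical diagnostic: returns 1 if the calling process may create
 * files in its current working directory, 0 otherwise.
 *
 * access() checks permissions against the real UID/GID. When root calls
 * setuid(), the real UID changes as well, so calling this after the
 * privilege drop reflects what SlurmUser is allowed to do.
 */
int cwd_is_writable(void)
{
    /* Creating a file needs write and search (execute) permission */
    return access(".", W_OK | X_OK) == 0;
}
```

This covers only the directory-permission cause described here; other settings, such as the core file size limit, also affect whether a core file is produced.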

There are several possible solutions:

1.  Update the documentation to say that the "slurmctld.log" file should
be placed under a subdirectory (e.g., /var/log/slurm/slurmctld.log) and
that the subdirectory should have permissions for "SlurmUser".   Also add
some words to explain that this is necessary so that the core file can be
written there if slurmctld aborts.

2.  Eliminate the log file directory path as a location for the core file,
and just do a "chdir" to the StateSaveLocation directory instead,  since
this directory is already documented as requiring create file permissions
for SlurmUser.   In fact,  the latest "man" file for "slurm.conf" already
says that the core files for the SLURM daemons are written there (see the
last sentence of the excerpt below):

      StateSaveLocation
             Fully qualified pathname of a directory into which the SLURM
             controller, slurmctld, saves its state (e.g.
             "/usr/local/slurm/checkpoint"). SLURM state will saved here to
             recover from system failures. SlurmUser must be able to create
             files in this directory. If you have a BackupController
             configured, this location should be readable and writable by both
             systems. Since all running and pending job information is stored
             here, the use of a reliable file system (e.g. RAID) is
             recommended. The default value is "/var/spool". If any slurm
             daemons terminate abnormally, their core files will also be
             written into this directory.

3.  Add some additional code after the "chdir" to the slurmctld.log
directory to check if "SlurmUser" can create and write files into that
directory, and if not, fall back to using StateSaveLocation.   This would
retain the existing algorithm for log file placement,  but be a little
more robust.
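
As a rough illustration of alternative 3 (which also contains the "chdir" to StateSaveLocation from alternative 2 as its fallback path), the check could look something like the sketch below. The function names and the probe file name are hypothetical, and the sketch assumes it runs after the "setuid" to SlurmUser so that the probe reflects SlurmUser's permissions.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns 1 if the calling user can create and write
 * a file in dir, 0 otherwise. The probe file is removed again.
 */
int can_create_in(const char *dir)
{
    char path[4096];
    int fd, ok;
    int n = snprintf(path, sizeof path, "%s/.core_probe_%ld",
                     dir, (long)getpid());

    if (n < 0 || (size_t)n >= sizeof path)
        return 0;

    /* O_EXCL: never reuse or truncate an existing file */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return 0;

    ok = (write(fd, "x", 1) == 1);
    close(fd);
    unlink(path);
    return ok;
}

/*
 * Hypothetical helper: chdir to the log directory if files can be
 * created there, otherwise fall back to StateSaveLocation.
 * Returns the directory chosen, or NULL if neither chdir succeeded.
 */
const char *chdir_for_core(const char *log_dir, const char *state_dir)
{
    if (can_create_in(log_dir) && chdir(log_dir) == 0)
        return log_dir;
    if (chdir(state_dir) == 0)
        return state_dir;
    return NULL;
}
```

Alternative 2 would reduce this to the single "chdir(state_dir)" call, which is part of why it is the simpler change.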

I am willing to provide patches for the documentation and/or code for any
of the above alternatives,  but I would like to know what is the most
acceptable solution.  I favor number 2,  unless there is some strong
reason for wanting the core file to be written to the log directory if
possible.

        -Don Albert-



