I like number 2 as well. It keeps things simple. A user can override the core file location with /proc/sys/kernel/core_pattern if they *really* need to.
On Oct 31, 2011, at 1:14 PM, [email protected] wrote:

> I have a bug report which points out a problem with the current algorithm
> for deciding where the slurmctld core file should go, which results in no
> core file being produced on slurmctld abort. The "man" file for slurmctld
> describes the core file location algorithm as follows:
>
>   CORE FILE LOCATION
>       If slurmctld is started with the -D option then the core file will
>       be written to the current working directory. Otherwise if
>       SlurmctldLogFile is a fully qualified path name (starting with a
>       slash), the core file will be written to the same directory as the
>       log file. Otherwise the core file will be written to the
>       StateSaveLocation. The command "scontrol abort" can be used to
>       abort the slurmctld daemon and generate a core file.
>
> In the case described in the bug report, the user configured the
> SlurmctldLogFile parameter to place the log file at
> "/var/log/slurmctld.log". The /var/log directory contains log files from
> various other applications and daemons, and seems like a reasonable place
> to put the "slurmctld.log" file. The log file was created successfully,
> but the user was unable to obtain a slurmctld core file dump under this
> directory.
>
> The problem seems to be that the code in "src/slurmctld/controller.c",
> just after making the slurmctld a daemon process, checks whether the
> slurmctld log file pathname is an absolute pathname (i.e., beginning with
> a "/"), and if it is, does a "chdir" to the directory containing the log
> file. At this point the log file can be created and the "chdir" succeeds,
> because the slurmctld process is still running as "root" and has the
> necessary permissions on the "/var/log" directory. However, soon after
> this, the process does a "setuid" to "SlurmUser" and gives up root
> privileges. Since the process now does not have write permission on
> "/var/log", even though its working directory is set there, it cannot
> create a core file on abort.
>
> There are several possible solutions:
>
> 1. Update the documentation to say that the "slurmctld.log" file should
>    be placed under a subdirectory (e.g., /var/log/slurm/slurmctld.log)
>    and that the subdirectory should have permissions for "SlurmUser".
>    Also add some words to explain that this is necessary so that the
>    core file can be written there if slurmctld aborts.
>
> 2. Eliminate the log file directory path as a location for the core
>    file, and just do a "chdir" to the StateSaveLocation directory
>    instead, since this directory is already documented as requiring
>    create file permissions for SlurmUser. In fact, the latest "man" file
>    for "slurm.conf" already implies that the core files for the SLURM
>    daemons are written there (see below in bold font):
>
>    StateSaveLocation
>        Fully qualified pathname of a directory into which the SLURM
>        controller, slurmctld, saves its state (e.g.
>        "/usr/local/slurm/checkpoint"). SLURM state will saved here to
>        recover from system failures. SlurmUser must be able to create
>        files in this directory. If you have a BackupController
>        configured, this location should be readable and writable by
>        both systems. Since all running and pending job information is
>        stored here, the use of a reliable file system (e.g. RAID) is
>        recommended. The default value is "/var/spool". If any slurm
>        daemons terminate abnormally, their core files will also be
>        written into this directory.
>
> 3. Add some additional code after the "chdir" to the slurmctld.log
>    directory to check if "SlurmUser" can create and write files in that
>    directory, and if not, fall back to using StateSaveLocation. This
>    would retain the existing algorithm for log file placement, but be a
>    little more robust.
>
> I am willing to provide patches for the documentation and/or code for any
> of the above alternatives, but I would like to know which solution is
> the most acceptable. I favor number 2, unless there is some strong reason
> for wanting the core file to be written to the log directory if possible.
>
> -Don Albert-
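
For comparison, the fallback described in option 3 could look roughly like the sketch below. It is written in Python for brevity rather than the C of controller.c, and the names choose_core_dir and can_create_files are hypothetical, not Slurm functions. It assumes the check runs after the daemon has already switched to SlurmUser, so that the permission test reflects the user that will actually write the core file.

```python
import os


def can_create_files(directory):
    """Return True if the calling user may create files in `directory`.

    os.access() tests the real uid/gid, so in this sketch the check is
    assumed to run after the setuid to SlurmUser has already happened.
    """
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


def choose_core_dir(log_file, state_save_location, foreground=False,
                    writable=can_create_files):
    """Pick the directory to chdir into so a core file can be written.

    Hypothetical helper mirroring the documented order, plus the
    writability check proposed in option 3:
      1. foreground (-D): stay in the current working directory
      2. absolute log file path: use its directory, if files can be created
      3. otherwise: fall back to StateSaveLocation
    """
    if foreground:
        return os.getcwd()
    if log_file and os.path.isabs(log_file):
        log_dir = os.path.dirname(log_file)
        if writable(log_dir):
            return log_dir
    return state_save_location
```

With SlurmctldLogFile set to "/var/log/slurmctld.log" and a /var/log that SlurmUser cannot write to, this returns the StateSaveLocation. Option 2 amounts to dropping the middle branch entirely, which is why it is the simpler change.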
