I like number 2 as well. It keeps things simple.

A user can override it with /proc/s/k/core_pattern if they *really* need to.

On Oct 31, 2011, at 1:14 PM, [email protected] wrote:

> 
> I have a bug report which points out a problem with the current algorithm for 
> deciding where the slurmctld core file should go,  which results in no core 
> file being produced on slurmctld abort.   The "man" file for slurmctld 
> describes the core file location algorithm as follows: 
> 
> CORE FILE LOCATION 
>        If slurmctld is started with the -D option then the core file will be
>        written to the current working directory.  Otherwise if SlurmctldLogFile
>        is a fully qualified path name (starting with a slash), the core
>        file will be written to the same directory as the log file.  Otherwise
>        the core file will be written to the StateSaveLocation.  The command
>        "scontrol abort" can be used to abort the slurmctld daemon and generate
>        a core file.
> 
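
For reference, the documented algorithm quoted above amounts to roughly the following. This is only a Python sketch of the logic as the man page describes it, not the C code in controller.c; the function and argument names are made up for illustration.

```python
import posixpath


def documented_core_dir(foreground, slurmctld_logfile, state_save_location, cwd):
    """Return the directory the man page says the core file goes to.

    foreground          -- True if slurmctld was started with -D
    slurmctld_logfile   -- value of SlurmctldLogFile (may be empty)
    state_save_location -- value of StateSaveLocation
    cwd                 -- working directory at startup
    All four names are hypothetical; only the three-way rule is from the man page.
    """
    if foreground:
        # -D: the core file goes to the current working directory.
        return cwd
    if slurmctld_logfile and slurmctld_logfile.startswith("/"):
        # Fully qualified log path: same directory as the log file.
        return posixpath.dirname(slurmctld_logfile)
    # Otherwise: StateSaveLocation.
    return state_save_location
```

With the configuration from the bug report, documented_core_dir(False, "/var/log/slurmctld.log", "/var/spool", "/") picks "/var/log", which is the directory where the problem shows up.
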
> In the case described in the bug report,  the user configured the 
> SlurmctldLogFile parameter to place the log file at  
> "/var/log/slurmctld.log".  The /var/log directory contains log files from 
> various other applications and daemons,  and seems like a reasonable place to 
> put the "slurmctld.log" file.   The log file was created successfully,  but 
> the user was unable to obtain a slurmctld core file dump under this 
> directory. 
> 
> The problem seems to be that the code in "src/slurmctld/controller.c",  just 
> after making the slurmctld a daemon process, checks for the slurmctld log 
> file pathname being an absolute pathname (i.e., beginning with a "/"),  and 
> if it is, does a "chdir" to the directory containing the log file.   At this 
> point the log file can be created and the "chdir" succeeds, because the 
> slurmctld process is still running as "root" and has the necessary 
> permissions on the "/var/log" directory.   However, soon after this, the 
> process does a "setuid" to "SlurmUser" and gives up root privileges.  Since 
> the process now lacks write permission on "/var/log" even though its 
> working directory is set there,  the process cannot create a core file on 
> abort. 
> 
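
The failure described above comes down to the classic Unix permission check on the working directory. Here is a small Python sketch of that check, assuming /var/log is owned by root:root with mode 0755 (a common default, not something stated in the report) and using a made-up uid of 1001 for SlurmUser. It looks only at the mode bits and ignores ACLs and capabilities.

```python
import stat


def can_create_files(dir_mode, dir_uid, dir_gid, uid, gids):
    """Return True if a process with this uid and group set may create
    files in a directory with the given mode and ownership."""
    if uid == 0:
        # root is not restricted by the permission bits.
        return True
    if uid == dir_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    # Creating a file needs both write and search (execute) permission.
    return (dir_mode & need) == need


# Before setuid: running as root, so the chdir and the log file creation work.
print(can_create_files(0o755, 0, 0, uid=0, gids={0}))
# After setuid to the hypothetical SlurmUser: no write bit for "other".
print(can_create_files(0o755, 0, 0, uid=1001, gids={1001}))
```

The working directory stays at /var/log after the setuid, but the second check is the one that matters when the kernel tries to write the core file.
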
> There are several possible solutions: 
> 
> 1.  Update the documentation to say that the "slurmctld.log" file should be 
> placed under a subdirectory (e.g., /var/log/slurm/slurmctld.log) and that the 
> subdirectory should have permissions for "SlurmUser".   Also add some words 
> to explain that this is necessary so that the core file can be written there 
> if slurmctld aborts. 
> 
> 2.  Eliminate the log file directory path as a location for the core file, 
> and just do a "chdir" to the StateSaveLocation directory instead,  since this 
> directory is already documented as requiring create file permissions for 
> SlurmUser.   In fact, the latest "man" file for "slurm.conf" already 
> implies that the core files for the SLURM daemons are written there (see 
> the last sentence of the excerpt below): 
> 
>       StateSaveLocation 
>              Fully  qualified  pathname  of  a directory into which the SLURM 
>              controller,     slurmctld,     saves     its     state     (e.g. 
>              "/usr/local/slurm/checkpoint").   SLURM state will saved here to 
>              recover from system failures.  SlurmUser must be able to  create 
>              files in this directory.  If you have a BackupController config- 
>              ured, this location should be readable and writable by both sys- 
>              tems.   Since  all running and pending job information is stored 
>              here, the use of a reliable file system (e.g.  RAID)  is  recom- 
>              mended.   The  default value is "/var/spool".  If any slurm dae- 
>              mons terminate abnormally, their core files will also be written 
>              into this directory. 
> 
> 3.  Add some additional code after the "chdir" to the slurmctld.log directory 
> to check if "SlurmUser" can create and write files into that directory, and 
> if not, fall back to using StateSaveLocation.   This would retain the existing 
> algorithm for core file placement,  but be a little more robust. 
> 
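
For comparison, the extra check in option 3 could look something like the sketch below. It is Python rather than the daemon's C, and the function names and the can_write hook are hypothetical. The one real requirement is that the check runs after privileges have been dropped, since os.access() tests the real uid and root would pass it regardless of the mode bits.

```python
import os


def _writable(path):
    # True only if the directory exists and the real uid may create files in it.
    return os.access(path, os.W_OK | os.X_OK)


def choose_core_dir(log_dir, state_save_location, can_write=_writable):
    """Prefer the log directory, but fall back to StateSaveLocation
    when the current user cannot create files in the log directory."""
    if can_write(log_dir):
        return log_dir
    return state_save_location
```

The caller would then do os.chdir(choose_core_dir(log_dir, state_save_location)). Option 2 is the same thing with the first branch removed, which is a large part of why it is simpler.
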
> I am willing to provide patches for the documentation and/or code for any of 
> the above alternatives,  but I would like to know what is the most acceptable 
> solution.  I favor number 2,  unless there is some strong reason for wanting 
> the core file to be written to the log directory if possible. 
> 
>         -Don Albert- 
