The original epilog works for me and both env vars should be set. See src/slurmd/slurmd/req.c:
        setenvf(&env, "SLURM_JOB_UID",   "%u", job_env->uid);
        setenvf(&env, "SLURM_UID",   "%u", job_env->uid);

Quoting Kevin Abbey <[email protected]>:

Hi,

I upgraded from 2.5.3 to 14.03.8 a few days ago. We share nodes among multiple jobs, and we suddenly realized that once one of the jobs on a node ended, all of the other jobs on that node were being killed. I'm not sure whether I broke the system during the upgrade; I did not examine the database. I checked the epilog since that is a likely source. I noticed that the variable $SLURM_UID, which was used in the epilog script I had been running since 2.5.3, is still referenced in the script shipped with the new version. However, that variable is no longer present when I do salloc and then env | grep SLURM. I pasted the output below.

The epilog file provided with 14.03.8 is the same as with 2.5.3, but since $SLURM_UID is not present I modified the script to use $SLURM_JOB_UID instead. It appears to be working as expected now. Also, if anyone sets the path to the slurm bin directory via SLURM_BIN in the script, be sure to add the / before squeue on the job list line.
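
To make the SLURM_BIN point concrete, here is a sketch with a hypothetical install path set without a trailing slash; the extra / is what keeps the expansion a valid command path:

        SLURM_BIN="/opt/slurm/bin"   # hypothetical path, no trailing slash
        job_list=`${SLURM_BIN}/squeue --noheader --format=%i --user=$SLURM_JOB_UID --node=localhost`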

I have made the changes shown in the diff below. Can anyone confirm this problem and solution? If there is something wrong in my database, I'd like to find it.

Thank you,
Kevin

--- slurm.epilog.clean.default  2014-10-13 18:39:29.146111716 -0400
+++ slurm.epilog.clean  2014-10-13 18:51:51.192922992 -0400
@@ -8,7 +8,7 @@
 # SLURM_BIN can be used for testing with private version of SLURM
 #SLURM_BIN="/usr/bin/"
 #
-if [ x$SLURM_UID = "x" ] ; then
+if [ x$SLURM_JOB_UID = "x" ] ; then
        exit 0
 fi
 if [ x$SLURM_JOB_ID = "x" ] ; then
@@ -18,20 +18,19 @@
 #
 # Don't try to kill user root or system daemon jobs
 #
-if [ $SLURM_UID -lt 100 ] ; then
+if [ $SLURM_JOB_UID -lt 100 ] ; then
        exit 0
 fi

-job_list=`${SLURM_BIN}squeue --noheader --format=%A --user=$SLURM_UID --node=localhost`
+job_list=`${SLURM_BIN}/squeue --noheader --format=%i --user=$SLURM_JOB_UID --node=localhost`
 for job_id in $job_list
 do
        if [ $job_id -ne $SLURM_JOB_ID ] ; then
                exit 0
        fi
 done
-
 #
 # No other SLURM jobs, purge all remaining processes of this user
 #
-pkill -KILL -U $SLURM_UID
+pkill -KILL -U $SLURM_JOB_UID
 exit 0


==============================================================

[kabbey@kestrel slurm]$ salloc --partition=batch --nodelist=node12  --mem=30G
salloc: Granted job allocation 16010

[kabbey@node12 slurm]$ env | grep SLURM
SLURM_CHECKPOINT_IMAGE_DIR=/g1/home/kabbey/software_tests/slurm
SLURM_NODELIST=node12
SLURMD_NODENAME=node12
SLURM_TOPOLOGY_ADDR=node12
SLURM_PRIO_PROCESS=0
SLURM_SRUN_COMM_PORT=48272
SLURM_PTY_WIN_ROW=40
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_NNODES=1
SLURM_STEP_NUM_NODES=1
SLURM_JOBID=16010
SLURM_NTASKS=1
SLURM_LAUNCH_NODE_IPADDR=192.168.0.169
SLURM_STEP_ID=0
SLURM_STEP_LAUNCHER_PORT=48272
SLURM_TASKS_PER_NODE=1
SLURM_JOB_ID=16010
SLURM_JOB_USER=kabbey
SLURM_STEPID=0
SLURM_SRUN_COMM_HOST=192.168.0.169
SLURM_PTY_WIN_COL=209
SLURM_JOB_UID=12901
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/g1/home/kabbey/software_tests/slurm
SLURM_TASK_PID=85622
SLURM_NPROCS=1
SLURM=/g0/opt/slurm/14.03.8
SLURM_CPUS_ON_NODE=2
SLURM_DISTRIBUTION=cyclic
SLURM_PROCID=0
SLURM_JOB_NODELIST=node12
SLURM_PTY_PORT=52017
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=2
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=kestrel.ccib.rutgers.edu
SLURM_JOB_PARTITION=batch
SLURM_STEP_NUM_TASKS=1
SLURM_JOB_NUM_NODES=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_NODELIST=node12
SLURM_MEM_PER_NODE=30720
[kabbey@node12 slurm]$

--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
 Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: [email protected]


--
Morris "Moe" Jette
CTO, SchedMD LLC
