We are using Slurm 2.4.0. We are trying to give users some memory usage statistics for their batch script at the end of their slurm-nnn.out, by running

  sstat -j $SLURM_JOB_ID.batch -o jobid,maxvmsize,maxrss

at the end of the script. (We also run "sacct -j $SLURM_JOB_ID" to get information about job steps started with srun or mpirun, but that works fine in all cases below.)

This works fine for jobs that run on only one node. For instance, with the following job script:

$ cat sstattest.sm
#!/bin/sh
#SBATCH --job-name=sstattest
#SBATCH --account=staff
#SBATCH --time=0:10:0 --mem-per-cpu=1G
srun -l hostname
sstat -j $SLURM_JOB_ID.batch -o jobid,maxvmsize,maxrss

One task works:

$ sbatch sstattest.sm
Submitted batch job 193
$ cat slurm-193.out
0: compute-6-1.local
       JobID  MaxVMSize     MaxRSS
------------ ---------- ----------
193.batch       271484K      3720K

Two tasks on the same node work:

$ sbatch --ntasks=2 --nodes=1 sstattest.sm
Submitted batch job 194
$ cat slurm-194.out
1: compute-6-1.local
0: compute-6-1.local
       JobID  MaxVMSize     MaxRSS
------------ ---------- ----------
194.batch       271488K      3724K

Two tasks on separate nodes don't work:

$ sbatch --ntasks=2 --nodes=2 sstattest.sm
Submitted batch job 195
$ cat slurm-195.out
0: compute-6-1.local
1: compute-6-2.local
       JobID  MaxVMSize     MaxRSS
------------ ---------- ----------
sstat: error: slurm_job_step_stat: there was an error with the request to c6-2 rc = Invalid job id specified
sstat: error: problem getting step_layout for 195.4294967294: Invalid job id specified

The slurmd.log on c6-1 contains:

[2012-07-03T17:08:32] [193] Handling REQUEST_INFO
[2012-07-03T17:08:32] debug: Entering stepd_stat_jobacct for job 193.4294967294
[2012-07-03T17:08:32] [193] Handling REQUEST_STEP_STAT
[2012-07-03T17:08:32] [193] _handle_stat_jobacct for job 193.4294967294
[2012-07-03T17:08:32] [193] Job 193 memory used:3720 limit:1048576 KB
[2012-07-03T17:08:32] [193] Handling REQUEST_STEP_LIST_PIDS
[2012-07-03T17:08:32] [193] _handle_list_pids for job 193.4294967294
[...]
[2012-07-03T17:10:21] [194] Handling REQUEST_INFO
[2012-07-03T17:10:21] debug: Entering stepd_stat_jobacct for job 194.4294967294
[2012-07-03T17:10:21] [194] Handling REQUEST_STEP_STAT
[2012-07-03T17:10:21] [194] _handle_stat_jobacct for job 194.4294967294
[2012-07-03T17:10:21] [194] Job 194 memory used:3724 limit:2097152 KB
[2012-07-03T17:10:21] [194] Handling REQUEST_STEP_LIST_PIDS
[2012-07-03T17:10:21] [194] _handle_list_pids for job 194.4294967294

(but nothing similar for job 195)

The slurmd.log on c6-2 contains:

[2012-07-03T17:11:02] [195.0] task 1 (8002) exited with exit code 0.
[2012-07-03T17:11:02] [195.0] task_post_term: 195.0, task 0
[2012-07-03T17:11:02] [195.0] Waiting for IO
[2012-07-03T17:11:02] [195.0] Closing debug channel
[2012-07-03T17:11:02] [195.0] IO handler exited, rc=0
[2012-07-03T17:11:02] [195.0] Message thread exited
[2012-07-03T17:11:02] [195.0] done with job
[2012-07-03T17:11:02] error: stat_jobacct for invalid job_id: 195
[2012-07-03T17:11:02] debug: _rpc_terminate_job, uid = 501
[2012-07-03T17:11:02] debug: task_slurmd_release_resources: 195

Is there something wrong here, or are we doing something wrong?

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo