We are using slurm 2.4.0.

We're trying to give users some memory usage statistics for their batch
script at the end of their slurm-nnn.out, by running

  sstat -j $SLURM_JOB_ID.batch -o jobid,maxvmsize,maxrss

at the end of the script.  (We also run "sacct -j $SLURM_JOB_ID" to get
information about job steps started with srun or mpirun, but that works
fine in all cases below.)
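
So the end of each batch script effectively looks like this (the sacct
format fields shown here are only an illustration, not necessarily the
exact list we use):

  sstat -j $SLURM_JOB_ID.batch -o jobid,maxvmsize,maxrss
  sacct -j $SLURM_JOB_ID -o jobid,jobname,maxvmsize,maxrss,elapsed,state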


This works fine for jobs that run on only one node.  For instance, with
the following job script:

$ cat sstattest.sm
#!/bin/sh
#SBATCH --job-name=sstattest
#SBATCH --account=staff
#SBATCH --time=0:10:0 --mem-per-cpu=1G

srun -l hostname
sstat -j $SLURM_JOB_ID.batch -o jobid,maxvmsize,maxrss


One task works:

$ sbatch sstattest.sm
Submitted batch job 193
$ cat slurm-193.out 
0: compute-6-1.local
       JobID  MaxVMSize     MaxRSS 
------------ ---------- ---------- 
193.batch       271484K      3720K 


Running two tasks on the same node also works:

$ sbatch --ntasks=2 --nodes=1 sstattest.sm
Submitted batch job 194
$ cat slurm-194.out 
1: compute-6-1.local
0: compute-6-1.local
       JobID  MaxVMSize     MaxRSS 
------------ ---------- ---------- 
194.batch       271488K      3724K 


But running two tasks on two separate nodes does not work:

$ sbatch --ntasks=2 --nodes=2 sstattest.sm
Submitted batch job 195
$ cat slurm-195.out 
0: compute-6-1.local
1: compute-6-2.local
       JobID  MaxVMSize     MaxRSS 
------------ ---------- ---------- 
sstat: error: slurm_job_step_stat: there was an error with the request to c6-2 
rc = Invalid job id specified
sstat: error: problem getting step_layout for 195.4294967294: Invalid job id 
specified


The slurmd.log on c6-1 contains:

[2012-07-03T17:08:32] [193] Handling REQUEST_INFO
[2012-07-03T17:08:32] debug:  Entering stepd_stat_jobacct for job 193.4294967294
[2012-07-03T17:08:32] [193] Handling REQUEST_STEP_STAT
[2012-07-03T17:08:32] [193] _handle_stat_jobacct for job 193.4294967294
[2012-07-03T17:08:32] [193] Job 193 memory used:3720 limit:1048576 KB
[2012-07-03T17:08:32] [193] Handling REQUEST_STEP_LIST_PIDS
[2012-07-03T17:08:32] [193] _handle_list_pids for job 193.4294967294
[...]
[2012-07-03T17:10:21] [194] Handling REQUEST_INFO
[2012-07-03T17:10:21] debug:  Entering stepd_stat_jobacct for job 194.4294967294
[2012-07-03T17:10:21] [194] Handling REQUEST_STEP_STAT
[2012-07-03T17:10:21] [194] _handle_stat_jobacct for job 194.4294967294
[2012-07-03T17:10:21] [194] Job 194 memory used:3724 limit:2097152 KB
[2012-07-03T17:10:21] [194] Handling REQUEST_STEP_LIST_PIDS
[2012-07-03T17:10:21] [194] _handle_list_pids for job 194.4294967294

(but nothing similar for job 195)


The slurmd.log on c6-2 contains:

[2012-07-03T17:11:02] [195.0] task 1 (8002) exited with exit code 0.
[2012-07-03T17:11:02] [195.0] task_post_term: 195.0, task 0
[2012-07-03T17:11:02] [195.0] Waiting for IO
[2012-07-03T17:11:02] [195.0] Closing debug channel
[2012-07-03T17:11:02] [195.0] IO handler exited, rc=0
[2012-07-03T17:11:02] [195.0] Message thread exited
[2012-07-03T17:11:02] [195.0] done with job
[2012-07-03T17:11:02] error: stat_jobacct for invalid job_id: 195
[2012-07-03T17:11:02] debug:  _rpc_terminate_job, uid = 501
[2012-07-03T17:11:02] debug:  task_slurmd_release_resources: 195


As far as we understand, the batch step only runs on the first node of the
job (c6-1 in this case), so it is not clear to us why the request is sent
to c6-2 at all.  Is there something wrong here, or are we doing something
wrong?
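
One stop-gap would be to discard sstat's stderr so that these error
messages at least stay out of the users' output files, along the lines of
the sketch below (assuming the errors go to stderr; the batch-step numbers
would still be missing, so this only hides the symptom):

  sstat -j $SLURM_JOB_ID.batch -o jobid,maxvmsize,maxrss 2>/dev/null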

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
