On 2017-09-05 20:40, John DeSantis wrote:
>
> Hello all,
>
> We were recently alerted by a user whose long-running jobs (>= 6 days) were
> being killed by the OOM killer.
>
> A closer look revealed that jag_common_poll_data isn't consistent in logging 
> the total memory usage,
> depending on the application in question.  For example, consider the 
> following job submission:
>
> #!/bin/bash
> #
> #SBATCH --job-name=sif14_1co2
> #SBATCH --output=runlog.log
> #SBATCH --time=48:00:00
> #SBATCH --nodes=8 --ntasks-per-node=6
> #SBATCH --partition=p2016 --qos=p16
> #SBATCH --mail-type=END
> #SBATCH --mail-user=user
>
> module purge
> module add apps/vasp/5.4.1
>
> # For standard vasp
> mpirun vasp_std
>
>
> Since no memory request was specified, the DefMemPerCPU value of 512 MB took
> effect.
>
> Looking at the batch host stats, here is what is actually being reported:
>
>   PPID    LWP PSR     TIME S   RSS     ELAPSED COMMAND            VSZ
>  28388  28394   1 00:00:00 S  1332    17:47:24 slurm_script    106116
>  28394  28403   1 00:00:00 S  1404    17:47:24 mpirun          106248
>  28403  28416  10 00:00:00 S  1664    17:47:24 mpiexec.hydra    17180
>  28416  28417   5 00:00:00 S  4996    17:47:24 srun            708424
>  28417  28421   3 00:00:00 S   668    17:47:24 srun             37712
>  28453  28458   0 00:00:00 S  1928    17:47:24 pmi_proxy        16384
>  28458  28471   4 15:55:13 R 1059004  17:47:24 vasp_std        2953636
>  28458  28472  10 17:01:43 R 1057672  17:47:24 vasp_std        2950168
>  28458  28473  20 17:35:54 R 1055412  17:47:24 vasp_std        2947896
>  28458  28474   3 17:53:39 R 1060060  17:47:24 vasp_std        2952712
>  28458  28475  11 17:51:52 R 1055168  17:47:24 vasp_std        2947772
>  28458  28476  23 17:53:42 R 1063888  17:47:24 vasp_std        2956572
>
> [2017-09-05T13:11:09.124] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:09.280] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> [2017-09-05T13:11:31.729] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:31.757] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> [2017-09-05T13:11:39.126] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:39.283] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
>
> However, if a slightly older version of the same software is used, memory 
> seems to be reported
> correctly.  Consider the submission script below.
>
> #!/bin/sh
>
> #SBATCH --time=04:00:00
> #SBATCH -N 2 --ntasks-per-node=2 -J "vasp_run"
> #SBATCH --mem-per-cpu=1800
> #SBATCH --reservation=blah
> #SBATCH -p blah
> #SBATCH -w nodes-[1-2]
>
> module purge
> module load apps/vasp/5.3.3
>
> mpirun vasp
>
> And, the reported stats:
>
>  PPID   LWP PSR     TIME S   RSS     ELAPSED COMMAND
> 22046 22050   0 00:00:00 S  1324       03:32 slurm_script
> 22050 22053   0 00:00:00 S  1476       03:32 mpirun
> 22053 22111   1 00:00:00 S  1428       03:32 mpiexec.hydra
> 22111 22112   2 00:00:00 S  4800       03:32 srun
> 22112 22113   3 00:00:00 S   668       03:32 srun
> 22121 22126   3 00:00:00 S  1352       03:32 pmi_proxy
> 22126 22127   5 00:03:30 R 1781544     03:32 vasp
> 22126 22128   0 00:03:30 R 1778052     03:32 vasp
>
>
> [2017-09-05T13:17:58.097] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:17:58.265] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22126 mem size 3538576 7561020 time 59.630000(56+2)
> [2017-09-05T13:18:28.101] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:18:28.270] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2505 pid 22126 mem size 3560928 7655316 time 119.590000(116+3)
> [2017-09-05T13:18:58.105] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:18:58.275] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2498 pid 22126 mem size 3560932 7655316 time 179.570000(175+3)
> [2017-09-05T13:19:28.109] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:19:28.281] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2495 pid 22126 mem size 3560932 7655316 time 239.550000(234+4)
> [2017-09-05T13:19:58.113] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:19:58.286] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2493 pid 22126 mem size 3560936 7655316 time 299.510000(294+5)
>
>
>
> Has anyone else noticed anything similar on their cluster(s)?  I cannot 
> confirm if this was
> happening before we upgraded from 15.08.4 to 16.05.10-2.
>
> Thanks,
> John DeSantis
>
Not knowing the specifics of your setup, AFAIK the general advice remains that
unless you launch the job via srun instead of mpirun (that is, unless you use
the Slurm MPI integration), all bets are off w.r.t. accounting. Since you seem
to be using MVAPICH2(?), make sure you compile MVAPICH2 with

./configure --with-pmi=pmi2 --with-pm=slurm
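
(If you want to double-check an existing build, MVAPICH2 ships an mpiname
utility; mpiname -a should print the configure options the library was built
with.)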


and then launch the application with

srun --mpi=pmi2 vasp_std

instead of mpirun.

(from https://slurm.schedmd.com/mpi_guide.html#mvapich2 )
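
For example, the first submission script above could be reworked along these
lines. This is only a sketch: the module name and resource requests are copied
from John's script, and the explicit --mem-per-cpu value is just an
illustration of requesting memory instead of relying on DefMemPerCPU.

#!/bin/bash
#
#SBATCH --job-name=sif14_1co2
#SBATCH --output=runlog.log
#SBATCH --time=48:00:00
#SBATCH --nodes=8 --ntasks-per-node=6
#SBATCH --partition=p2016 --qos=p16
# Ask for memory explicitly rather than falling back to DefMemPerCPU (512 MB)
#SBATCH --mem-per-cpu=1800

module purge
module add apps/vasp/5.4.1

# Let slurmstepd start the MPI ranks directly so they are tracked for accounting
srun --mpi=pmi2 vasp_std

You can check that the pmi2 plugin is available with srun --mpi=list, and
compare what gets accounted afterwards with e.g.
sstat -j <jobid> --format=JobID,MaxRSS,AveRSS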



-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
