Dear Wien users,

The WIEN2k 18.2 I use is compiled on a shared-memory cluster with the Intel compiler 2019, MKL 2019, and Intel MPI 2019. Because 'srun' does not give correct parallel calculations on this system, I commented out the line

    #setenv WIEN_MPIRUN "srun -K -N _nodes_ -n _NP_ -r _offset_ _PINNING_ _EXEC_"

in the parallel_options file and used the second choice,

    mpirun='mpirun -np _NP_ _EXEC_'
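For reference, the relevant part of my parallel_options now looks roughly like this (a sketch in the stock csh syntax of that file; the surrounding settings such as USE_REMOTE are the defaults from my install and may differ on other systems):

```shell
# parallel_options (csh syntax) -- sketch of the change described above
# srun launcher commented out, since srun does not work correctly here:
#setenv WIEN_MPIRUN "srun -K -N _nodes_ -n _NP_ -r _offset_ _PINNING_ _EXEC_"
# plain mpirun launcher used instead:
setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
```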
Parallel jobs run fine in the SCF cycle. But when I increase the number of k-points (to about 5000) to calculate the DOS, lapw1 crashes halfway through, killed by the cgroup out-of-memory handler. That is very strange: with the same parameters, the job runs fine on a single core. A similar problem occurs in the nlvdw_mpi step. I also increased the memory limit to 50 GB for this cell of fewer than 10 atoms, but it still did not work.

Parallel job output:

    starting parallel lapw1 at lun. mai 11 16:24:48 CEST 2020
    ->  starting parallel LAPW1 jobs at lun. mai 11 16:24:48 CEST 2020
    running LAPW1 in parallel mode (using .machines)
    1 number_of_parallel_jobs
    [1] 12604
    [1]  + Done  ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
    lame25 lame25 lame25 lame25 lame25 lame25 lame25 lame25(5038) 4641.609u 123.862s 10:00.69 793.3% 0+0k 489064+2505080io 7642pf+0w
    Summary of lapw1para:
    lame25    k=0    user=0    wallclock=0
    **  LAPW1 crashed!
    4643.674u 126.539s 10:03.50 790.4% 0+0k 490512+2507712io 7658pf+0w
    error: command /home/mcsete/work/wma/Package/wien2k.18n/lapw1para lapw1.def failed
    slurmstepd: error: Detected 1 oom-kill event(s) in step 86112.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Single-core output:

    LAPW1 END
    11651.205u 178.664s 3:23:49.07 96.7% 0+0k 19808+22433688io 26pf+0w

Do you have any ideas? Thank you in advance!

Best regards,
Liang
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html