There is a list of potential exit code 9 (KILLED BY SIGNAL: 9) causes at [1].
Hitting the walltime limit (--time [2,3]) is listed as one of them.
The Slurm seff command might be helpful for determining whether the job
was killed because it ran out of memory (OOM). Refer to [4,5].
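A minimal check, assuming the Slurm job ID is known (the job ID below is
hypothetical):
-----------------
# Compare requested vs. used memory for a finished job
seff 1234567

# Per-step detail: peak resident memory (MaxRSS), final state and exit code
sacct -j 1234567 --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem,Elapsed
-----------------
If MaxRSS is close to ReqMem, or the state is OUT_OF_MEMORY, the kill was
most likely due to OOM rather than the walltime limit.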
[1] https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-6/error-message-bad-termination.html
[2] https://docs.hpc.uwec.edu/slurm/determining-resources/#time-walltime
[3] https://hpcc.umd.edu/hpcc/help/jobs.html#walltime
[4] https://www.nsc.liu.se/support/memory-management/
[5] https://documentation.sigma2.no/jobs/choosing-memory-settings.html
Hope that can help,
Gavin
WIEN2k user
On 1/24/2025 8:40 AM, Laurence Marks wrote:
Sorry, but you have not provided enough information for more than a guess.
Exit code 9 means the OS killed the task, often because of out of memory
(OOM), but not necessarily. The larger calculation will require roughly
8*8 = 64 times more memory (perhaps more) than your simple calculation,
since the matrix dimension grows by about a factor of 8 and dense-matrix
memory grows with the square of that dimension: do
"grep "Matrix size" *output1* -18". You probably ran out of memory, and
will need to use more MPI processes per k-point for the larger calculation.
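A rough way to turn that number into a memory estimate (assumption:
complex*16 matrices, i.e. 16 bytes per element; lapw1 holds several such
matrices per k-point job, distributed over that job's MPI ranks):
-----------------
# See the matrix dimension that lapw1 reports for each k-point job
grep "Matrix size" *output1*

# Back-of-the-envelope estimate for a hypothetical dimension of 20000:
# one dense complex*16 matrix alone is 20000*20000*16 bytes ~ 6.4 GB
awk 'BEGIN{n=20000; printf "~%.1f GB per dense matrix\n", n*n*16/1e9}'
-----------------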
N.B., using 2 OpenMP threads per MPI task is also useful in reducing the
total memory usage. Combine this with MPI.
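One possible layout for the two-node case, shown as a .machines sketch.
This assumes 48 cores per node, node names n052/n053 (only n053 appears in
the log, the other name is hypothetical), and a recent WIEN2k version that
accepts omp_global in .machines; with older versions, export
OMP_NUM_THREADS=2 in the job script instead. The counts are illustrative,
not a prescription:
-----------------
# 8 k-point parallel jobs, each with 6 MPI ranks and 2 OpenMP threads
# (6*2 = 12 cores per job, 4 jobs per 48-core node, 2 nodes in total)
1:n052:6
1:n052:6
1:n052:6
1:n052:6
1:n053:6
1:n053:6
1:n053:6
1:n053:6
granularity:1
extrafine:1
omp_global:2
-----------------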
---
Emeritus Professor Laurence Marks (Laurie)
www.numis.northwestern.edu
https://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en
"Research is to see what everybody else has seen, and to think what
nobody else has thought" Albert Szent-Györgyi
On Fri, Jan 24, 2025, 07:46 Sergeev Gregory <[email protected]> wrote:
Dear developers,
I run my calculations on an HPC cluster with Slurm, and I see strange
behaviour of parallel WIEN2k jobs.
I have two structures:
1. A structure with 8 atoms in the unit cell (simple structure)
2. A 2*2*2 supercell with 64 atoms (supercell structure), based on the
cell of the simple structure
I run the WIEN2k calculations in parallel mode with two configurations
(a sketch of the corresponding batch header follows below):
1. Calculations on 1 node (one node has 48 processors) with 12 parallel
jobs and 4 processors per job (one node job)
2. Calculations on 2 nodes (2*48 = 96 processors) with 24 parallel jobs
and 4 processors per job (two node job)
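For reference, a minimal sketch of what the two-node batch header could
look like, assuming one MPI rank per core and memory requested per CPU
(the partition is omitted and the memory value is hypothetical):
-----------------
#!/bin/bash
#SBATCH --job-name=wien2k_supercell
#SBATCH --nodes=2                # two 48-core nodes
#SBATCH --ntasks-per-node=48     # one MPI rank per core
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G         # hypothetical value
#SBATCH --time=24:00:00
# build .machines from $SLURM_JOB_NODELIST, then run e.g.: run_lapw -p
-----------------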
For "simple structure" "one node job" and "two node job" work
without problems.
For "supercell structure" "one node job" works well, but "two node
job" crashs with errors in .time1_* files (I use Intel MPI):
-----------------
n053 n053 n053 n053(21)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 21859 RUNNING AT n053
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
0.042u 0.144s 2:45.42 0.1% 0+0k 4064+8io 60pf+0w
-----------------
At first I thought there was a problem with insufficient memory in the
two node job (but why, if the one node job works with the same number of
processors per parallel job?). I tried to double the memory per task
(#SBATCH --cpus-per-task 2), but this did not solve the problem. Same
error.
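Whether --cpus-per-task actually doubles the memory depends on how the
memory request is expressed; a minimal sketch of the two cases (the values
are hypothetical):
-----------------
# Memory follows the CPU count only when it is requested per CPU:
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4G     # limit becomes 2 * 4G = 8G per task

# With a per-node request, adding CPUs does not change the memory limit:
##SBATCH --mem=180G          # fixed per node, independent of --cpus-per-task
-----------------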
Any ideas why this strange behaviour occurs?
Does WIEN2k have problems scaling to multiple nodes?
I would appreciate your help. I want to speed up calculations for complex
structures, and I have the resources, but so far I cannot get it to work.
_______________________________________________
Wien mailing list
[email protected]
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/[email protected]/index.html