Suspend/resume sends SIGSTOP/SIGCONT to all spawned processes at the
same time. I do not know the details of OpenMP, but it may have
difficulties with message handling when jobs get suspended.
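To illustrate the mechanism only, here is a minimal Python sketch of what
SIGSTOP/SIGCONT does to a single process. This is not Slurm's code: the
helper name suspend_and_resume and the sleeping child are made up for the
example, and it assumes a POSIX system. Since a stop signal applies to the
whole process, every thread in it (OpenMP threads included) is stopped and
later continued together.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume(child_code="import time; time.sleep(0.5)"):
    """Spawn a child, stop it with SIGSTOP, then resume it with SIGCONT.

    Hypothetical helper for illustration; returns (was_stopped, exit_code).
    """
    child = subprocess.Popen([sys.executable, "-c", child_code])

    # SIGSTOP cannot be caught or ignored; it stops the whole process,
    # i.e. all of its threads at once.
    os.kill(child.pid, signal.SIGSTOP)

    # WUNTRACED makes waitpid also report a child that has stopped,
    # not only one that has exited.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # SIGCONT lets the process carry on from where it was stopped.
    os.kill(child.pid, signal.SIGCONT)
    exit_code = child.wait()
    return was_stopped, exit_code


if __name__ == "__main__":
    print(suspend_and_resume())
```

In this sketch the child simply continues after SIGCONT and exits normally,
so a stop/continue cycle on its own should not be expected to drop bytes
from a write that is already in progress.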
Quoting Matteo Guglielmi <[email protected]>:
Hi,
I have a user who says:
I am seeing some very strange behaviour.
Preamble: my jobs save tiny bits of information at each time step - just
a few bytes long, essentially the step number and a few numbers related
to the state of the simulation.
Recently, I have realized that some percentage of these output files are
corrupt. They have the correct number of lines and so on, but some lines
are cut - they are missing a piece of the beginning of the line, or the
middle, etc.
This has never happened before, so I am hesitant to blame the program's
simple "fwrite" call just yet. What I suspect is that this may be
happening when jobs get STOPPED.
How does an OpenMP job get suspended in Slurm?
How do threads get suspended?
Do you think that this problem can be related to the way that Slurm
suspends OpenMP jobs?
Thanks,
--matt