Hi, I have a user who says:
>I am seeing some very strange behaviour.
>Preamble: my jobs save tiny bits of information at each time step - just
>a few bytes long, essentially the step number and a few numbers related
>to the state of the simulation.
>Recently, I have realized that some percentage of these output files are
>corrupt. They have the correct number of lines and so on, but some lines
>are cut - they are missing a piece of the beginning of the line, or the
>middle, etc.
>This has never happened before, so I am hesitant to blame the program's
>simple "fwrite" call just yet. What I suspect is that this may be
>happening when jobs get STOPPED.

How does Slurm suspend an OpenMP job? How do the threads get suspended? Do you think this problem could be related to the way Slurm suspends OpenMP jobs?

Thanks,
--matt
