On Tue, 29 Nov 2011 10:58:42 -0700, Moe Jette <[email protected]> wrote:
> suspend/resume uses SIGSTOP/SIGCONT to all spawned processes at the
> same time. I do not know the details of openMP, but it may have
> difficulties with message handling when jobs get suspended.

You also want to make sure your user is checking the return code of
fwrite(3). I could imagine that fwrite can return short writes when
handling signals, and if you assume everything was written you can get
truncated lines at the very least.

mark

> Quoting Matteo Guglielmi <[email protected]>:
>
> > Hi,
> >
> > I have a user who says:
> >
> >> I am seeing some very strange behaviour.
> >
> >> Preamble: my jobs save tiny bits of information at each time step - just
> >> a few bytes long, essentially the step number and a few numbers related
> >> to the state of the simulation.
> >
> >> Recently, I have realized that some percentage of these output files are
> >> corrupt. They have the correct number of lines and so on, but some lines
> >> are cut - they are missing a piece of the beginning of the line, or the
> >> middle, etc.
> >
> >> This has never happened before, so I am hesitant to blame the program's
> >> simple "fwrite" call just yet. What I suspect is that this may be
> >> happening when jobs get STOPPED.
> >
> > How does an openMP job get suspended in slurm?
> >
> > How do threads get suspended?
> >
> > Do you think that this problem can be related to the way that slurm
> > suspends openMP jobs?
> >
> >
> > Thanks,
> >
> > --matt
> >
> >
>
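
To make the fwrite(3) point concrete, here is a minimal C sketch of what
checking the return code could look like. It assumes the program writes
one short text line per time step; write_all() and write_step() are
hypothetical names for illustration, not anything from the user's code,
and whether a retry on EINTR is actually needed in this case is an
assumption rather than a confirmed diagnosis.

```c
#include <errno.h>
#include <stdio.h>

/* Write exactly `len` bytes from `buf` to `fp`.
 * Returns 0 on success, -1 on an error other than EINTR. */
int write_all(FILE *fp, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        /* With an item size of 1, the return value is a byte count. */
        size_t n = fwrite(p, 1, left, fp);
        p += n;
        left -= n;
        if (left > 0) {
            /* Short write: retry only if a signal interrupted it. */
            if (ferror(fp) && errno == EINTR) {
                clearerr(fp);
                continue;
            }
            return -1;
        }
    }
    return 0;
}

/* Hypothetical per-time-step writer: format one line, write all of it,
 * then flush so no partial line is left sitting in the stdio buffer. */
int write_step(FILE *fp, long step, double value)
{
    char line[128];
    int n = snprintf(line, sizeof line, "%ld %.17g\n", step, value);

    if (n < 0 || (size_t)n >= sizeof line)
        return -1;                      /* formatting failed or truncated */
    if (write_all(fp, line, (size_t)n) != 0)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;    /* flush errors count as failures too */
}
```

The caller would check the result of write_step() at every time step and
report the failure (for example with perror()) instead of carrying on, so
that a truncated line shows up as an error rather than as silent
corruption in the output file.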
