On Tue, 29 Nov 2011 10:58:42 -0700, Moe Jette <[email protected]> wrote:
> suspend/resume uses SIGSTOP/SIGCONT to all spawned processes at the  
> same time. I do not know the details of openMP, but it may have  
> difficulties with message handling when jobs get suspended.

You also want to make sure your user is checking the return value
of fwrite(3). I could imagine fwrite returning a short item count
when a signal arrives mid-write, and if the program assumes everything
was written, truncated lines are the least you would see.
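
As a rough sketch of what that check could look like (write_record is a
made-up helper name, not something from the user's program, and I am
assuming each time step is written as one text line), the idea is to
compare the count fwrite returns against what was requested and to check
fflush as well, so a short or failed write gets noticed instead of
silently producing a cut line:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write one text record and verify that all of it
 * was accepted. Returns 0 on success, -1 if fwrite reported fewer bytes
 * than requested or if flushing the stream failed.
 */
static int write_record(FILE *fp, const char *line)
{
    size_t len = strlen(line);

    /* With an item size of 1, fwrite returns the number of bytes written. */
    size_t written = fwrite(line, 1, len, fp);
    if (written != len) {
        return -1;  /* short write: the caller must not assume the line is complete */
    }

    /* Buffered data can still fail on its way out, so check the flush too. */
    if (fflush(fp) != 0) {
        return -1;
    }

    return 0;
}
```

This only detects the problem. Whether to retry, abort, or log on a -1
is up to the program.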

mark

 
> Quoting Matteo Guglielmi <[email protected]>:
> 
> > Hi,
> >
> > I have a user who says:
> >
> >> I am seeing some very strange behaviour.
> >
> >> Preamble: my jobs save tiny bits of information at each time step - just
> >> a few bytes long, essentially the step number and a few numbers related
> >> to the state of the simulation.
> >
> >> Recently, I have realized that some percentage of these output files are
> >> corrupt. They have the correct number of lines and so on, but some lines
> >> are cut - they are missing a piece of the beginning of the line, or the
> >> middle, etc.
> >
> >> This has never happened before, so I am hesitant to blame the program's
> >> simple "fwrite" call just yet. What I suspect is that this may be
> >> happening when jobs get STOPPED.
> >
> > How does an openMP job get suspended in slurm?
> >
> > How do threads get suspended?
> >
> > Do you think that this problem can be related to the way that slurm
> > suspends openMP jobs?
> >
> >
> > Thanks,
> >
> > --matt
> >
> 
> 
> 
