Ben De Luca <[email protected]> writes:

> I was wondering how people deal with OOM conditions on their cluster.
> We constantly have machines that die because the OOM killer takes out
> critical system services.
>
> Has anyone any experience with the oom_adj proc value, or a patch to
> grid to support it?

I second the advice about controlling the memory used by jobs. However, for what it's worth, OOM adjustment should be straightforward with the planned loadable module support in the shepherd; SLURM has a module to do it. It needs to be done there because it's a privileged operation, unless there's some reasonably safe way to do it with an suid starter method.

--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
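
For illustration only, here is a minimal sketch of the starter-method route. It assumes a Linux kernel that provides /proc/self/oom_score_adj (the successor to oom_adj, range -1000 to 1000) and that the queue's starter_method is invoked with the job command and its arguments. The wrapper itself, its function names, and the value 500 are hypothetical and not part of Grid Engine. As far as I know, only lowering the score needs privilege (CAP_SYS_RESOURCE); raising it, so that jobs are killed in preference to system services, does not, which would avoid the suid question for this particular direction.

```python
#!/usr/bin/env python3
"""Hypothetical starter-method wrapper: raise the job's OOM score, then exec it."""
import os
import sys

# Assumed value: positive scores make the OOM killer prefer this process.
JOB_OOM_SCORE_ADJ = 500


def clamp_oom_score_adj(value):
    """Clamp to the range the kernel accepts for oom_score_adj."""
    return max(-1000, min(1000, int(value)))


def set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Write the adjustment for the current process; children inherit it."""
    with open(path, "w") as f:
        f.write(str(clamp_oom_score_adj(value)))


def main(argv):
    try:
        set_oom_score_adj(JOB_OOM_SCORE_ADJ)
    except OSError as err:
        # Not fatal: run the job anyway, just without the adjustment.
        print("warning: could not set oom_score_adj: %s" % err, file=sys.stderr)
    # Replace this process with the job so the setting applies to it.
    os.execvp(argv[1], argv[1:])


# Only act when a job command was actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Protecting the critical daemons themselves (a negative score) would still have to be done as root, for example when those services are started.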
