I understand this problem more fully now.

Certain jobs that our users run fork processes in such a way that the
parent PID gets set to 1 (i.e. they are reparented to init). The
_get_offspring_data function in jobacct_gather/linux ignores these
processes when adding up memory usage.
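
To illustrate what I mean (this is not the actual plugin code, and the
names below are mine): a PPID-based walk over /proc only picks up
processes whose parent chain still leads back to the step, so anything
that has been reparented to init falls out of the total.

/* Illustrative sketch only, not the real _get_offspring_data: collect
 * descendants of a step by walking parent PIDs in /proc.  A process
 * that daemonizes gets reparented to PID 1, so its parent chain never
 * reaches the step and its memory is silently left out of the sum. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the parent PID of "pid" from /proc/<pid>/stat, or -1 on error. */
static pid_t get_ppid(pid_t pid)
{
	char path[64], buf[512];
	int ppid = -1;
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int) pid);
	if (!(fp = fopen(path, "r")))
		return -1;
	if (fgets(buf, sizeof(buf), fp)) {
		/* the comm field is in parentheses and may contain spaces,
		 * so parse from the last ')': "<state> <ppid> ..." */
		char *p = strrchr(buf, ')');
		if (p)
			sscanf(p + 1, " %*c %d", &ppid);
	}
	fclose(fp);
	return ppid;
}

/* Follow the parent chain from "pid"; return 1 if it reaches "ancestor". */
static int descends_from(pid_t pid, pid_t ancestor)
{
	while (pid > 1) {
		if (pid == ancestor)
			return 1;
		pid = get_ppid(pid);
	}
	return 0;	/* chain ended at init: a reparented process lands here */
}

int main(int argc, char **argv)
{
	pid_t step_pid = (argc > 1) ? atoi(argv[1]) : getpid();
	struct dirent *de;
	DIR *dir = opendir("/proc");

	while (dir && (de = readdir(dir))) {
		pid_t pid = atoi(de->d_name);
		if (pid > 0 && descends_from(pid, step_pid))
			printf("would account pid %d\n", (int) pid);
		/* a pid whose chain now ends at init is never printed */
	}
	if (dir)
		closedir(dir);
	return 0;
}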

It seems like if proctrack/cgroup is enabled, the jobacct_gather/linux
plugin should rely on the cgroup.procs file to identify the PIDs instead
of trying to reconstruct the process tree from parent PIDs. Is something
like that reasonable?
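
Rough sketch of what I'm picturing (the cgroup path below is just a
made-up example; the real plugin would get the step's path from
proctrack/cgroup):

/* Sketch: enumerate the step's PIDs from its cgroup.procs file and sum
 * their resident set sizes.  The default path below is hypothetical;
 * the plugin would obtain the real one from proctrack/cgroup. */
#include <stdio.h>
#include <stdlib.h>

/* Read VmRSS (in kB) from /proc/<pid>/status; returns 0 if unavailable. */
static unsigned long get_rss_kb(int pid)
{
	char path[64], line[256];
	unsigned long rss = 0;
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/%d/status", pid);
	if (!(fp = fopen(path, "r")))
		return 0;
	while (fgets(line, sizeof(line), fp))
		if (sscanf(line, "VmRSS: %lu kB", &rss) == 1)
			break;
	fclose(fp);
	return rss;
}

int main(int argc, char **argv)
{
	/* hypothetical location of the batch step's cgroup.procs */
	const char *procs = (argc > 1) ? argv[1] :
		"/sys/fs/cgroup/freezer/slurm/uid_1000/job_12345/step_batch/cgroup.procs";
	unsigned long total_kb = 0;
	FILE *fp = fopen(procs, "r");
	int pid;

	if (!fp) {
		perror(procs);
		return 1;
	}
	/* every PID listed in the cgroup is accounted, whether its parent
	 * is still in the step or has become init */
	while (fscanf(fp, "%d", &pid) == 1)
		total_kb += get_rss_kb(pid);
	fclose(fp);

	printf("step RSS: %lu kB\n", total_kb);
	return 0;
}

That would make step_batch behave the same as step_0, since the
accounting would no longer depend on the parent chain.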

Andy

On Tue, Jul 30, 2013 at 10:59:56AM -0700, Andy Wettstein wrote:
> 
> Hi,
> 
> I have the following set:
> 
> ProctrackType           = proctrack/cgroup
> TaskPlugin              = task/cgroup
> JobAcctGatherType       = jobacct_gather/linux
> 
> This is on slurm 2.5.7.
> 
> When I use sstat on all running jobs, there are a large number of jobs
> that say they have no steps running (for example: sstat: error: couldn't
> get steps for job 4783548).
> 
> This seems to be the case for all steps that use the step_batch cgroup.
> If the step gets created in something like step_0, everything seems to
> be reported ok. In both instances, the PIDs are actually listed in the
> right cgroup.procs file.
> 
> I noticed this because there were several jobs that should have been
> killed due to memory limits, but were not. The jobacct_gather plugin
> doesn't know about the processes in the step_batch cgroup so it doesn't
> count the memory usage.
> 
> 
> Andy

-- 
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
