Thomas Mainka <[email protected]> writes:

> Hi everyone,
>
>
> Currently I am trying to upgrade an old SGE cluster from a 6.2 release to
> SGE 8.1.3. The upgrade itself is no problem; however, the sge_execd of 8.1.3
> consumes a lot of CPU on the exec host when a job is running--sometimes up
> to 30% of a CPU.
>
> For a 40-minute test job (Abaqus/Standard FEA) the execd consumed around 10
> minutes of CPU time. After the job finished, the CPU consumption of the
> sge_execd dropped as well. This was on an RHEL 6.4 system, but was also
> observed on an RHEL 5.9 system.

That's what Abaqus runs on here too, but there aren't any instances running
currently, and I haven't noticed the problem when they have been.  I haven't
seen execd usage above 1% at a 1 s sampling interval.

> After some digging with strace and seeing that sge_execd opens /proc/<pid>/...
> files every second, the root of the problem seems to be the function
> linux_read_status() in daemons/common/procfs.c, which tries to gather process
> statistics for each SGE task.
>
> This seems to be done every second rather than at load_report_time
> intervals.

There's a PDC interval setting in execd_params.
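
If the per-second polling is what's hurting, raising that interval should
help.  A sketch, assuming sge_conf(5) in your release still lists
PDC_INTERVAL under execd_params (PER_LOAD_REPORT is my assumption here --
check the man page for the values your version accepts):

  # qconf -mconf <exec_host>        (or edit the global configuration)
  execd_params    PDC_INTERVAL=PER_LOAD_REPORT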

> As the smaps file for the Abaqus process gets pretty big, sge_execd also
> spends a significant amount of time parsing it. From my small test job:
>
>   # wc -c /proc/11479/smaps
>   3108194 /proc/11479/smaps

Gosh; another reason for hating Abaqus.  I suppose that might account
for some bizarre problems users have had with memory allocation under
it, which I could probably have investigated if they used a free
program.  Perhaps it's horribly fragmented.

> From what I can see in the source, the only things really parsed from
> /proc/<pid>/smaps are the process resident size and swap usage, used to
> calculate the total size of the process, which is initialized from
> /proc/<pid>/stat.

It's the PSS, not the RSS.  They can be significantly different.
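
PSS is only exposed per-mapping in smaps, so getting it means summing one
"Pss:" line per mapping.  A minimal sketch of that summation, just to show
where the time goes (an illustration, not the actual procfs.c code):

  #include <stdio.h>
  #include <sys/types.h>

  /* Sum the per-mapping "Pss:" lines of /proc/<pid>/smaps, in kB.
   * Returns -1 if the file can't be opened.  Illustration only. */
  long smaps_pss_kb(pid_t pid)
  {
      char path[64], line[256];
      long total = 0, kb;
      FILE *fp;

      snprintf(path, sizeof path, "/proc/%d/smaps", (int)pid);
      if ((fp = fopen(path, "r")) == NULL)
          return -1;
      while (fgets(line, sizeof line, fp) != NULL)
          if (sscanf(line, "Pss: %ld kB", &kb) == 1)
              total += kb;        /* one Pss: line per mapping */
      fclose(fp);
      return total;
  }

With a 3 MB smaps that's a lot of lines to scan, for every task, every
second.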

> I think this could be retrieved far more easily from /proc/<pid>/status; I
> think the VmSwap line in /proc/<pid>/status was added in kernel 2.6.31.

PSS didn't seem to be in status when I checked the then-latest kernel
source.  I assume it's too expensive to compute there.
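
For comparison, a single cheap read of /proc/<pid>/status gives you lines
like these (values invented for illustration; note there's no Pss line):

  VmSize:  2097152 kB
  VmRSS:    524288 kB
  VmSwap:    16384 kB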

> The funny thing is that this is also done in procfs.c, but only as a fallback
> if there is no "Swap:" data in /proc/<pid>/smaps, and not as the default.

What source are you looking at?  The logic is supposed to be: use
PSS+swap if available (swap from status preferably, else from smaps);
else use RSS+swap, if available; else use VmSize.
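
In outline, with hypothetical helper names rather than the actual procfs.c
functions, the intended selection order is:

  #include <sys/types.h>

  /* Hypothetical helpers, not the actual procfs.c functions; each returns
   * a value in kB, or -1 if it isn't available on this kernel. */
  long read_smaps_pss(pid_t pid);      /* summed Pss: lines from smaps  */
  long read_smaps_swap(pid_t pid);     /* summed Swap: lines from smaps */
  long read_status_rss(pid_t pid);     /* VmRSS from status             */
  long read_status_swap(pid_t pid);    /* VmSwap from status            */
  long read_status_vmsize(pid_t pid);  /* VmSize from status            */

  /* Sketch of the selection order described above. */
  long job_usage_kb(pid_t pid)
  {
      long pss  = read_smaps_pss(pid);
      long swap = read_status_swap(pid);

      if (swap < 0)
          swap = read_smaps_swap(pid);        /* status preferred, else smaps */
      if (pss >= 0)
          return pss + (swap > 0 ? swap : 0); /* preferred: PSS + swap */
      else {
          long rss = read_status_rss(pid);
          if (rss >= 0)
              return rss + (swap > 0 ? swap : 0);  /* fallback: RSS + swap */
      }
      return read_status_vmsize(pid);              /* last resort: VmSize */
  }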

> My personal point of view is that it's quite unnecessary to burn all these
> CPU cycles every second just to have overly accurate data on process size as
> RSS + swap. If the system doesn't give it to you directly, as in
> /proc/<pid>/status, maybe it's not worth obsessing over.

RSS+swap is what people have asked for, but it's typically 10-20% larger
than PSS+swap (and the value from the cgroup memory controller, if you
use it, is different again).  That difference is roughly equivalent to an
extra process per node.  Swap isn't in status on RHEL 5, for instance, so
you need smaps there.

> So, wouldn't it be better to change the default to parse /proc/<pid>/status,
> and only enable the detailed parsing of the complete memory map (which is
> basically what /proc/<pid>/smaps is) with a configurable option that isn't
> enabled by default?

I think you have a pathological case, but I'll add an option, and see if
the parsing can be made faster.  You should be able to turn it off
easily by rebuilding with pss_in_smaps always returning false.
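
i.e. a one-line local hack along these lines in daemons/common/procfs.c
(the real declaration may differ -- treat this as a sketch of the idea,
not a patch):

  /* Pretend Pss isn't available in smaps, so the collector falls back to
   * the cheaper /proc/<pid>/status path.  Signature is a guess. */
  int pss_in_smaps(void)
  {
      return 0;    /* false */
  }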

This whole area is a can of worms.  The memory cgroup in recent Linux is
supposed to be the answer, but it has its own problems apart from
finding a clean way to configure all this stuff.

It's worth raising an issue for something like this.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/