Re: [systemd-devel] systemd debug out of memory

2017-03-30 Thread Lennart Poettering
On Sun, 05.03.17 15:59, Pascal Kolijn (p.kol...@vu.nl) wrote:

> Peace,
> 
> On 28/02/2017 16:00, Lennart Poettering wrote:
> > On Tue, 28.02.17 13:26, Pascal kolijn (p.kol...@vu.nl) wrote:
> > 
> >> Hi List,
> >>
> >> I've subscribed to this list to ask for help in debugging a problem we
> >> seem to have with the socket activated telnetd on a rhel7 system.
> >>
> >> A default install of telnetd collects data from some small boxes
> >> deployed in the field. It works for a long time and then suddenly:
> >>
> >> Feb 26 17:46:53 bibr systemd: Created slice user-6006.slice.
> >> Feb 26 17:46:53 bibr systemd: Starting user-6006.slice.
> >> Feb 26 17:46:53 bibr systemd: Started Session 2223341 of user .
> >> Feb 26 17:46:53 bibr systemd-logind: New session 2223341 of user .
> >> Feb 26 17:46:53 bibr systemd: Starting Session 2223341 of user .
> >> Feb 26 17:46:53 bibr systemd: Started Telnet Server (:28830).
> >> Feb 26 17:46:53 pbibr001 systemd: Starting Telnet Server (:28830)...
> >> Feb 26 17:46:57 bibr systemd: Failed to fork: Cannot allocate memory
> > 
> > Hmm, Linux fork() returns ENOMEM if the maximum number of tasks on the
> > system is hit (yes this is a bit misleading, but that's how it is).
> > That max number of tasks is limited for example by the max number of
> > assignable pids as configured in /proc/sys/kernel/pid_max? Maybe you
> > hit that limit? Maybe something is leaking pids on your system? not
> > reaping zombies properly?
> 
> As far as I can determine running out of pids is not the issue, as I can
> see pids being reused in a day, which will not say that some may still
> go missing over time, but how do I determine if that is the case...?

Well, the "Failed to fork" log message is generated when "fork()"
returns ENOMEM, i.e. this is generated by a syscall, i.e. it's an
allocation of some object in the kernel that fails, not of userspace...

> What I do see is that the rss of the systemd process is slowly growing
> over time in the production environment. I've not been able (yet) to
> reproduce the situation in a test environment, which is a pity. I think
> I can simulate the telnet connects more accurately after I speak with
> the developer of the said boxes, and see if I can create a reproducible
> situation.

Well, if it grows without bounds then this would be a leak. But if it
grows only a bit during runtime that's not a problem: in-process
memory management generally doesn't return all memory back to the
kernel immediately, but keeps an allocation cache. In fact, our
hashtable implementation in systemd (which is heavily used) implements
an extra allocation cache on top. Memory in that cache is not actually
leaked, even though it is not returned to the OS: it is simply kept
around for next time we need a similar object.

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd debug out of memory

2017-03-10 Thread Michal Sekletar
On Sun, Mar 5, 2017 at 3:59 PM, Pascal Kolijn  wrote:
> Peace,
>
> On 28/02/2017 16:00, Lennart Poettering wrote:
>> On Tue, 28.02.17 13:26, Pascal kolijn (p.kol...@vu.nl) wrote:
>>
>>> Hi List,
>>>
>>> I've subscribed to this list to ask for help in debugging a problem we
>>> seem to have with the socket activated telnetd on a rhel7 system.
>>>
>>> A default install of telnetd collects data from some small boxes
>>> deployed in the field. It works for a long time and then suddenly:
>>>
>>> Feb 26 17:46:53 bibr systemd: Created slice user-6006.slice.
>>> Feb 26 17:46:53 bibr systemd: Starting user-6006.slice.
>>> Feb 26 17:46:53 bibr systemd: Started Session 2223341 of user .
>>> Feb 26 17:46:53 bibr systemd-logind: New session 2223341 of user .
>>> Feb 26 17:46:53 bibr systemd: Starting Session 2223341 of user .
>>> Feb 26 17:46:53 bibr systemd: Started Telnet Server (:28830).
>>> Feb 26 17:46:53 pbibr001 systemd: Starting Telnet Server (:28830)...
>>> Feb 26 17:46:57 bibr systemd: Failed to fork: Cannot allocate memory
>>
>> Hmm, Linux fork() returns ENOMEM if the maximum number of tasks on the
>> system is hit (yes this is a bit misleading, but that's how it is).
>> That max number of tasks is limited for example by the max number of
>> assignable pids as configured in /proc/sys/kernel/pid_max? Maybe you
>> hit that limit? Maybe something is leaking pids on your system? not
>> reaping zombies properly?
>
> As far as I can determine running out of pids is not the issue, as I can
> see pids being reused in a day, which will not say that some may still
> go missing over time, but how do I determine if that is the case...?
>
> What I do see is that the rss of the systemd process is slowly growing
> over time in the production environment. I've not been able (yet) to
> reproduce the situation in a test environment, which is a pity. I think
> I can simulate the telnet connects more accurately after I speak with
> the developer of the said boxes, and see if I can create a reproducible
> situation.
>
>>> Feb 26 17:46:57 bibr systemd: Assertion 'pid >= 1' failed at
>>> src/core/unit.c:1996, function unit_watch_pid(). Aborting.
>>> Feb 26 17:46:57 bibr001 systemd: Caught , cannot fork for core
>>> dump: Cannot allocate memory
>>> Feb 26 17:46:57 bibr systemd: Freezing execution.
>>
>> So this is definitely a bug. If the limit is hit, we hould certainly
>> not hit an assert. I tried to figure out how this could ever happen,
>> but afaics this should not be possible on current git at least. Any
>> chance you can try to reproduce this isue with something more recent
>> than a rhel7 box?
>
> Hmmm, the version we currently use in production is:
>
> # rpm -qa | grep systemd
> systemd-libs-219-19.el7_2.13.x86_64
> systemd-219-19.el7_2.13.x86_64
> systemd-sysv-219-19.el7_2.13.x86_64

I've backported bunch of fixes for memory leaks to
systemd-219-19.el7_2.14. From changelog,

* Mon Aug 22 2016 Lukas Nykryn  - 219-19.14
- core: fix memory leak on set-default, enable, disable etc (#1331667)
- nspawn: fix minor memory leak (#1331667)
- basic: fix error/memleak in socket-util (#1331667)
- core: fix memory leak in manager_run_generators() (#1331667)
- modules-load: fix memory leak (#1331667)
- core: fix memory leak on failed preset-all (#1331667)
- sd-bus: fix memory leak in test-bus-chat (#1331667)
- core: fix memory leak in transient units (#1331667)

Fix is in the code path that is hit everytime you log onto the box,
because every session has its own scope unit.

- bus: fix leak in error path (#1331667)
- shared/logs-show: fix memleak in add_matches_for_unit (#1331667)

>
> I think I can update it to the current state in 7.3 for the production
> machine, but will be reluctant to go for a more recent version...

Those fixes are of course included in 7.3 as well.

Michal

>
> Maybe in the test env, if I can reproduce it there.
>
>> Either way it appears that there's both a bug on your setup and in
>> systemd: something leaks processes (which is bug #1, in your setup)
>> and then systemd doesn't deal properly with that (which is bug #2, in
>> systemd upstream)...
>>
>> Lennart
>>
>
> Pascal Kolijn
> Vrije Universiteit Amsterdam
> ___
> systemd-devel mailing list
> systemd-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/systemd-devel
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd debug out of memory

2017-03-05 Thread Pascal Kolijn
Peace,

On 28/02/2017 16:00, Lennart Poettering wrote:
> On Tue, 28.02.17 13:26, Pascal kolijn (p.kol...@vu.nl) wrote:
> 
>> Hi List,
>>
>> I've subscribed to this list to ask for help in debugging a problem we
>> seem to have with the socket activated telnetd on a rhel7 system.
>>
>> A default install of telnetd collects data from some small boxes
>> deployed in the field. It works for a long time and then suddenly:
>>
>> Feb 26 17:46:53 bibr systemd: Created slice user-6006.slice.
>> Feb 26 17:46:53 bibr systemd: Starting user-6006.slice.
>> Feb 26 17:46:53 bibr systemd: Started Session 2223341 of user .
>> Feb 26 17:46:53 bibr systemd-logind: New session 2223341 of user .
>> Feb 26 17:46:53 bibr systemd: Starting Session 2223341 of user .
>> Feb 26 17:46:53 bibr systemd: Started Telnet Server (:28830).
>> Feb 26 17:46:53 pbibr001 systemd: Starting Telnet Server (:28830)...
>> Feb 26 17:46:57 bibr systemd: Failed to fork: Cannot allocate memory
> 
> Hmm, Linux fork() returns ENOMEM if the maximum number of tasks on the
> system is hit (yes this is a bit misleading, but that's how it is).
> That max number of tasks is limited for example by the max number of
> assignable pids as configured in /proc/sys/kernel/pid_max? Maybe you
> hit that limit? Maybe something is leaking pids on your system? not
> reaping zombies properly?

As far as I can determine running out of pids is not the issue, as I can
see pids being reused in a day, which will not say that some may still
go missing over time, but how do I determine if that is the case...?

What I do see is that the rss of the systemd process is slowly growing
over time in the production environment. I've not been able (yet) to
reproduce the situation in a test environment, which is a pity. I think
I can simulate the telnet connects more accurately after I speak with
the developer of the said boxes, and see if I can create a reproducible
situation.

>> Feb 26 17:46:57 bibr systemd: Assertion 'pid >= 1' failed at
>> src/core/unit.c:1996, function unit_watch_pid(). Aborting.
>> Feb 26 17:46:57 bibr001 systemd: Caught , cannot fork for core
>> dump: Cannot allocate memory
>> Feb 26 17:46:57 bibr systemd: Freezing execution.
> 
> So this is definitely a bug. If the limit is hit, we hould certainly
> not hit an assert. I tried to figure out how this could ever happen,
> but afaics this should not be possible on current git at least. Any
> chance you can try to reproduce this isue with something more recent
> than a rhel7 box?

Hmmm, the version we currently use in production is:

# rpm -qa | grep systemd
systemd-libs-219-19.el7_2.13.x86_64
systemd-219-19.el7_2.13.x86_64
systemd-sysv-219-19.el7_2.13.x86_64

I think I can update it to the current state in 7.3 for the production
machine, but will be reluctant to go for a more recent version...

Maybe in the test env, if I can reproduce it there.

> Either way it appears that there's both a bug on your setup and in
> systemd: something leaks processes (which is bug #1, in your setup)
> and then systemd doesn't deal properly with that (which is bug #2, in
> systemd upstream)...
> 
> Lennart
> 

Pascal Kolijn
Vrije Universiteit Amsterdam
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd debug out of memory

2017-02-28 Thread Lennart Poettering
On Tue, 28.02.17 13:26, Pascal kolijn (p.kol...@vu.nl) wrote:

> Hi List,
> 
> I've subscribed to this list to ask for help in debugging a problem we
> seem to have with the socket activated telnetd on a rhel7 system.
> 
> A default install of telnetd collects data from some small boxes
> deployed in the field. It works for a long time and then suddenly:
> 
> Feb 26 17:46:53 bibr systemd: Created slice user-6006.slice.
> Feb 26 17:46:53 bibr systemd: Starting user-6006.slice.
> Feb 26 17:46:53 bibr systemd: Started Session 2223341 of user .
> Feb 26 17:46:53 bibr systemd-logind: New session 2223341 of user .
> Feb 26 17:46:53 bibr systemd: Starting Session 2223341 of user .
> Feb 26 17:46:53 bibr systemd: Started Telnet Server (:28830).
> Feb 26 17:46:53 pbibr001 systemd: Starting Telnet Server (:28830)...
> Feb 26 17:46:57 bibr systemd: Failed to fork: Cannot allocate memory

Hmm, Linux fork() returns ENOMEM if the maximum number of tasks on the
system is hit (yes this is a bit misleading, but that's how it is).
That max number of tasks is limited for example by the max number of
assignable pids as configured in /proc/sys/kernel/pid_max? Maybe you
hit that limit? Maybe something is leaking pids on your system? not
reaping zombies properly?

> Feb 26 17:46:57 bibr systemd: Assertion 'pid >= 1' failed at
> src/core/unit.c:1996, function unit_watch_pid(). Aborting.
> Feb 26 17:46:57 bibr001 systemd: Caught , cannot fork for core
> dump: Cannot allocate memory
> Feb 26 17:46:57 bibr systemd: Freezing execution.

So this is definitely a bug. If the limit is hit, we hould certainly
not hit an assert. I tried to figure out how this could ever happen,
but afaics this should not be possible on current git at least. Any
chance you can try to reproduce this isue with something more recent
than a rhel7 box?

Either way it appears that there's both a bug on your setup and in
systemd: something leaks processes (which is bug #1, in your setup)
and then systemd doesn't deal properly with that (which is bug #2, in
systemd upstream)...

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel