Hi All,

This is not a Lustre problem proper, but others might run into it with a 64-bit 
Lustre client on RHEL 7, and I hope to save you the time it took us to nail it 
down.  We saw it on a node running the "Starfish" policy engine, which reads 
through the entire file system tree repeatedly and consumes changelogs.  
Starfish itself creates and destroys processes frequently, and the workload 
causes Lustre to create and destroy threads as well, by triggering statahead 
and changelog thread creation.

For the impatient, the fix was to increase pid_max.  We used:
kernel.pid_max=524288
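
To apply it on a running node and have it persist across reboots, something 
along these lines should work (sysctl -w changes the live value, sysctl -p 
re-reads /etc/sysctl.conf):

$ sudo sysctl -w kernel.pid_max=524288
$ echo 'kernel.pid_max=524288' | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p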

The symptoms are:

1) console log messages like
LustreError: 10525:0:(statahead.c:970:ll_start_agl()) can't start ll_agl 
thread, rc: -12
LustreError: 15881:0:(statahead.c:1614:start_statahead_thread()) can't start 
ll_sa thread, rc: -12
LustreError: 15881:0:(statahead.c:1614:start_statahead_thread()) Skipped 45 
previous similar messages
LustreError: 15878:0:(statahead.c:1614:start_statahead_thread()) can't start 
ll_sa thread, rc: -12
LustreError: 15878:0:(statahead.c:1614:start_statahead_thread()) Skipped 17 
previous similar messages 

Note the return codes are -12, which is -ENOMEM.

2) Attempts to create new user space processes are also intermittently failing:

sf_lustre.liblustreCmds 10983 'MainThread' : ("can't start new thread",) 
[liblustreCmds.py:216]

and

[faaland1@solfish2 lustre]$git fetch llnlstash
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev': 
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev': 
remote: Enumerating objects: 1377, done.
remote: Counting objects: 100% (1236/1236), done.
remote: Compressing objects: 100% (271/271), done.
error: cannot fork() for index-pack: Cannot allocate memory
fatal: fetch-pack: unable to fork off index-pack

We wasted a lot of time chasing the idea that this was in fact due to 
insufficient free memory on the node, but the actual problem was that sysctl 
kernel.pid_max was too low.

When a new process or thread is created via fork(), kthread_create(), or 
similar, the kernel has to allocate a PID.  It tracks which PIDs are in use 
with a bitmap (on these kernels), and there is some delay after a process is 
destroyed before its PID may be reused.
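
A rough way to see how close a node is to that limit is to compare pid_max 
with the number of live threads, since every thread (not just every process) 
holds a PID:

$ cat /proc/sys/kernel/pid_max
$ ps -eL --no-headers | wc -l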

We found that, on this node, the kernel would occasionally find no PIDs 
available while creating a new process.  Specifically, copy_process() would 
call alloc_pidmap(), which would return -1.  This tended to happen when the 
system was processing a large number of changes on the file system, so both 
Lustre and Starfish were suddenly doing a lot of work and both would have been 
creating new threads in response to the load.  This node normally runs about 
700-800 processes according to top(1).  At the time these errors occurred, I 
don't know how many processes were running or how quickly they were being 
created and destroyed.
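
We didn't capture process creation rates at the time, but in hindsight even 
something as crude as sampling the last allocated PID would have shown the 
churn (the fifth field of /proc/loadavg is the PID most recently handed out, 
and it wraps around at pid_max):

$ for i in $(seq 1 10); do awk '{print $5}' /proc/loadavg; sleep 1; done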

Ftrace (the function_graph tracer) showed this:

|        copy_namespaces();
|        copy_thread();
|        alloc_pid() {
|          kmem_cache_alloc() {
|            __might_sleep();
|            _cond_resched();
|          }
|          kmem_cache_free();
|        }
|        exit_task_namespaces() {
|          switch_task_namespaces() {

On this particular node (32 cores, x86_64, RHEL 7), pid_max was 36K.  We added
kernel.pid_max=524288
to our sysctl.conf, which resolved the issue.
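
After sysctl -p (or a reboot) the new value should be visible:

$ sysctl kernel.pid_max
kernel.pid_max = 524288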

I don't expect this to be an issue under RHEL 8 (or the clone of your choice), 
because in RHEL 8.2 systemd puts a config file in place that sets pid_max to 
2^22 (4194304).

-Olaf