Hi, I noticed a node today with a load average of around 5,000. After logging in to investigate, I found thousands of flock processes running.
Snipped output of ps aux | grep flock:

root 80850 0.0 0.0 4056 484 ? S 11:48 0:00 flock -x /cgroup/memory -c /etc/slurm/cgroup/release_memory sync /slurm/uid_3373/job_959282
root 80851 0.0 0.0 4056 488 ? S 11:48 0:00 flock -x /cgroup/memory -c /etc/slurm/cgroup/release_memory sync /slurm/uid_3373/job_959282
root 80852 0.0 0.0 4056 484 ? S 11:48 0:00 flock -x /cgroup/memory -c /etc/slurm/cgroup/release_memory sync /slurm/uid_3373/job_959282
root 81016 0.0 0.0 4056 480 ? S 11:48 0:00 flock -x /cgroup/memory -c rmdir /cgroup/memory/slurm/uid_3373
root 81026 0.0 0.0 4056 484 ? S 11:48 0:00 flock -x /cgroup/memory -c rmdir /cgroup/memory/slurm/uid_3373
root 81470 0.0 0.0 103248 900 pts/1 S+ 11:48 0:00 grep flock

I haven’t seen this before. It may have something to do with the type of job running on the node. What’s going on here? Bad code in the job, or a bad Slurm config somewhere? Is there anything that can be done to fix this behavior?

Slurm version: 14.11.3

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
