Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 09:35:13AM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote:
> >
> > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote:
> > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> > > > On 7/11/23 12:35, Leon Romanovsky wrote:
> > > > >
> > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> > > > >
> > > > > <...>
> > > > >
> > > > >> Laurent Dufour (1):
> > > > >>   powerc/mm: try VMA lock-based page fault handling first
> > > > >
> > > > > Hi,
> > > > >
> > > > > This series and specifically the commit above broke docker over PPC.
> > > > > It causes the docker service to get stuck while trying to activate.
> > > > > Reverting this commit allows us to use docker again.
> > > >
> > > > Hi,
> > > >
> > > > there have been follow-up fixes that are part of 6.4.3 stable (also
> > > > 6.5-rc1). Does that version work for you?
> > >
> > > I'll recheck it again on a clean system, but for the record:
> > > 1. We are running 6.5-rc1 kernels.
> > > 2. PPC doesn't compile for us on -rc1 without this fix.
> > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/
> >
> > Ohh, I see it in -rc1, let's recheck.
>
> Hi Leon,
> Please let us know how it goes.

Once we rebuilt a clean -rc1, docker worked for us. Sorry for the noise.

> > > 3. I didn't see anything relevant in -rc1 with "git log
> > > arch/powerpc/mm/fault.c".
>
> The fixes Vlastimil was referring to are not in fault.c, they are
> in the main mm and fork code. More specifically, check that these
> patches exist in the branch you are testing:
>
> mm: lock newly mapped VMA with corrected ordering
> fork: lock VMAs of the parent process when forking
> mm: lock newly mapped VMA which can be modified after it becomes visible
> mm: lock a vma before stack expansion

Thanks

> Thanks,
> Suren.
>
> > >
> > > Do you have in mind anything specific to check?
> > >
> > > Thanks
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to kernel-team+unsubscr...@android.com.
> >
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote:
>
> On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> > > On 7/11/23 12:35, Leon Romanovsky wrote:
> > > >
> > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> > > >
> > > > <...>
> > > >
> > > >> Laurent Dufour (1):
> > > >>   powerc/mm: try VMA lock-based page fault handling first
> > > >
> > > > Hi,
> > > >
> > > > This series and specifically the commit above broke docker over PPC.
> > > > It causes the docker service to get stuck while trying to activate.
> > > > Reverting this commit allows us to use docker again.
> > >
> > > Hi,
> > >
> > > there have been follow-up fixes that are part of 6.4.3 stable (also
> > > 6.5-rc1). Does that version work for you?
> >
> > I'll recheck it again on a clean system, but for the record:
> > 1. We are running 6.5-rc1 kernels.
> > 2. PPC doesn't compile for us on -rc1 without this fix.
> > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/
>
> Ohh, I see it in -rc1, let's recheck.

Hi Leon,
Please let us know how it goes.

> > 3. I didn't see anything relevant in -rc1 with "git log
> > arch/powerpc/mm/fault.c".

The fixes Vlastimil was referring to are not in fault.c, they are
in the main mm and fork code. More specifically, check that these
patches exist in the branch you are testing:

mm: lock newly mapped VMA with corrected ordering
fork: lock VMAs of the parent process when forking
mm: lock newly mapped VMA which can be modified after it becomes visible
mm: lock a vma before stack expansion

Thanks,
Suren.

> >
> > Do you have in mind anything specific to check?
> >
> > Thanks
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kernel-team+unsubscr...@android.com.
>
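One quick way to confirm that the four fixes listed above are present in
the branch under test is to grep its git log for each subject line. A
minimal sketch in the same style as the bisect script later in this
thread (a hypothetical helper, not part of the thread; assumes Python
3.7+ and that it is run from inside the kernel source tree):

import subprocess

fixes = [
    'mm: lock newly mapped VMA with corrected ordering',
    'fork: lock VMAs of the parent process when forking',
    'mm: lock newly mapped VMA which can be modified after it becomes visible',
    'mm: lock a vma before stack expansion',
]

for title in fixes:
    # --fixed-strings: match the subject literally, not as a regex.
    log = subprocess.run(
        ['git', 'log', '--oneline', '--fixed-strings', f'--grep={title}'],
        capture_output=True, text=True).stdout.strip()
    print('found:' if log else 'MISSING:', title)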
Re: [PATCH v4 00/33] Per-VMA locks
On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:

<...>

> Laurent Dufour (1):
>   powerc/mm: try VMA lock-based page fault handling first

Hi,

This series and specifically the commit above broke docker over PPC.
It causes the docker service to get stuck while trying to activate.
Reverting this commit allows us to use docker again.

[user@ppc-135-3-200-205 ~]# sudo systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
     Active: activating (start) since Mon 2023-06-26 14:47:07 IDT; 3h 50min ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 276555 (dockerd)
     Memory: 44.2M
     CGroup: /system.slice/docker.service
             └─276555 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129383166+03:00" level=info msg="Graph migration to content-addressability took 0.00 se>
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129666160+03:00" level=warning msg="Your kernel does not support cgroup cfs period"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129684117+03:00" level=warning msg="Your kernel does not support cgroup cfs quotas"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129697085+03:00" level=warning msg="Your kernel does not support cgroup rt period"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129711513+03:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129720656+03:00" level=warning msg="Unable to find blkio cgroup in mounts"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129805617+03:00" level=warning msg="mountpoint for pids not found"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.130199070+03:00" level=info msg="Loading containers: start."
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.132688568+03:00" level=warning msg="Running modprobe bridge br_netfilter failed with me>
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.271014050+03:00" level=info msg="Default bridge (docker0) is assigned with an IP addres>

The Python script we used for the bisect:

import subprocess
import time
import sys


def run_command(cmd):
    print('running:', cmd)

    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)

    try:
        stdout, stderr = p.communicate(timeout=30)
    except subprocess.TimeoutExpired:
        return True

    print(stdout.decode())
    print(stderr.decode())
    print('rc:', p.returncode)

    return False


def main():
    commands = [
        'sudo systemctl stop docker',
        'sudo systemctl status docker',
        'sudo systemctl is-active docker',
        'sudo systemctl start docker',
        'sudo systemctl status docker',
    ]

    for i in range(1000):
        title = f'Try no. {i + 1}'
        print('*' * 50, title, '*' * 50)

        for cmd in commands:
            if run_command(cmd):
                print(f'Reproduced on try no. {i + 1}!')
                print(f'"{cmd}" is stuck!')
                return 1

        print('\n')
        time.sleep(30)

    return 0


if __name__ == '__main__':
    sys.exit(main())

Thanks
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote:
> On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> > On 7/11/23 12:35, Leon Romanovsky wrote:
> > >
> > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> > >
> > > <...>
> > >
> > >> Laurent Dufour (1):
> > >>   powerc/mm: try VMA lock-based page fault handling first
> > >
> > > Hi,
> > >
> > > This series and specifically the commit above broke docker over PPC.
> > > It causes the docker service to get stuck while trying to activate.
> > > Reverting this commit allows us to use docker again.
> >
> > Hi,
> >
> > there have been follow-up fixes that are part of 6.4.3 stable (also
> > 6.5-rc1). Does that version work for you?
>
> I'll recheck it again on a clean system, but for the record:
> 1. We are running 6.5-rc1 kernels.
> 2. PPC doesn't compile for us on -rc1 without this fix.
> https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/

Ohh, I see it in -rc1, let's recheck.

> 3. I didn't see anything relevant in -rc1 with "git log arch/powerpc/mm/fault.c".
>
> Do you have in mind anything specific to check?
>
> Thanks
>
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> On 7/11/23 12:35, Leon Romanovsky wrote:
> >
> > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> >
> > <...>
> >
> >> Laurent Dufour (1):
> >>   powerc/mm: try VMA lock-based page fault handling first
> >
> > Hi,
> >
> > This series and specifically the commit above broke docker over PPC.
> > It causes the docker service to get stuck while trying to activate.
> > Reverting this commit allows us to use docker again.
>
> Hi,
>
> there have been follow-up fixes that are part of 6.4.3 stable (also
> 6.5-rc1). Does that version work for you?

I'll recheck it again on a clean system, but for the record:
1. We are running 6.5-rc1 kernels.
2. PPC doesn't compile for us on -rc1 without this fix.
https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/
3. I didn't see anything relevant in -rc1 with "git log arch/powerpc/mm/fault.c".

Do you have in mind anything specific to check?

Thanks
Re: [PATCH v4 00/33] Per-VMA locks
On 7/11/23 12:35, Leon Romanovsky wrote:
>
> On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
>
> <...>
>
>> Laurent Dufour (1):
>>   powerc/mm: try VMA lock-based page fault handling first
>
> Hi,
>
> This series and specifically the commit above broke docker over PPC.
> It causes the docker service to get stuck while trying to activate.
> Reverting this commit allows us to use docker again.

Hi,

there have been follow-up fixes that are part of 6.4.3 stable (also
6.5-rc1). Does that version work for you?

Vlastimil

> [user@ppc-135-3-200-205 ~]# sudo systemctl status docker
> ● docker.service - Docker Application Container Engine
>      Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
>      Active: activating (start) since Mon 2023-06-26 14:47:07 IDT; 3h 50min ago
> TriggeredBy: ● docker.socket
>        Docs: https://docs.docker.com
>    Main PID: 276555 (dockerd)
>      Memory: 44.2M
>      CGroup: /system.slice/docker.service
>              └─276555 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129383166+03:00" level=info msg="Graph migration to content-addressability took 0.00 se>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129666160+03:00" level=warning msg="Your kernel does not support cgroup cfs period"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129684117+03:00" level=warning msg="Your kernel does not support cgroup cfs quotas"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129697085+03:00" level=warning msg="Your kernel does not support cgroup rt period"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129711513+03:00" level=warning msg="Your kernel does not support cgroup rt runtime"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129720656+03:00" level=warning msg="Unable to find blkio cgroup in mounts"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129805617+03:00" level=warning msg="mountpoint for pids not found"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.130199070+03:00" level=info msg="Loading containers: start."
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.132688568+03:00" level=warning msg="Running modprobe bridge br_netfilter failed with me>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.271014050+03:00" level=info msg="Default bridge (docker0) is assigned with an IP addres>
>
> The Python script we used for the bisect:
>
> import subprocess
> import time
> import sys
>
>
> def run_command(cmd):
>     print('running:', cmd)
>
>     p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
>                          stderr=subprocess.PIPE)
>
>     try:
>         stdout, stderr = p.communicate(timeout=30)
>     except subprocess.TimeoutExpired:
>         return True
>
>     print(stdout.decode())
>     print(stderr.decode())
>     print('rc:', p.returncode)
>
>     return False
>
>
> def main():
>     commands = [
>         'sudo systemctl stop docker',
>         'sudo systemctl status docker',
>         'sudo systemctl is-active docker',
>         'sudo systemctl start docker',
>         'sudo systemctl status docker',
>     ]
>
>     for i in range(1000):
>         title = f'Try no. {i + 1}'
>         print('*' * 50, title, '*' * 50)
>
>         for cmd in commands:
>             if run_command(cmd):
>                 print(f'Reproduced on try no. {i + 1}!')
>                 print(f'"{cmd}" is stuck!')
>                 return 1
>
>         print('\n')
>         time.sleep(30)
>
>     return 0
>
> if __name__ == '__main__':
>     sys.exit(main())
>
> Thanks
[PATCH v4 00/33] Per-VMA locks
Previous versions:
v3: https://lore.kernel.org/all/20230216051750.3125598-1-sur...@google.com/
v2: https://lore.kernel.org/lkml/20230127194110.533103-1-sur...@google.com/
v1: https://lore.kernel.org/all/20230109205336.3665937-1-sur...@google.com/
RFC: https://lore.kernel.org/all/20220901173516.702122-1-sur...@google.com/

LWN article describing the feature:
https://lwn.net/Articles/906852/

The per-VMA locks idea was discussed during the SPF [1] session at
LSF/MM last year [2], which concluded with the suggestion that "a
reader/writer semaphore could be put into the VMA itself; that would
have the effect of using the VMA as a sort of range lock. There would
still be contention at the VMA level, but it would be an improvement."
This patchset implements that suggested approach.

When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.

One notable way the implementation deviates from the proposal is the way
VMAs are read-locked. During some mm updates, multiple VMAs need to be
locked until the end of the update (e.g. vma_merge, split_vma, etc.).
Tracking all the locked VMAs, avoiding recursive locks, and figuring out
when it is safe to unlock previously locked VMAs would make the code
more complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.

We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is absolutely necessary, but it results in a big reduction
of code complexity.

Read-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. The VMA is considered
read-locked when these sequence numbers are equal. To read-lock a VMA we
set the sequence number in vm_area_struct to be equal to the sequence
number in mm_struct. To unlock all VMAs we increment mm_struct's
sequence number. This allows for an efficient way to track locked VMAs
and to drop the locks on all VMAs at the end of the update (a toy model
of this scheme is sketched at the end of this letter).

The patchset implements per-VMA locking only for anonymous pages which
are not in swap and avoids userfaultfd, as their implementation is more
complex. Additional support for file-backed page faults, swapped and
user pages can be added incrementally.

Performance benchmarks show similar although slightly smaller benefits
than with the SPF patchset (~75% of SPF benefits). Still, with lower
complexity this approach might be more desirable.

Since the RFC was posted in September 2022, two separate Google teams
outside of Android evaluated the patchset and confirmed positive
results. Here are the known usecases where per-VMA locks show benefits:

Android:
Apps with a high number of threads (~100) see launch times improve by up
to 20%. Each thread mmaps several areas upon startup (stack,
thread-local storage (TLS), thread signal stack, indirect ref table),
which requires taking mmap_lock in write mode. Page faults take
mmap_lock in read mode. During app launch, both thread creation and page
faults establishing the active working set happen in parallel, and that
causes lock contention between mm writers and readers even if updates
and page faults happen in different VMAs. Per-VMA locks prevent this
contention by providing a more granular lock.
Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is
backed by such a thread pool. When idling, only a small number of idle
worker threads are available; when a spike of incoming requests arrives,
each request is handled in its own "fiber", which is a work item posted
onto a UMCG worker thread; quite often these spikes lead to a number of
new threads spawning. Each new thread needs to allocate and register an
RSEQ section on its TLS, then register itself with the kernel as a UMCG
worker thread, and only after that can it be considered by the
in-process UMCG/Fiber scheduler as available to do useful work. In
short, during an incoming workload spike new threads have to be spawned,
and they perform several syscalls (RSEQ registration, UMCG worker
registration, memory allocations) before they can actually start doing
useful work. Removing any bottlenecks on this thread startup path will
greatly improve our services' latencies when faced with request/workload
spikes.

At high scale, mmap_lock contention during thread creation and stack
page faults leads to user-visible multi-second serving latencies in a
similar pattern to Android app startup. The per-VMA locking patchset has
been run successfully in limited experiments with user-facing production
workloads. In these experiments, we observed that the peak thread creation
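To make the sequence-number scheme described earlier concrete, here is a
toy single-process Python model. It is illustrative only: the names
(MMStruct, VMAStruct, vma_mark_locked, mmap_write_unlock) are
simplifications invented for this sketch rather than the kernel API, and
RCU lookup and the actual lock acquisition are reduced to stubs.

class MMStruct:
    def __init__(self):
        self.lock_seq = 0                    # mm_struct's sequence number

class VMAStruct:
    def __init__(self, mm):
        self.mm = mm
        self.lock_seq = mm.lock_seq - 1      # unequal => not locked

def vma_mark_locked(vma):
    # During an mm update (mmap_lock held for write): setting the two
    # sequence numbers equal marks the VMA as locked.
    vma.lock_seq = vma.mm.lock_seq

def vma_is_locked(vma):
    return vma.lock_seq == vma.mm.lock_seq

def mmap_write_unlock(mm):
    # End of the update: a single increment unlocks every marked VMA,
    # which is what makes bulk unlock O(1).
    mm.lock_seq += 1

def handle_page_fault(vma):
    # Fault path: use the per-VMA lock unless the VMA is being updated,
    # in which case fall back to mmap_lock as before.
    if vma_is_locked(vma):
        return 'fall back to mmap_lock'
    return 'handled under per-VMA lock'

if __name__ == '__main__':
    mm = MMStruct()
    vma_a, vma_b = VMAStruct(mm), VMAStruct(mm)
    vma_mark_locked(vma_a)                   # an update touching two VMAs
    vma_mark_locked(vma_b)
    assert handle_page_fault(vma_a) == 'fall back to mmap_lock'
    mmap_write_unlock(mm)                    # bulk unlock via one increment
    assert handle_page_fault(vma_b) == 'handled under per-VMA lock'

The model shows why the deviation from the usual lock/unlock pattern
pays off: the updater never has to remember which VMAs it locked, since
incrementing the mm-wide sequence number releases all of them at once.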