Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
On Sun, Oct 16, 2016 at 03:48:42PM -0700, Joel Fernandes wrote:
> Also, one more thing about the barrier dances you mentioned: this will
> also be done by the spinlock which was there before my patch. So in
> favor of my patch, it doesn't make things any worse than they were, and
> it actually fixes the reported issue while preserving the original code
> behavior. So I think it is a good thing to fix the issue, considering
> so many people are reporting it, and any cleanups of the vmalloc code
> itself can follow.

I'm not worried about having barriers, we use them all over our
synchronization primitives. I'm worried about open-coding them instead
of having them in these well-defined helpers.

So based on this discussion and the feedback from Nick, I'll propose a
new patch (or rather a series, to make it more understandable) that
adds a mutex, adds the lock break, and gives sensible calling
conventions to __purge_vmap_area_lazy.
Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
Hi Christoph,

On Sun, Oct 16, 2016 at 3:06 PM, Joel Fernandes wrote:
> On Sat, Oct 15, 2016 at 11:10 PM, Christoph Hellwig wrote:
>> On Sat, Oct 15, 2016 at 03:59:34PM -0700, Joel Fernandes wrote:
>>> Also, could you share your concerns about use of atomic_t in my patch?
>>> I believe that since this is not a contended variable, the question of
>>> lock fairness is not a concern. It is also not really a lock the way
>>> I'm using it; it just keeps track of how many purges are in progress.
>>
>> atomic_t doesn't have any acquire/release semantics, and will require
>> odd memory barrier dances to actually get the behavior you intended.
>> And from looking at the code I can't really see why we would even
>> want synchronization behavior - for the sort of problems where we
>> don't want multiple threads to run the same code at the same time
>> for efficiency but not correctness reasons, it's usually better to
>> have batch thresholds and/or splicing into local data structures
>> before operations. Both are techniques used in this code, and I'd
>> rather rely on them, and if required improve on them, than use very
>> ad-hoc synchronization methods.
>
> Thanks for the explanation. If you know of a better way to handle the
> sync=1 case, let me know. In defense of atomics, even vmap_lazy_nr in
> the same code is atomic_t :) I am also not using it as a lock really,
> but just to count how many times something is in progress, that's all -
> I added some more comments to my last patch to make this clearer in
> the code, and now I'm also handling the case for sync=1.

Also, one more thing about the barrier dances you mentioned: this will
also be done by the spinlock which was there before my patch. So in
favor of my patch, it doesn't make things any worse than they were, and
it actually fixes the reported issue while preserving the original code
behavior. So I think it is a good thing to fix the issue, considering
so many people are reporting it, and any cleanups of the vmalloc code
itself can follow.

If you want, I can look into replacing the atomic_cmpxchg with an
atomic_inc and not doing anything different for sync vs !sync, except
for spinning when purges are pending. Would that make you feel a bit
better? So instead of:

	if (!sync && !force_flush) {
		/*
		 * In case a purge is already in progress, just return.
		 */
		if (atomic_cmpxchg(&purging, 0, 1))
			return;
	} else
		atomic_inc(&purging);

just do:

	atomic_inc(&purging);

This should be OK to do since in the !sync case, we'll just return
anyway if another purge was in progress.

Thanks,
Joel
Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
On Sat, Oct 15, 2016 at 11:10 PM, Christoph Hellwig wrote:
> On Sat, Oct 15, 2016 at 03:59:34PM -0700, Joel Fernandes wrote:
>> Your patch changes the behavior of the original code I think.
>
> It does. And it does so as I don't think the existing behavior makes
> sense, as mentioned in the changelog.
>
>> With the patch, for the case where you have 2 concurrent tasks
>> executing the alloc_vmap_area function, say both hit the overflow
>> label and enter __purge_vmap_area_lazy at the same time. The first
>> task empties the purge list and sets nr to the total number of pages
>> of all the vmap areas in the list. Say the first task has just
>> emptied the list but hasn't started freeing the vmap areas and is
>> preempted at this point. Now the second task runs, and since the
>> purge list is empty, the second task doesn't have anything to do and
>> immediately returns to alloc_vmap_area. Once it returns, it sets
>> purged to 1 in alloc_vmap_area and retries. Say it hits the overflow
>> label again in the retry path. Now because purged was set to 1, it
>> goes to err_free. Without your patch, it would have waited on the
>> spin_lock (sync = 1) instead of just erroring out, so your patch does
>> change the behavior of the original code by not using the purge_lock.
>> I realize my patch also changes the behavior, but in mine I think we
>> can make it behave like the original code by spinning until purging=0
>> (if sync = 1), because I still have the purging variable.
>
> But for sync = 1 you don't spin on it in any way. This is the logic
> in your patch:
>
>	if (!sync && !force_flush) {
>		if (atomic_cmpxchg(&purging, 0, 1))
>			return;
>	} else
>		atomic_inc(&purging);

I agree my patch doesn't spin either; I mentioned this above: "I
realize my patch also changes the behavior."

However, if you don't have too much of a problem with my use of
atomics, then my new patch below handles the case for sync=1 and is
identical in behavior to the original code while still fixing the
preempt-off latency issues.

Your patch just got rid of sync completely; I'm not sure if that's OK
to do - the alloc_vmap_area overflow case was assuming sync=1. The
original alloc_vmap_area code calls purge with sync=1 and waits for
pending purges to complete instead of erroring out. I wanted to
preserve this behavior.

>> Also, could you share your concerns about use of atomic_t in my patch?
>> I believe that since this is not a contended variable, the question of
>> lock fairness is not a concern. It is also not really a lock the way
>> I'm using it; it just keeps track of how many purges are in progress.
>
> atomic_t doesn't have any acquire/release semantics, and will require
> odd memory barrier dances to actually get the behavior you intended.
> And from looking at the code I can't really see why we would even
> want synchronization behavior - for the sort of problems where we
> don't want multiple threads to run the same code at the same time
> for efficiency but not correctness reasons, it's usually better to
> have batch thresholds and/or splicing into local data structures
> before operations. Both are techniques used in this code, and I'd
> rather rely on them, and if required improve on them, than use very
> ad-hoc synchronization methods.

Thanks for the explanation. If you know of a better way to handle the
sync=1 case, let me know. In defense of atomics, even vmap_lazy_nr in
the same code is atomic_t :) I am also not using it as a lock really,
but just to count how many times something is in progress, that's all -
I added some more comments to my last patch to make this clearer in the
code, and now I'm also handling the case for sync=1.

-8<-
From: Joel Fernandes
Date: Sat, 15 Oct 2016 01:04:14 -0700
Subject: [PATCH v4] mm: vmalloc: Replace purge_lock spinlock with atomic refcount

The purge_lock spinlock causes high latencies with non-RT kernels. This
has been reported multiple times on lkml [1] [2] and affects
applications like audio.

In this patch, I replace the spinlock with an atomic refcount so that
preemption is kept turned on during purge. This is OK to do since [3]
builds the lazy free list in advance and atomically retrieves the list,
so any instance of purge will have its own list it is purging. Since
the individual vmap area frees are themselves protected by a lock, this
is OK.

The only thing left is the fact that previously it had trylock
behavior, so preserve that by using the refcount to keep track of
in-progress purges and abort as before if there is an ongoing purge.
Lastly, decrement vmap_lazy_nr as the vmap areas are freed, and not in
advance, so that vmap_lazy_nr is not reduced too soon, as suggested in
[2].

Tests: x86_64 quad-core machine on kernel 4.8, run: cyclictest -p 99 -n
Concurrently, in a kernel module, run vmalloc and vfree of an 8K buffer
in a loop.

Preemption configuration: CONFIG_PREEMPT__LL=y (low-latency desktop)
Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
On Sat, Oct 15, 2016 at 03:59:34PM -0700, Joel Fernandes wrote:
> Your patch changes the behavior of the original code I think.

It does. And it does so as I don't think the existing behavior makes
sense, as mentioned in the changelog.

> With the patch, for the case where you have 2 concurrent tasks
> executing the alloc_vmap_area function, say both hit the overflow
> label and enter __purge_vmap_area_lazy at the same time. The first
> task empties the purge list and sets nr to the total number of pages
> of all the vmap areas in the list. Say the first task has just emptied
> the list but hasn't started freeing the vmap areas and is preempted at
> this point. Now the second task runs, and since the purge list is
> empty, the second task doesn't have anything to do and immediately
> returns to alloc_vmap_area. Once it returns, it sets purged to 1 in
> alloc_vmap_area and retries. Say it hits the overflow label again in
> the retry path. Now because purged was set to 1, it goes to err_free.
> Without your patch, it would have waited on the spin_lock (sync = 1)
> instead of just erroring out, so your patch does change the behavior
> of the original code by not using the purge_lock. I realize my patch
> also changes the behavior, but in mine I think we can make it behave
> like the original code by spinning until purging=0 (if sync = 1),
> because I still have the purging variable.

But for sync = 1 you don't spin on it in any way. This is the logic
in your patch:

	if (!sync && !force_flush) {
		if (atomic_cmpxchg(&purging, 0, 1))
			return;
	} else
		atomic_inc(&purging);

So when called from free_vmap_area_noflush you skip the whole call if
anyone else is currently purging the list, but in all other cases we
will always execute the code.

So maybe my mistake was to follow what your patch did, as I just jumped
onto it after seeing this atomic_t "synchronization". But the change in
behavior compared to yours is very limited: basically, the only
difference is that if free_vmap_area_noflush hits the purge case due to
having lots of lazy pages on the purge list, it will execute despite
some other purge being in progress.

> You should add a cond_resched_lock here as Chris Wilson suggested. I
> tried your patch both with and without cond_resched_lock in this loop,
> and without it I see the same problems my patch solves (high latencies
> on cyclictest). With cond_resched_lock, your patch does solve my
> problem, although, as I was worried above, it changes the original
> behavior.

Yes, I actually pointed that out to "zhouxianrong", who sent a patch
just after yours. I just thought lock breaking is a different aspect of
improving this whole function. In fact I suspect even my patch should
probably be split up quite a bit more.

> Also, could you share your concerns about use of atomic_t in my patch?
> I believe that since this is not a contended variable, the question of
> lock fairness is not a concern. It is also not really a lock the way
> I'm using it; it just keeps track of how many purges are in progress.

atomic_t doesn't have any acquire/release semantics, and will require
odd memory barrier dances to actually get the behavior you intended.
And from looking at the code I can't really see why we would even want
synchronization behavior - for the sort of problems where we don't want
multiple threads to run the same code at the same time for efficiency
but not correctness reasons, it's usually better to have batch
thresholds and/or splicing into local data structures before
operations. Both are techniques used in this code, and I'd rather rely
on them, and if required improve on them, than use very ad-hoc
synchronization methods.
Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
Hi Christoph,

On Sat, Oct 15, 2016 at 9:54 AM, Christoph Hellwig wrote:
> And now with a proper changelog, and the accidentally dropped call to
> flush_tlb_kernel_range reinstated:
>
> ---
> From f720cc324498ab5e7931c7ccb1653bd9b8cddc63 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig
> Date: Sat, 15 Oct 2016 18:39:44 +0200
> Subject: mm: rewrite __purge_vmap_area_lazy
>
> Remove the purge lock; there was nothing left to be protected:
>
>  - purge_fragmented_blocks seems to have its own local protection.
>  - all handling of valist is implicitly protected by the atomic
>    list deletion in llist_del_all, which also avoids multiple callers
>    stomping on each other here.
>  - the manipulation of vmap_lazy_nr already is atomic
>  - flush_tlb_kernel_range does not require any synchronization
>  - the calls to __free_vmap_area are synchronized by vmap_area_lock
>  - *start and *end always point to on-stack variables, never mind
>    that the caller never looks at the updated values anyway.
>
> Once that is done we can remove the sync argument by moving the calls
> to purge_fragmented_blocks_allcpus into the callers that need it,
> and the forced flush_tlb_kernel_range call even if no entries were
> found into the one caller that cares, and we can also pass start and
> end by value.

Your patch changes the behavior of the original code, I think. With the
patch, consider the case where two concurrent tasks are executing
alloc_vmap_area(): say both hit the overflow label and enter
__purge_vmap_area_lazy() at the same time. The first task empties the
purge list and sets nr to the total number of pages of all the vmap
areas in the list. Say the first task has just emptied the list but
hasn't started freeing the vmap areas and is preempted at this point.
Now the second task runs and, since the purge list is empty, it has
nothing to do and immediately returns to alloc_vmap_area(). Once it
returns, it sets purged to 1 in alloc_vmap_area() and retries.

Say it hits the overflow label again in the retry path. Now, because
purged was set to 1, it goes to err_free. Without your patch, it would
have waited on the spin_lock (sync = 1) instead of just erroring out, so
your patch does change the behavior of the original code by not taking
purge_lock. I realize my patch also changes the behavior, but in mine I
think we can make it behave like the original code by spinning until
purging == 0 (if sync = 1), because I still have the purging variable.

Some more comments below:

> Signed-off-by: Christoph Hellwig
> ---
>  mm/vmalloc.c | 84 +++++++++---------------------
>  1 file changed, 26 insertions(+), 58 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f2481cb..c3ca992 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -613,82 +613,44 @@ void set_iounmap_nonlazy(void)
>  	atomic_set(&vmap_lazy_nr, lazy_max_pages()+1);
>  }
>
> -/*
> - * Purges all lazily-freed vmap areas.
> - *
> - * If sync is 0 then don't purge if there is already a purge in progress.
> - * If force_flush is 1, then flush kernel TLBs between *start and *end even
> - * if we found no lazy vmap areas to unmap (callers can use this to optimise
> - * their own TLB flushing).
> - * Returns with *start = min(*start, lowest purged address)
> - *              *end = max(*end, highest purged address)
> - */
> -static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> -					int sync, int force_flush)
> +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  {
> -	static DEFINE_SPINLOCK(purge_lock);
>  	struct llist_node *valist;
>  	struct vmap_area *va;
>  	struct vmap_area *n_va;
>  	int nr = 0;
>
> -	/*
> -	 * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
> -	 * should not expect such behaviour. This just simplifies locking for
> -	 * the case that isn't actually used at the moment anyway.
> -	 */
> -	if (!sync && !force_flush) {
> -		if (!spin_trylock(&purge_lock))
> -			return;
> -	} else
> -		spin_lock(&purge_lock);
> -
> -	if (sync)
> -		purge_fragmented_blocks_allcpus();
> -
>  	valist = llist_del_all(&vmap_purge_list);
>  	llist_for_each_entry(va, valist, purge_list) {
> -		if (va->va_start < *start)
> -			*start = va->va_start;
> -		if (va->va_end > *end)
> -			*end = va->va_end;
> +		if (va->va_start < start)
> +			start = va->va_start;
> +		if (va->va_end > end)
> +			end = va->va_end;
>  		nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
>  	}
>
> -	if (nr)
> -		atomic_sub(nr, &vmap_lazy_nr);
> -
> -	if (nr || force_flush)
> -		flush_tlb_kernel_range(*start, *end);
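The behavioral point above can be made concrete with a small userspace model (illustrative only, not kernel code; the names `try_purge` and `purge_list_len` are made up for the sketch): a trylock-style purge bails out instead of waiting, so a second caller can return having purged nothing, which is exactly what lets the retry path see purged == 1 and hit err_free.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool purging;		/* models purge_lock / the purging flag */
static atomic_int purge_list_len;	/* models the vmap purge list length */

/* Trylock-style caller: returns false without waiting if a purge is
 * already in flight, instead of blocking until it completes. */
bool try_purge(void)
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&purging, &expected, true))
		return false;	/* somebody else is purging: give up */
	atomic_store(&purge_list_len, 0);	/* drain the list */
	atomic_store(&purging, false);
	return true;
}
```

With a spinlock the second caller would have waited and then observed the freed memory; with the trylock it returns false immediately and its caller must cope with nothing having been reclaimed yet.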
Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
And now with a proper changelog, and the accidentally dropped call to
flush_tlb_kernel_range reinstated:

---
From f720cc324498ab5e7931c7ccb1653bd9b8cddc63 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Sat, 15 Oct 2016 18:39:44 +0200
Subject: mm: rewrite __purge_vmap_area_lazy

Remove the purge lock; there was nothing left to be protected:

 - purge_fragmented_blocks seems to have its own local protection.
 - all handling of valist is implicitly protected by the atomic
   list deletion in llist_del_all, which also avoids multiple callers
   stomping on each other here.
 - the manipulation of vmap_lazy_nr already is atomic
 - flush_tlb_kernel_range does not require any synchronization
 - the calls to __free_vmap_area are synchronized by vmap_area_lock
 - *start and *end always point to on-stack variables, never mind
   that the caller never looks at the updated values anyway.

Once that is done we can remove the sync argument by moving the calls
to purge_fragmented_blocks_allcpus into the callers that need it,
and the forced flush_tlb_kernel_range call even if no entries were
found into the one caller that cares, and we can also pass start and
end by value.

Signed-off-by: Christoph Hellwig
---
 mm/vmalloc.c | 84 +++++++++---------------------
 1 file changed, 26 insertions(+), 58 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb..c3ca992 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -613,82 +613,44 @@ void set_iounmap_nonlazy(void)
 	atomic_set(&vmap_lazy_nr, lazy_max_pages()+1);
 }
 
-/*
- * Purges all lazily-freed vmap areas.
- *
- * If sync is 0 then don't purge if there is already a purge in progress.
- * If force_flush is 1, then flush kernel TLBs between *start and *end even
- * if we found no lazy vmap areas to unmap (callers can use this to optimise
- * their own TLB flushing).
- * Returns with *start = min(*start, lowest purged address)
- *              *end = max(*end, highest purged address)
- */
-static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
-					int sync, int force_flush)
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 {
-	static DEFINE_SPINLOCK(purge_lock);
 	struct llist_node *valist;
 	struct vmap_area *va;
 	struct vmap_area *n_va;
 	int nr = 0;
 
-	/*
-	 * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
-	 * should not expect such behaviour. This just simplifies locking for
-	 * the case that isn't actually used at the moment anyway.
-	 */
-	if (!sync && !force_flush) {
-		if (!spin_trylock(&purge_lock))
-			return;
-	} else
-		spin_lock(&purge_lock);
-
-	if (sync)
-		purge_fragmented_blocks_allcpus();
-
 	valist = llist_del_all(&vmap_purge_list);
 	llist_for_each_entry(va, valist, purge_list) {
-		if (va->va_start < *start)
-			*start = va->va_start;
-		if (va->va_end > *end)
-			*end = va->va_end;
+		if (va->va_start < start)
+			start = va->va_start;
+		if (va->va_end > end)
+			end = va->va_end;
 		nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
 	}
 
-	if (nr)
-		atomic_sub(nr, &vmap_lazy_nr);
-
-	if (nr || force_flush)
-		flush_tlb_kernel_range(*start, *end);
-
-	if (nr) {
-		spin_lock(&vmap_area_lock);
-		llist_for_each_entry_safe(va, n_va, valist, purge_list)
-			__free_vmap_area(va);
-		spin_unlock(&vmap_area_lock);
-	}
-	spin_unlock(&purge_lock);
-}
+	if (!nr)
+		return false;
 
-/*
- * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
- * is already purging.
- */
-static void try_purge_vmap_area_lazy(void)
-{
-	unsigned long start = ULONG_MAX, end = 0;
+	atomic_sub(nr, &vmap_lazy_nr);
+	flush_tlb_kernel_range(start, end);
 
-	__purge_vmap_area_lazy(&start, &end, 0, 0);
+	spin_lock(&vmap_area_lock);
+	llist_for_each_entry_safe(va, n_va, valist, purge_list)
+		__free_vmap_area(va);
+	spin_unlock(&vmap_area_lock);
+	return true;
 }
 
 /*
- * Kick off a purge of the outstanding lazy areas.
+ * Kick off a purge of the outstanding lazy areas, including the fragment
+ * blocks on the per-cpu lists.
  */
 static void purge_vmap_area_lazy(void)
 {
-	unsigned long start = ULONG_MAX, end = 0;
-
-	__purge_vmap_area_lazy(&start, &end, 1, 0);
+	purge_fragmented_blocks_allcpus();
+	__purge_vmap_area_lazy(ULONG_MAX, 0);
 }
 
 /*
@@ -706,8 +668,12 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	/* After this point, we may free va at any time */
 	llist_add(&va->purge_list, &vmap_purge_list);
 
+	/*
+	 * Kick off a purge of the outstanding lazy areas. Don't bother to
+	 * purge the per-cpu lists of fragmented blocks.
+	 */
 	if (unlikely(nr_lazy > lazy_max_pages()))
-		try_purge_vmap_area_lazy();
+		__purge_vmap_area_lazy(ULONG_MAX, 0);
 }
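The changelog's key claim is that llist_del_all makes the purge list safe without a lock: each caller atomically takes the whole chain, so no two purgers can ever process the same node. A userspace sketch of that pattern (illustrative only; `list_add`/`list_del_all` here are stand-ins for the kernel's llist helpers):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int pages;
};

static _Atomic(struct node *) purge_head;

/* Lock-free push, in the style of llist_add. */
void list_add(struct node *n)
{
	struct node *old = atomic_load(&purge_head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&purge_head, &old, n));
}

/* Atomically take the whole list, in the style of llist_del_all:
 * a single exchange hands the entire chain to exactly one caller. */
struct node *list_del_all(void)
{
	return atomic_exchange(&purge_head, NULL);
}
```

Because the exchange is atomic, a concurrent second caller gets NULL back and simply has nothing to free, which is why the purge loop itself needs no lock.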
Re: [PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
atomic_t magic for locking is always a bit iffy. The real question is:
what is purge_lock even supposed to protect?

 - purge_fragmented_blocks seems to have its own local protection.
 - all handling of valist is implicitly protected by the atomic
   list deletion in llist_del_all
 - the manipulation of vmap_lazy_nr already is atomic
 - flush_tlb_kernel_range should not require any synchronization
 - the calls to __free_vmap_area are synchronized by vmap_area_lock
 - *start and *end always point to on-stack variables, never mind
   that the caller never looks at the updated values anyway.

And once we use the cmpxchg in llist_del_all as the effective protection
against multiple concurrent callers (slightly sloppy), there should be
nothing else left to synchronize.

And while we're at it: the code structure here is totally grotty. So
what about the patch below (only boot tested):

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb..c2b9a98 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -613,82 +613,43 @@ void set_iounmap_nonlazy(void)
 	atomic_set(&vmap_lazy_nr, lazy_max_pages()+1);
 }
 
-/*
- * Purges all lazily-freed vmap areas.
- *
- * If sync is 0 then don't purge if there is already a purge in progress.
- * If force_flush is 1, then flush kernel TLBs between *start and *end even
- * if we found no lazy vmap areas to unmap (callers can use this to optimise
- * their own TLB flushing).
- * Returns with *start = min(*start, lowest purged address)
- *              *end = max(*end, highest purged address)
- */
-static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
-					int sync, int force_flush)
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 {
-	static DEFINE_SPINLOCK(purge_lock);
 	struct llist_node *valist;
 	struct vmap_area *va;
 	struct vmap_area *n_va;
 	int nr = 0;
 
-	/*
-	 * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
-	 * should not expect such behaviour. This just simplifies locking for
-	 * the case that isn't actually used at the moment anyway.
-	 */
-	if (!sync && !force_flush) {
-		if (!spin_trylock(&purge_lock))
-			return;
-	} else
-		spin_lock(&purge_lock);
-
-	if (sync)
-		purge_fragmented_blocks_allcpus();
-
 	valist = llist_del_all(&vmap_purge_list);
 	llist_for_each_entry(va, valist, purge_list) {
-		if (va->va_start < *start)
-			*start = va->va_start;
-		if (va->va_end > *end)
-			*end = va->va_end;
+		if (va->va_start < start)
+			start = va->va_start;
+		if (va->va_end > end)
+			end = va->va_end;
 		nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
 	}
 
-	if (nr)
-		atomic_sub(nr, &vmap_lazy_nr);
-
-	if (nr || force_flush)
-		flush_tlb_kernel_range(*start, *end);
-
-	if (nr) {
-		spin_lock(&vmap_area_lock);
-		llist_for_each_entry_safe(va, n_va, valist, purge_list)
-			__free_vmap_area(va);
-		spin_unlock(&vmap_area_lock);
-	}
-	spin_unlock(&purge_lock);
-}
+	if (!nr)
+		return false;
 
-/*
- * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
- * is already purging.
- */
-static void try_purge_vmap_area_lazy(void)
-{
-	unsigned long start = ULONG_MAX, end = 0;
+	atomic_sub(nr, &vmap_lazy_nr);
 
-	__purge_vmap_area_lazy(&start, &end, 0, 0);
+	spin_lock(&vmap_area_lock);
+	llist_for_each_entry_safe(va, n_va, valist, purge_list)
+		__free_vmap_area(va);
+	spin_unlock(&vmap_area_lock);
+	return true;
 }
 
 /*
- * Kick off a purge of the outstanding lazy areas.
+ * Kick off a purge of the outstanding lazy areas, including the fragment
+ * blocks on the per-cpu lists.
  */
 static void purge_vmap_area_lazy(void)
 {
-	unsigned long start = ULONG_MAX, end = 0;
-
-	__purge_vmap_area_lazy(&start, &end, 1, 0);
+	purge_fragmented_blocks_allcpus();
+	__purge_vmap_area_lazy(ULONG_MAX, 0);
 }
 
 /*
@@ -706,8 +667,12 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	/* After this point, we may free va at any time */
 	llist_add(&va->purge_list, &vmap_purge_list);
 
+	/*
+	 * Kick off a purge of the outstanding lazy areas. Don't bother to
+	 * purge the per-cpu lists of fragmented blocks.
+	 */
 	if (unlikely(nr_lazy > lazy_max_pages()))
-		try_purge_vmap_area_lazy();
+		__purge_vmap_area_lazy(ULONG_MAX, 0);
 }
 
 /*
@@ -1094,7 +1059,9 @@ void vm_unmap_aliases(void)
 		rcu_read_unlock();
 	}
 
-	__purge_vmap_area_lazy(&start, &end, 1, flush);
+	purge_fragmented_blocks_allcpus();
+	if (!__purge_vmap_area_lazy(start, end) && flush)
+		flush_tlb_kernel_range(start, end);
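The vm_unmap_aliases hunk shows the new calling convention in miniature: the purge helper takes start/end by value, returns whether it freed anything, and the one caller that needs a forced flush performs it itself instead of pushing a force_flush flag down. A userspace sketch (all names here — `purge_range`, `flush_range`, `unmap_aliases` — are hypothetical stand-ins, not the kernel functions):

```c
#include <stdbool.h>

static int flushes;	/* counts calls to the flush stand-in */

/* Stand-in for __purge_vmap_area_lazy: returns true only if it
 * actually freed (and therefore already flushed) something. */
static bool purge_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return false;	/* pretend nothing was queued for purging */
}

/* Stand-in for flush_tlb_kernel_range. */
static void flush_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	flushes++;
}

/* Mirrors the vm_unmap_aliases() hunk: flush only if the purge did
 * not already do it and the caller saw dirty aliases. */
static void unmap_aliases(unsigned long start, unsigned long end, bool flush)
{
	if (!purge_range(start, end) && flush)
		flush_range(start, end);
}
```

The bool return value replaces the old in/out pointer parameters: the caller no longer cares what range was purged, only whether a flush already happened on its behalf.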
[PATCH v3] mm: vmalloc: Replace purge_lock spinlock with atomic refcount
The purge_lock spinlock causes high latencies with a non-RT kernel. This
has been reported multiple times on lkml [1] [2] and affects
applications like audio.

In this patch, I replace the spinlock with an atomic refcount so that
preemption is kept turned on during purge. This is OK to do since [3]
builds the lazy free list in advance and atomically retrieves the list,
so any instance of purge will have its own list it is purging. Since the
individual vmap area frees are themselves protected by a lock, this is
OK. The only thing left is the fact that previously it had trylock
behavior, so preserve that by using the refcount to keep track of
in-progress purges and bail out as before if there is an ongoing purge.
Lastly, decrement vmap_lazy_nr as the vmap areas are freed, and not in
advance, so that vmap_lazy_nr is not reduced too soon, as suggested in
[2].

Tests: on an x86_64 quad core machine on kernel 4.8, run:
cyclictest -p 99 -n
Concurrently, in a kernel module, run vmalloc and vfree of an 8K buffer
in a loop.
Preemption configuration: CONFIG_PREEMPT__LL=y (low-latency desktop)

Without patch, cyclictest output:
policy: fifo: loadavg: 0.05 0.01 0.00 1/85 1272    Avg: 128 Max: 1177
policy: fifo: loadavg: 0.11 0.03 0.01 2/87 1447    Avg: 122 Max: 1897
policy: fifo: loadavg: 0.10 0.03 0.01 1/89 1656    Avg:  93 Max: 2886

With patch, cyclictest output:
policy: fifo: loadavg: 1.15 0.68 0.30 1/162 8399   Avg:  92 Max:  284
policy: fifo: loadavg: 1.21 0.71 0.32 2/164 9840   Avg:  94 Max:  296
policy: fifo: loadavg: 1.18 0.72 0.32 2/166 11253  Avg: 107 Max:  321

[1] http://lists.openwall.net/linux-kernel/2016/03/23/29
[2] https://lkml.org/lkml/2016/10/9/59
[3] https://lkml.org/lkml/2016/4/15/287

[thanks Chris for the cond_resched_lock ideas]

Cc: Chris Wilson
Cc: Jisheng Zhang
Cc: John Dias
Cc: Andrew Morton
Signed-off-by: Joel Fernandes
---
v3 changes: Fixed ordering of va pointer access and __free_vmap_area
v2 changes: Updated test description in commit message, and typos.

 mm/vmalloc.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 613d1d9..0270723 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -626,11 +626,11 @@ void set_iounmap_nonlazy(void)
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 					int sync, int force_flush)
 {
-	static DEFINE_SPINLOCK(purge_lock);
+	static atomic_t purging;
 	struct llist_node *valist;
 	struct vmap_area *va;
 	struct vmap_area *n_va;
-	int nr = 0;
+	int dofree = 0;
 
 	/*
 	 * If sync is 0 but force_flush is 1, we'll go sync anyway but callers
@@ -638,10 +638,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 	 * the case that isn't actually used at the moment anyway.
 	 */
 	if (!sync && !force_flush) {
-		if (!spin_trylock(&purge_lock))
+		if (atomic_cmpxchg(&purging, 0, 1))
 			return;
 	} else
-		spin_lock(&purge_lock);
+		atomic_inc(&purging);
 
 	if (sync)
 		purge_fragmented_blocks_allcpus();
@@ -652,22 +652,23 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 			*start = va->va_start;
 		if (va->va_end > *end)
 			*end = va->va_end;
-		nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
+		dofree = 1;
 	}
 
-	if (nr)
-		atomic_sub(nr, &vmap_lazy_nr);
-
-	if (nr || force_flush)
+	if (dofree || force_flush)
 		flush_tlb_kernel_range(*start, *end);
 
-	if (nr) {
+	if (dofree) {
 		spin_lock(&vmap_area_lock);
-		llist_for_each_entry_safe(va, n_va, valist, purge_list)
+		llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+			int nrfree = ((va->va_end - va->va_start) >> PAGE_SHIFT);
 			__free_vmap_area(va);
+			atomic_sub(nrfree, &vmap_lazy_nr);
+			cond_resched_lock(&vmap_area_lock);
+		}
 		spin_unlock(&vmap_area_lock);
 	}
-	spin_unlock(&purge_lock);
+	atomic_dec(&purging);
 }
 
 /*
-- 
2.8.0.rc3.226.g39d4020
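The accounting change at the end of the patch (decrement vmap_lazy_nr per freed area inside the loop, not the whole batch up front) can be modeled in a few lines of userspace C (illustrative only; `struct area`, `lazy_nr`, and `free_areas` are made-up stand-ins for the kernel structures):

```c
#include <stdatomic.h>
#include <stddef.h>

struct area {
	struct area *next;
	long pages;
};

static atomic_long lazy_nr;	/* models vmap_lazy_nr */

/* Free a chain of areas, accounting for each one only after it is
 * gone -- so the global counter never drops below the number of
 * pages still queued, unlike subtracting the batch in advance. */
void free_areas(struct area *list)
{
	for (struct area *a = list; a != NULL; a = a->next) {
		/* ...actual freeing of the area would happen here... */
		atomic_fetch_sub(&lazy_nr, a->pages);
		/* kernel version also calls cond_resched_lock() here
		 * to bound the time spent holding vmap_area_lock */
	}
}
```

This is the fix [2] asked for: with up-front subtraction, a preempted purger makes the counter claim pages are reclaimed while they still sit on its private list.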