Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-18 Thread Nadav Amit



> On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> 
> On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
>> 
>> 
>> 
>> I might be missing something, but it seems a bit racy.
>> 
>> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
>> point, since you don’t hold the text_mutex, some might do text_poke(),
>> e.g., by enabling/disabling static-key, and the update would be
>> overwritten. No?
> 
> Right :(
> Even worse, in the UP case alternatives_smp_unlock() will "patch" a still-empty
> area.
> 
> So I'm thinking about calling alternatives_smp_module_add() from an
> additional callback after the execmem_update_copy().
> 
> Does it make sense to you?

Going over the code again - I might have just been wrong: I confused the
alternatives and the jump-label mechanisms (as they do share a lot of
code and characteristics).

The jump-labels are updated when prepare_coming_module() is called, which
happens after post_relocation() [which means they would be updated using
text_poke() “inefficiently” but should be safe].

The “alternatives” appear to use text_poke() (in contrast to
text_poke_early()) only from a few very specific flows, e.g.,
common_cpu_up() -> alternatives_enable_smp().

Do those flows pose a problem after boot?

Anyhow, sorry for the noise.


Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-16 Thread Nadav Amit



> On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> 
> @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const 
> struct load_info *info)
>   add_kallsyms(mod, info);
> 
>   /* Arch-specific module finalizing. */
> - return module_finalize(info->hdr, info->sechdrs, mod);
> + ret = module_finalize(info->hdr, info->sechdrs, mod);
> + if (ret)
> + return ret;
> +
> + for_each_mod_mem_type(type) {
> + struct module_memory *mem = &mod->mem[type];
> +
> + if (mem->is_rox) {
> + if (!execmem_update_copy(mem->base, mem->rw_copy,
> +  mem->size))
> + return -ENOMEM;
> +
> + vfree(mem->rw_copy);
> + mem->rw_copy = NULL;
> + }
> + }
> +
> + return 0;
> }

I might be missing something, but it seems a bit racy.

IIUC, module_finalize() calls alternatives_smp_module_add(). At this
point, since you don’t hold the text_mutex, some might do text_poke(),
e.g., by enabling/disabling static-key, and the update would be
overwritten. No?


Re: [PATCH] iommu/amd: page-specific invalidations for more than one page

2021-04-08 Thread Nadav Amit

> On Apr 8, 2021, at 12:18 AM, Joerg Roedel  wrote:
> 
> Hi Nadav,
> 
> On Wed, Apr 07, 2021 at 05:57:31PM +0000, Nadav Amit wrote:
>> I tested it on real bare-metal hardware. I ran some basic I/O workloads
>> with the IOMMU enabled, checkers enabled/disabled, and so on.
>> 
>> However, I only tested the IOMMU-flushes and I did not test that the
>> device-IOTLB flushes work, since I did not have the hardware for that.
>> 
>> If you can refer me to the old patches, I will have a look and see
>> whether I can see a difference in the logic or test them. If you want
>> me to run different tests - let me know. If you want me to remove
>> the device-IOTLB invalidations logic - that is also fine with me.
> 
> Here is the patch-set, it is from 2010 and against a very old version of
> the AMD IOMMU driver:

Thanks. I looked at your code and I see a difference between the
implementations.

As far as I understand, pages are always assumed to be aligned to their
own sizes. I therefore assume that flushes should regard the lower bits
as a “mask” and not just as encoding of the size.

In the version that you referred me to, iommu_update_domain_tlb() only
regards the size of the region to be flushed and disregards the
alignment:

+   order   = get_order(domain->flush.end - domain->flush.start);
+   mask= (0x1000ULL << order) - 1;
+   address = ((domain->flush.start & ~mask) | (mask >> 1)) & ~0xfffULL;


If you need to flush, for instance, the region 0x1000-0x5000, this
version would use an address|mask of 0x1000 (a 16KB page). The version I
sent regards the alignment, and since the range is not aligned, it would
use an address|mask of 0x3000 (a 32KB page).
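
To make the difference concrete, here is a small stand-alone user-space
sketch of the two encodings for exactly this example (get_order() and
fls64() are mimicked locally for the illustration; they are not the kernel
implementations):

#include <stdio.h>
#include <stdint.h>

/* Rough user-space stand-ins for the kernel helpers, illustration only. */
static int get_order(uint64_t size)            /* order of size in 4KB pages */
{
	int order = 0;

	size = (size - 1) >> 12;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

static int fls64(uint64_t x)                   /* 1-based index of the MSB */
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

int main(void)
{
	uint64_t start = 0x1000, size = 0x4000, end = start + size - 1;

	/* Size-only encoding from the 2010 patch: ignores the alignment. */
	int order = get_order(size);
	uint64_t mask = (0x1000ULL << order) - 1;
	uint64_t size_only = ((start & ~mask) | (mask >> 1)) & ~0xfffULL;

	/* Alignment-aware encoding: set every bit below the most significant
	 * bit that differs between start and end, then clear bits 11:0. */
	int msb_diff = fls64(end ^ start) - 1;
	uint64_t aligned = (start | ((1ULL << msb_diff) - 1)) & ~0xfffULL;

	printf("size-only:       0x%llx (16KB page at 0x0, misses 0x4000-0x4fff)\n",
	       (unsigned long long)size_only);
	printf("alignment-aware: 0x%llx (32KB page at 0x0, covers the whole range)\n",
	       (unsigned long long)aligned);
	return 0;
}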

IIUC, IOVA allocations today are aligned in such a way, but at least in
the past (looking at 3.19, for that matter), it was not always like
that, which can explain the problems.

Thoughts?

Re: [PATCH] iommu/amd: page-specific invalidations for more than one page

2021-04-07 Thread Nadav Amit



> On Apr 7, 2021, at 3:01 AM, Joerg Roedel  wrote:
> 
> On Tue, Mar 23, 2021 at 02:06:19PM -0700, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Currently, IOMMU invalidations and device-IOTLB invalidations using
>> AMD IOMMU fall back to full address-space invalidation if more than a
>> single page needs to be flushed.
>> 
>> Full flushes are especially inefficient when the IOMMU is virtualized by
>> a hypervisor, since they require the hypervisor to synchronize the entire
>> address-space.
>> 
>> AMD IOMMUs allow providing a mask to perform page-specific
>> invalidations for multiple pages that match the address. The mask is
>> encoded as part of the address, and the first zero bit in the address
>> (in bits [51:12]) indicates the mask size.
>> 
>> Use this hardware feature to perform selective IOMMU and IOTLB flushes.
>> Combine the logic between both for better code reuse.
>> 
>> The IOMMU invalidations passed a smoke-test. The device IOTLB
>> invalidations are untested.
> 
> Have you thoroughly tested this on real hardware? I had a patch-set
> doing the same many years ago and it led to data corruption under load.
> Back then it could have been a bug in my code of course, but it made me
> cautious about using targeted invalidations.

I tested it on real bare-metal hardware. I ran some basic I/O workloads
with the IOMMU enabled, checkers enabled/disabled, and so on.

However, I only tested the IOMMU-flushes and I did not test that the
device-IOTLB flushes work, since I did not have the hardware for that.

If you can refer me to the old patches, I will have a look and see
whether I can see a difference in the logic or test them. If you want
me to run different tests - let me know. If you want me to remove
the device-IOTLB invalidations logic - that is also fine with me.



Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-01 Thread Nadav Amit


> On Apr 1, 2021, at 1:38 AM, Mel Gorman  wrote:
> 
> On Wed, Mar 31, 2021 at 09:36:04AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Mar 31, 2021, at 6:16 AM, Mel Gorman  wrote:
>>> 
>>> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>>>> Mel Gorman  writes:
>>>> 
>>>>> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>>>>>> For NUMA balancing, in hint page fault handler, the faulting page will
>>>>>> be migrated to the accessing node if necessary.  During the migration,
>>>>>> TLB will be shot down on all CPUs that the process has run on
>>>>>> recently.  Because in the hint page fault handler, the PTE will be
>>>>>> made accessible before the migration is tried.  The overhead of TLB
>>>>>> shooting down is high, so it's better to be avoided if possible.  In
>>>>>> fact, if we delay mapping the page in PTE until migration, that can be
>>>>>> avoided.  This is what this patch is doing.
>>>>>> 
>>>>> 
>>>>> Why would the overhead be high? It was previously inaccessible so it's
>>>>> only parallel accesses making forward progress that trigger the need
>>>>> for a flush.
>>>> 
>>>> Sorry, I don't understand this.  Although the page is inaccessible, the
>>>> threads may access other pages, so TLB flushing is still necessary.
>>>> 
>>> 
>>> You assert the overhead of TLB shootdown is high and yes, it can be
>>> very high but you also said "the benchmark score has no visible changes"
>>> indicating the TLB shootdown cost is not a major problem for the workload.
>>> It does not mean we should ignore it though.
>> 
>> If you are looking for a benchmark that is negatively affected by NUMA
>> balancing, then IIRC Parsec’s dedup is such a workload. [1]
>> 
> 
> Few questions;
> 
> Is Parsec impaired due to NUMA balancing in general or due to TLB
> shootdowns specifically?

TLB shootdowns specifically.

> 
> Are you using "gcc-pthreads" for parallelisation and the "native" size
> for Parsec?

Native, as it is the biggest workload, so the impact is most apparent
with it. I don’t remember whether I played with the threading-model
parameters.

> 
> Is there any specific thread count that matters either in
> absolute terms or as a percentage of online CPUs?

IIRC, when the thread count matches the number of CPUs (or is perhaps
slightly lower), the impact is the greatest.





Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-03-31 Thread Nadav Amit


> On Mar 31, 2021, at 6:16 AM, Mel Gorman  wrote:
> 
> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>> Mel Gorman  writes:
>> 
>>> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
 For NUMA balancing, in hint page fault handler, the faulting page will
 be migrated to the accessing node if necessary.  During the migration,
 TLB will be shot down on all CPUs that the process has run on
 recently.  Because in the hint page fault handler, the PTE will be
 made accessible before the migration is tried.  The overhead of TLB
 shooting down is high, so it's better to be avoided if possible.  In
 fact, if we delay mapping the page in PTE until migration, that can be
 avoided.  This is what this patch is doing.
 
>>> 
>>> Why would the overhead be high? It was previously inaccessible so it's
>>> only parallel accesses making forward progress that trigger the need
>>> for a flush.
>> 
>> Sorry, I don't understand this.  Although the page is inaccessible, the
>> threads may access other pages, so TLB flushing is still necessary.
>> 
> 
> You assert the overhead of TLB shootdown is high and yes, it can be
> very high but you also said "the benchmark score has no visible changes"
> indicating the TLB shootdown cost is not a major problem for the workload.
> It does not mean we should ignore it though.

If you are looking for a benchmark that is negatively affected by NUMA
balancing, then IIRC Parsec’s dedup is such a workload. [1]

[1] https://parsec.cs.princeton.edu/




Re: A problem of Intel IOMMU hardware ?

2021-03-26 Thread Nadav Amit


> On Mar 26, 2021, at 7:31 PM, Lu Baolu  wrote:
> 
> Hi Nadav,
> 
> On 3/19/21 12:46 AM, Nadav Amit wrote:
>> So here is my guess:
>> Intel probably used as a basis for the IOTLB an implementation of
>> some other (regular) TLB design.
>> Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):
>> "Software wishing to prevent this uncertainty should not write to
>> a paging-structure entry in a way that would change, for any linear
>> address, both the page size and either the page frame, access rights,
>> or other attributes.”
>> Now the aforementioned uncertainty is a bit different (multiple
>> *valid*  translations of a single address). Yet, perhaps this is
>> yet another thing that might happen.
>> From a brief look at the handling of MMU (not IOMMU) hugepages
>> in Linux, indeed the PMD is first cleared and flushed before a
>> new valid PMD is set. This is possible for MMUs since they
>> allow the software to handle spurious page-faults gracefully.
>> This is not the case for the IOMMU though (without PRI).
>> Not sure this explains everything though. If that is the problem,
>> then during a mapping that changes page-sizes, a TLB flush is
>> needed, similarly to the one Longpeng did manually.
> 
> I have been working with Longpeng on this issue these days. It turned
> out that your guess is right. The PMD is first cleared but not flushed
> before a new valid one is set. The previous entry might be cached in the
> paging structure caches hence leads to disaster.
> 
> In __domain_mapping():
> 
> 2352 /*
> 2353  * Ensure that old small page tables are
> 2354  * removed to make room for superpage(s).
> 2355  * We're adding new large pages, so make 
> sure
> 2356  * we don't remove their parent tables.
> 2357  */
> 2358 dma_pte_free_pagetable(domain, iov_pfn, 
> end_pfn,
> 2359 largepage_lvl + 1);
> 
> I guess adding a cache flush operation after PMD switching should solve
> the problem.
> 
> I am still not clear about this comment:
> 
> "
> This is possible for MMUs since they allow the software to handle
> spurious page-faults gracefully. This is not the case for the IOMMU
> though (without PRI).
> "
> 
> Can you please shed more light on this?

I was looking at the code in more detail, and apparently my concern
is incorrect.

I was under the assumption that the IOMMU map/unmap can merge/split
(specifically split) huge-pages. For instance, if you map 2MB and
then unmap 4KB out of the 2MB, then you would split the hugepage
and keep the rest of the mappings alive. This is the way MMU is
usually managed. To my defense, I also saw such partial unmappings
in Longpeng’s first scenario.

If this was possible, then you would have a case in which out of 2MB
(for instance), 4KB were unmapped, and you need to split the 2MB
hugepage into 4KB pages. If you try to clear the PMD, flush, and then
set the PMD to point to table with live 4KB PTES, you can have
an interim state in which the PMD is not present. DMAs that arrive
at this stage might fault, and without PRI (and device support)
you do not have a way of restarting the DMA after the hugepage split
is completed.

Anyhow, this concern is apparently not relevant. I guess I was too
naive to assume the IOMMU management is similar to the MMU. I now
see that there is a comment in intel_iommu_unmap() saying:

/* Cope with horrid API which requires us to unmap more than the
   size argument if it happens to be a large-page mapping. */

Regards,
Nadav




[PATCH] iommu/amd: page-specific invalidations for more than one page

2021-03-23 Thread Nadav Amit
From: Nadav Amit 

Currently, IOMMU invalidations and device-IOTLB invalidations using
AMD IOMMU fall back to full address-space invalidation if more than a
single page needs to be flushed.

Full flushes are especially inefficient when the IOMMU is virtualized by
a hypervisor, since they require the hypervisor to synchronize the entire
address-space.

AMD IOMMUs allow providing a mask to perform page-specific
invalidations for multiple pages that match the address. The mask is
encoded as part of the address, and the first zero bit in the address
(in bits [51:12]) indicates the mask size.

Use this hardware feature to perform selective IOMMU and IOTLB flushes.
Combine the logic between both for better code reuse.

The IOMMU invalidations passed a smoke-test. The device IOTLB
invalidations are untested.
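
As a side note for readers, the address encoding described above can be
decoded with a few lines of user-space C. This is an illustration only, not
kernel code, and it assumes the size bit (S) of the command is set:

#include <stdio.h>
#include <stdint.h>

/* Decode the invalidation size from the address encoding: the position of
 * the first zero bit in bits [51:12] gives the number of bytes invalidated.
 * (All of bits [51:12] set is the "invalidate everything" encoding.) */
static uint64_t inv_region_size(uint64_t address)
{
	int bit;

	for (bit = 12; bit < 52; bit++)
		if (!(address & (1ULL << bit)))
			break;
	return 1ULL << (bit + 1);
}

int main(void)
{
	/* 0x1000: bit 12 set, bit 13 clear -> 16KB region. */
	printf("0x1000 -> 0x%llx bytes\n", (unsigned long long)inv_region_size(0x1000));
	/* 0x3000: bits 12-13 set, bit 14 clear -> 32KB region. */
	printf("0x3000 -> 0x%llx bytes\n", (unsigned long long)inv_region_size(0x3000));
	return 0;
}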

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: io...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 76 +--
 1 file changed, 42 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 9256f84f5ebf..5f2dc3d7f2dc 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -927,33 +927,58 @@ static void build_inv_dte(struct iommu_cmd *cmd, u16 
devid)
CMD_SET_TYPE(cmd, CMD_INV_DEV_ENTRY);
 }
 
-static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
- size_t size, u16 domid, int pde)
+/*
+ * Builds an invalidation address which is suitable for one page or multiple
+ * pages. Sets the size bit (S) as needed if more than one page is flushed.
+ */
+static inline u64 build_inv_address(u64 address, size_t size)
 {
-   u64 pages;
-   bool s;
+   u64 pages, end, msb_diff;
 
pages = iommu_num_pages(address, size, PAGE_SIZE);
-   s = false;
 
-   if (pages > 1) {
+   if (pages == 1)
+   return address & PAGE_MASK;
+
+   end = address + size - 1;
+
+   /*
+* msb_diff would hold the index of the most significant bit that
+* flipped between the start and end.
+*/
+   msb_diff = fls64(end ^ address) - 1;
+
+   /*
+* Bits 63:52 are sign extended. If for some reason bit 51 is different
+* between the start and the end, invalidate everything.
+*/
+   if (unlikely(msb_diff > 51)) {
+   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
+   } else {
/*
-* If we have to flush more than one page, flush all
-* TLB entries for this domain
+* The msb-bit must be clear on the address. Just set all the
+* lower bits.
 */
-   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
-   s = true;
+   address |= (1ull << msb_diff) - 1;
}
 
+   /* Clear bits 11:0 */
address &= PAGE_MASK;
 
+   /* Set the size bit - we flush more than one 4kb page */
+   return address | CMD_INV_IOMMU_PAGES_SIZE_MASK;
+}
+
+static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
+ size_t size, u16 domid, int pde)
+{
+   u64 inv_address = build_inv_address(address, size);
+
memset(cmd, 0, sizeof(*cmd));
cmd->data[1] |= domid;
-   cmd->data[2]  = lower_32_bits(address);
-   cmd->data[3]  = upper_32_bits(address);
+   cmd->data[2]  = lower_32_bits(inv_address);
+   cmd->data[3]  = upper_32_bits(inv_address);
CMD_SET_TYPE(cmd, CMD_INV_IOMMU_PAGES);
-   if (s) /* size bit - we flush more than one 4kb page */
-   cmd->data[2] |= CMD_INV_IOMMU_PAGES_SIZE_MASK;
if (pde) /* PDE bit - we want to flush everything, not only the PTEs */
cmd->data[2] |= CMD_INV_IOMMU_PAGES_PDE_MASK;
 }
@@ -961,32 +986,15 @@ static void build_inv_iommu_pages(struct iommu_cmd *cmd, 
u64 address,
 static void build_inv_iotlb_pages(struct iommu_cmd *cmd, u16 devid, int qdep,
  u64 address, size_t size)
 {
-   u64 pages;
-   bool s;
-
-   pages = iommu_num_pages(address, size, PAGE_SIZE);
-   s = false;
-
-   if (pages > 1) {
-   /*
-* If we have to flush more than one page, flush all
-* TLB entries for this domain
-*/
-   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
-   s = true;
-   }
-
-   address &= PAGE_MASK;
+   u64 inv_address = build_inv_address(address, size);
 
memset(cmd, 0, sizeof(*cmd));
cmd->data[0]  = devid;
cmd->data[0] |= (qdep & 0xff) << 24;
cmd->data[1]  = devid;
-   cmd->data[2]  = lower_32_bits(address);
-   cmd->data[3]  = upper_32_bits(address);
+   cmd->data[2]  = l

Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Nadav Amit


> On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> 
> 
>> -Original Message-
>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
>> Sent: Thursday, March 18, 2021 4:56 PM
>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>> ; Nadav Amit 
>> Cc: chenjiashang ; David Woodhouse
>> ; io...@lists.linux-foundation.org; LKML
>> ; alex.william...@redhat.com; Gonglei (Arei)
>> ; w...@kernel.org
>> Subject: RE: A problem of Intel IOMMU hardware ?
>> 
>>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>> 
>>> 
>>>> -Original Message-
>>>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
>>>> Sent: Thursday, March 18, 2021 4:27 PM
>>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>>> ; Nadav Amit 
>>>> Cc: chenjiashang ; David Woodhouse
>>>> ; io...@lists.linux-foundation.org; LKML
>>>> ; alex.william...@redhat.com; Gonglei
>>> (Arei)
>>>> ; w...@kernel.org
>>>> Subject: RE: A problem of Intel IOMMU hardware ?
>>>> 
>>>>> From: iommu  On Behalf
>>>>> Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>>>> 
>>>>>> 2. Consider ensuring that the problem is not somehow related to
>>>>>> queued invalidations. Try to use __iommu_flush_iotlb() instead
>>>>>> of
>>>> qi_flush_iotlb().
>>>>>> 
>>>>> 
>>>>> I tried to force to use __iommu_flush_iotlb(), but maybe something
>>>>> wrong, the system crashed, so I prefer to lower the priority of
>>>>> this
>>> operation.
>>>>> 
>>>> 
>>>> The VT-d spec clearly says that register-based invalidation can be
>>>> used only
>>> when
>>>> queued-invalidations are not enabled. Intel-IOMMU driver doesn't
>>>> provide
>>> an
>>>> option to disable queued-invalidation though, when the hardware is
>>> capable. If you
>>>> really want to try, tweak the code in intel_iommu_init_qi.
>>>> 
>>> 
>>> Hi Kevin,
>>> 
>>> Thanks to point out this. Do you have any ideas about this problem ? I
>>> tried to descript the problem much clear in my reply to Alex, hope you
>>> could have a look if you're interested.
>>> 
>> 
>> btw I saw you used 4.18 kernel in this test. What about latest kernel?
>> 
> 
> Not test yet. It's hard to upgrade kernel in our environment.
> 
>> Also one way to separate sw/hw bug is to trace the low level interface (e.g.,
>> qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU
>> hardware. Check the window between b) and c) and see whether the software 
>> does
>> the right thing as expected there.
>> 
> 
> We add some log in iommu driver these days, the software seems fine. But we
> didn't look inside the qi_submit_sync yet, I'll try it tonight.

So here is my guess:

Intel probably used as a basis for the IOTLB an implementation of
some other (regular) TLB design.

Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):

"Software wishing to prevent this uncertainty should not write to
a paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame, access rights,
or other attributes.”


Now the aforementioned uncertainty is a bit different (multiple
*valid* translations of a single address). Yet, perhaps this is
yet another thing that might happen.

From a brief look at the handling of MMU (not IOMMU) hugepages
in Linux, indeed the PMD is first cleared and flushed before a
new valid PMD is set. This is possible for MMUs since they
allow the software to handle spurious page-faults gracefully.
This is not the case for the IOMMU though (without PRI).

Not sure this explains everything though. If that is the problem,
then during a mapping that changes page-sizes, a TLB flush is
needed, similarly to the one Longpeng did manually.






Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Nadav Amit

> On Mar 17, 2021, at 9:46 PM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 

[Snip]

> 
> NOTE, the magical thing happen...(*Operation-4*) we write the PTE
> of Operation-1 from 0 to 0x3 which means can Read/Write, and then
> we trigger DMA read again, it success and return the data of HPA 0 !!
> 
> Why we modify the older page table would make sense ? As we
> have discussed previously, the cache flush part of the driver is correct,
> it call flush_iotlb after (b) and no need to flush after (c). But the result
> of the experiment shows the older page table or older caches is effective
> actually.
> 
> Any ideas ?

Interesting. Sounds as if there is some page-walk cache that was not
invalidated properly.





Re: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Nadav Amit


> On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> Hi Nadav,
> 
>> -Original Message-
>> From: Nadav Amit [mailto:nadav.a...@gmail.com]
>>>  reproduce the problem with high probability (~50%).
>> 
>> I saw Lu replied, and he is much more knowledgeable than I am (I was just
>> intrigued
>> by your email).
>> 
>> However, if I were you I would try also to remove some “optimizations” to 
>> look for
>> the root-cause (e.g., use domain specific invalidations instead of 
>> page-specific).
>> 
> 
> Good suggestion! But we did it these days, we tried to use global 
> invalidations as follow:
>   iommu->flush.flush_iotlb(iommu, did, 0, 0,
>   DMA_TLB_DSI_FLUSH);
> But can not resolve the problem.
> 
>> The first thing that comes to my mind is the invalidation hint (ih) in
>> iommu_flush_iotlb_psi(). I would remove it to see whether you get the failure
>> without it.
> 
> We also notice the IH, but the IH is always ZERO in our case, as the spec 
> says:
> '''
> Paging-structure-cache entries caching second-level mappings associated with 
> the specified
> domain-id and the second-level-input-address range are invalidated, if the 
> Invalidation Hint
> (IH) field is Clear.
> '''
> 
> It seems the software is everything fine, so we've no choice but to suspect 
> the hardware.

Ok, I am pretty much out of ideas. I have two more suggestions, but
they are much less likely to help. Yet, they can further help to rule
out software bugs:

1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE()
to prevent a split write, which might potentially cause an “invalid” (partially
cleared) PTE to be stored in the TLB (see the sketch after this list). Having
said that, the subsequent IOTLB flush should have prevented the problem.

2. Consider ensuring that the problem is not somehow related to queued
invalidations. Try to use __iommu_flush_iotlb() instead of
qi_flush_iotlb().
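
Regarding (1), a minimal user-space sketch of the idea, with WRITE_ONCE()
reproduced here as a plain volatile store. This illustrates the idiom only;
it is not the intel-iommu code:

#include <stdio.h>
#include <stdint.h>

#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct dma_pte {
	uint64_t val;
};

static void dma_clear_pte(struct dma_pte *pte)
{
	/* A single volatile access: the compiler is kept from tearing the
	 * store into smaller writes, so a concurrent walker never observes
	 * a half-cleared entry. (No memory barrier is implied, though.) */
	WRITE_ONCE(pte->val, 0);
}

int main(void)
{
	struct dma_pte pte = { .val = 0x123000ULL | 0x3 };

	dma_clear_pte(&pte);
	printf("pte after clear: 0x%llx\n", (unsigned long long)pte.val);
	return 0;
}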

Regards,
Nadav




Re: A problem of Intel IOMMU hardware ?

2021-03-16 Thread Nadav Amit


> On Mar 16, 2021, at 8:16 PM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> Hi guys,
> 
> We find the Intel iommu cache (i.e. iotlb) maybe works wrong in a special
> situation, it would cause DMA fails or get wrong data.
> 
> The reproducer (based on Alex's vfio testsuite[1]) is in attachment, it can
> reproduce the problem with high probability (~50%).

I saw Lu replied, and he is much more knowledgeable than I am (I was just
intrigued by your email).

However, if I were you I would try also to remove some “optimizations” to
look for the root-cause (e.g., use domain specific invalidations instead
of page-specific).

The first thing that comes to my mind is the invalidation hint (ih) in
iommu_flush_iotlb_psi(). I would remove it to see whether you get the
failure without it.





[tip: x86/mm] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 6035152d8eebe16a5bb60398d3e05dc7799067b0
Gitweb:
https://git.kernel.org/tip/6035152d8eebe16a5bb60398d3e05dc7799067b0
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:06 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:09 +01:00

x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows using a
preallocated cpumask to compute the CPUs that should be flushed.

This would later allow us not to adapt on_each_cpu_cond_mask() to
support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi() might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-4-na...@vmware.com
---
 arch/x86/mm/tlb.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index bf12371..07b6701 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -788,11 +788,13 @@ done:
nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 STATIC_NOPV void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info)
 {
@@ -813,12 +815,37 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
cpumask *cpumask,
 * up on the new contents of what used to be page tables, while
 * doing a speculative memory access.
 */
-   if (info->freed_tables)
+   if (info->freed_tables) {
smp_call_function_many(cpumask, flush_tlb_func,
   (void *)info, 1);
-   else
-   on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
-   (void *)info, 1, cpumask);
+   } else {
+   /*
+* Although we could have used on_each_cpu_cond_mask(),
+* open-coding it has performance advantages, as it eliminates
+* the need for indirect calls or retpolines. In addition, it
+* allows to use a designated cpumask for evaluating the
+* condition, instead of allocating one.
+*
+* This code works under the assumption that there are no nested
+* TLB flushes, an assumption that is already made in
+* flush_tlb_mm_range().
+*
+* cond_cpumask is logically a stack-local variable, but it is
+* more efficient to have it off the stack and not to allocate
+* it on demand. Preemption is disabled and this code is
+* non-reentrant.
+*/
+   struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+   int cpu;
+
+   cpumask_clear(cond_cpumask);
+
+   for_each_cpu(cpu, cpumask) {
+   if (tlb_is_not_lazy(cpu))
+   __cpumask_set_cpu(cpu, cond_cpumask);
+   }
+   smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
*)info, 1);
+   }
 }
 
 void flush_tlb_others(const struct cpumask *cpumask,


[tip: x86/mm] smp: Run functions concurrently in smp_call_function_many_cond()

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: a32a4d8a815c4eb6dc64b8962dc13a9dfae70868
Gitweb:
https://git.kernel.org/tip/a32a4d8a815c4eb6dc64b8962dc13a9dfae70868
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:04 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:09 +01:00

smp: Run functions concurrently in smp_call_function_many_cond()

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, modify smp_call_function_many_cond() to allow the callers to
provide a function that should be executed (remotely/locally), and run
them concurrently. Keep the other smp_call_function_many() semantics as
they are today for backward compatibility: in this case the called
function is not executed locally.

smp_call_function_many_cond() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-2-na...@vmware.com
---
 kernel/smp.c | 156 --
 1 file changed, 88 insertions(+), 68 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index aeb0adf..c8a5a1f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -608,12 +608,28 @@ call:
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
+/*
+ * Flags to be used as scf_flags argument of smp_call_function_many_cond().
+ *
+ * %SCF_WAIT:  Wait until function execution is completed
+ * %SCF_RUN_LOCAL: Run also locally if local cpu is set in cpumask
+ */
+#define SCF_WAIT   (1U << 0)
+#define SCF_RUN_LOCAL  (1U << 1)
+
 static void smp_call_function_many_cond(const struct cpumask *mask,
smp_call_func_t func, void *info,
-   bool wait, smp_cond_func_t cond_func)
+   unsigned int scf_flags,
+   smp_cond_func_t cond_func)
 {
+   int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
-   int cpu, next_cpu, this_cpu = smp_processor_id();
+   bool wait = scf_flags & SCF_WAIT;
+   bool run_remote = false;
+   bool run_local = false;
+   int nr_cpus = 0;
+
+   lockdep_assert_preemption_disabled();
 
/*
 * Can deadlock when called with interrupts disabled.
@@ -621,8 +637,9 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 * send smp call function interrupt to this cpu and as such deadlocks
 * can't happen.
 */
-   WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-&& !oops_in_progress && !early_boot_irqs_disabled);
+   if (cpu_online(this_cpu) && !oops_in_progress &&
+   !early_boot_irqs_disabled)
+   lockdep_assert_irqs_enabled();
 
/*
 * When @wait we can deadlock when we interrupt between llist_add() and
@@ -632,60 +649,65 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 */
WARN_ON_ONCE(!in_task());
 
-   /* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+   /* Check if we need local execution. */
+   if ((scf_flags & SCF_RUN_LOCAL) && cpumask_test_cpu(this_cpu, mask))
+   run_local = true;
+
+   /* Check if we need remote execution, i.e., any CPU excluding this one. 
*/
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+   if (cpu < nr_cpu_ids)
+   run_remote = true;
 
-   /* No online cpus?  We're done. */
-   if (cpu >= nr_cpu_ids)
-   return;
-
-   /* Do we have another CPU which isn't us? */
-   next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-   if (next_cpu == this_cpu)
-   next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-   /* Fastpath: do that cpu by itself. */
-   if (next_cpu >= nr_cpu_ids) {
-   if (!cond_func || cond_func(cpu, 

[tip: x86/mm] x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 4c1ba3923e6c8aa736e40f481a278c21b956c072
Gitweb:
https://git.kernel.org/tip/4c1ba3923e6c8aa736e40f481a278c21b956c072
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:05 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:09 +01:00

x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

The unification of these two functions allows using them in the updated
SMP infrastructure.

To do so, remove the reason argument from flush_tlb_func_local(), add
a member to struct flush_tlb_info that says which CPU initiated the
flush and act accordingly. Optimize the size of flush_tlb_info while we
are at it.

Unfortunately, this prevents us from using a constant flush_tlb_info for
arch_tlbbatch_flush(), but in a later stage we may be able to inline
flush_tlb_info into the IPI data, so it should not have an impact
eventually.

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-3-na...@vmware.com
---
 arch/x86/include/asm/tlbflush.h |  5 +-
 arch/x86/mm/tlb.c   | 81 ++--
 2 files changed, 39 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e..a7a598a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -201,8 +201,9 @@ struct flush_tlb_info {
unsigned long   start;
unsigned long   end;
u64 new_tlb_gen;
-   unsigned intstride_shift;
-   boolfreed_tables;
+   unsigned intinitiating_cpu;
+   u8  stride_shift;
+   u8  freed_tables;
 };
 
 void flush_tlb_local(void);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d..bf12371 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -439,7 +439,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 */
 
-   /* We don't want flush_tlb_func_* to run concurrently with us. */
+   /* We don't want flush_tlb_func() to run concurrently with us. */
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());
 
@@ -647,14 +647,13 @@ void initialize_tlbstate_and_flush(void)
 }
 
 /*
- * flush_tlb_func_common()'s memory ordering requirement is that any
+ * flush_tlb_func()'s memory ordering requirement is that any
  * TLB fills that happen after we flush the TLB are ordered after we
  * read active_mm's tlb_gen.  We don't need any explicit barriers
  * because all x86 flush operations are serializing and the
  * atomic64_read operation won't be reordered by the compiler.
  */
-static void flush_tlb_func_common(const struct flush_tlb_info *f,
- bool local, enum tlb_flush_reason reason)
+static void flush_tlb_func(void *info)
 {
/*
 * We have three different tlb_gen values in here.  They are:
@@ -665,14 +664,26 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * - f->new_tlb_gen: the generation that the requester of the flush
 *   wants us to catch up to.
 */
+   const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
u64 local_tlb_gen = 
this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+   bool local = smp_processor_id() == f->initiating_cpu;
+   unsigned long nr_invalidate = 0;
 
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
 
+   if (!local) {
+   inc_irq_stat(irq_tlb_count);
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+
+   /* Can only happen on remote CPUs */
+   if (f->mm && f->mm != loaded_mm)
+   return;
+   }
+
if (unlikely(loaded_mm == &init_mm))
return;
 
@@ -700,8 +711,7 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * be handled can catch us all the way up, leaving no work for
 * the second flush.
 */
-   trace_tlb_flush(reason, 0);
-   return;
+   goto done;
}
 
WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
@@ -748,46 +758,34 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {
/* Partial flush */
-   unsigned

[tip: x86/mm] x86/mm/tlb: Privatize cpu_tlbstate

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 2f4305b19fe6a2a261d76c21856c5598f7d878fe
Gitweb:
https://git.kernel.org/tip/2f4305b19fe6a2a261d76c21856c5598f7d878fe
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:08 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:10 +01:00

x86/mm/tlb: Privatize cpu_tlbstate

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.
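
A generic user-space sketch of the structural idea (the names below are
illustrative, not the kernel definitions): the one remotely-read flag gets
its own cache-line-aligned structure, so remote readers do not contend with
the owner's frequent updates of the private fields.

#include <stdalign.h>
#include <stdbool.h>
#include <stdio.h>

/* Mostly-private per-CPU state: written frequently, by the owning CPU only. */
struct tlb_state_private {
	unsigned int loaded_mm_asid;
	unsigned int next_asid;
};

/* The flag that remote CPUs read while deciding whom to IPI lives on its
 * own cache line, mirroring the cpu_tlbstate_shared split in this patch. */
struct tlb_state_shared_sketch {
	alignas(64) bool is_lazy;
};

int main(void)
{
	printf("shared part padded to %zu bytes\n",
	       sizeof(struct tlb_state_shared_sketch));
	return 0;
}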

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-6-na...@vmware.com
---
 arch/x86/include/asm/tlbflush.h | 39 +---
 arch/x86/kernel/alternative.c   |  2 +-
 arch/x86/mm/init.c  |  2 +-
 arch/x86/mm/tlb.c   | 17 --
 4 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3c6681d..fa952ea 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -90,23 +90,6 @@ struct tlb_state {
u16 next_asid;
 
/*
-* We can be in one of several states:
-*
-*  - Actively using an mm.  Our CPU's bit will be set in
-*mm_cpumask(loaded_mm) and is_lazy == false;
-*
-*  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
-*will not be set in mm_cpumask(&init_mm) and is_lazy == false.
-*
-*  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
-*is set in mm_cpumask(loaded_mm), but is_lazy == true.
-*We're heuristically guessing that the CR3 load we
-*skipped more than makes up for the overhead added by
-*lazy mode.
-*/
-   bool is_lazy;
-
-   /*
 * If set we changed the page tables in such a way that we
 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
 * This tells us to go invalidate all the non-loaded ctxs[]
@@ -151,7 +134,27 @@ struct tlb_state {
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+   /*
+* We can be in one of several states:
+*
+*  - Actively using an mm.  Our CPU's bit will be set in
+*mm_cpumask(loaded_mm) and is_lazy == false;
+*
+*  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+*will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+*
+*  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+*is set in mm_cpumask(loaded_mm), but is_lazy == true.
+*We're heuristically guessing that the CR3 load we
+*skipped more than makes up for the overhead added by
+*lazy mode.
+*/
+   bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 bool nmi_uaccess_okay(void);
 #define nmi_uaccess_okay nmi_uaccess_okay
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e4..94649f8 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -813,7 +813,7 @@ static inline temp_mm_state_t use_temporary_mm(struct 
mm_struct *mm)
 * with a stale address space WITHOUT being in lazy mode after
 * restoring the previous mm.
 */
-   if (this_cpu_read(cpu_tlbstate.is_lazy))
+   if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
leave_mm(smp_processor_id());
 
temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dd694fb..ed2e367 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1017,7 +1017,7 @@ void __init zone_sizes_init(void)
free_area_init(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
.next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8db87cd..345a0af 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -300,7 +300,7 @@ void leave_mm(int cpu)
return;
 
/* Warn if we're not lazy. */
-   WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+   WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
switch_mm(NULL, &init_mm, NULL);
 }
@@ -424,7 +424,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid

[tip: x86/mm] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 4ce94eabac16b1d2c95762b40f49e5654ab288d7
Gitweb:
https://git.kernel.org/tip/4ce94eabac16b1d2c95762b40f49e5654ab288d7
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:07 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:10 +01:00

x86/mm/tlb: Flush remote and local TLBs concurrently

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-5-na...@vmware.com
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h   |  4 +-
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 --
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 46 --
 arch/x86/xen/mmu_pv.c | 11 ++
 include/trace/events/xen.h|  2 +-
 10 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350..681dba8 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,8 +52,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -61,7 +61,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -164,7 +164,7 @@ check_status:
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -239,6 +239,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110..45b55e3 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -50,7 +50,7 @@ static inline void slow_down_io(void)
 void native_flush_tlb_local(void);
 void native_flush_tlb_global(void);
 void native_flush_tlb_one_user(unsigned long addr);
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
 static inline void __flush_tlb_local(void)
@@ -68,10 +68,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void __flush_tlb_others(const struct cpumask *cpumask,
+static inline void __flush_tlb_multi(const struct cpumask *cpumask,
  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index de87087..b7b35d5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -188,8 +188,8 @@ struct pv_mmu_ops 

[tip: x86/mm] x86/mm/tlb: Remove unnecessary uses of the inline keyword

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 1608e4cf31b88c8c448ce13aa1d77969dda6bdb7
Gitweb:
https://git.kernel.org/tip/1608e4cf31b88c8c448ce13aa1d77969dda6bdb7
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:11 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:10 +01:00

x86/mm/tlb: Remove unnecessary uses of the inline keyword

The compiler is smart enough without these hints.

Suggested-by: Dave Hansen 
Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-9-na...@vmware.com
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 17ec4bf..f4b162f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -316,7 +316,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct 
*next,
local_irq_restore(flags);
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -880,7 +880,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, 
flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
@@ -907,7 +907,7 @@ static inline struct flush_tlb_info 
*get_flush_tlb_info(struct mm_struct *mm,
return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
/* Complete reentrency prevention checks */


[tip: x86/mm] cpumask: Mark functions as pure

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 291c4011dd7ac0cd0cebb727a75ee5a50d16dcf7
Gitweb:
https://git.kernel.org/tip/291c4011dd7ac0cd0cebb727a75ee5a50d16dcf7
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:10 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:10 +01:00

cpumask: Mark functions as pure

cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().
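
For readers unfamiliar with the attribute, a tiny user-space illustration of
what __pure permits (this is not the kernel code): a pure function has no
side effects and its result depends only on its arguments and global memory,
so the compiler may merge repeated calls with the same arguments.

#include <stdio.h>

/* Pure: no side effects; the result depends only on the arguments. */
static int __attribute__((pure)) next_set_bit(unsigned long long mask, int n)
{
	for (int i = n + 1; i < 64; i++)
		if (mask & (1ULL << i))
			return i;
	return 64;
}

int main(void)
{
	unsigned long long mask = 0xf0;

	/* With the pure attribute the compiler is free to evaluate
	 * next_set_bit(mask, -1) once and reuse the result below. */
	if (next_set_bit(mask, -1) < 64)
		printf("first set bit: %d\n", next_set_bit(mask, -1));
	return 0;
}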

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-8-na...@vmware.com
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 383684e..c53364c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -235,7 +235,7 @@ static inline unsigned int cpumask_last(const struct 
cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -252,8 +252,8 @@ static inline unsigned int cpumask_next_zero(int n, const 
struct cpumask *srcp)
return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int __pure cpumask_next_and(int n, const struct cpumask *, const struct 
cpumask *);
+int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
   const struct cpumask *src2p);


[tip: x86/mm] smp: Inline on_each_cpu_cond() and on_each_cpu()

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: a5aa5ce300597224ec76dacc8e63ba3ad7a18bbd
Gitweb:
https://git.kernel.org/tip/a5aa5ce300597224ec76dacc8e63ba3ad7a18bbd
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:12 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:10 +01:00

smp: Inline on_each_cpu_cond() and on_each_cpu()

Simplify the code and avoid having an additional function on the stack
by inlining on_each_cpu_cond() and on_each_cpu().

Suggested-by: Peter Zijlstra 
Signed-off-by: Nadav Amit 
[ Minor edits. ]
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20210220231712.2475218-10-na...@vmware.com
---
 include/linux/smp.h | 50 ---
 kernel/smp.c| 56 +
 kernel/up.c | 38 +--
 3 files changed, 37 insertions(+), 107 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 70c6f62..84a0b48 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,30 +50,52 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 int wait);
 
+void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
+  void *info, bool wait, const struct cpumask *mask);
+
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+
 /*
  * Call a function on all processors
  */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, cpu_online_mask);
+}
 
-/*
- * Call a function on processors specified by mask, which might include
- * the local one.
+/**
+ * on_each_cpu_mask(): Run a function on processors specified by
+ * cpumask, which may include the local processor.
+ * @mask: The set of cpus to run on (only runs on online subset).
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed
+ *on other CPUs.
+ *
+ * If @wait is true, then returns once @func has returned.
+ *
+ * You must not call this function with disabled interrupts or from a
+ * hardware interrupt handler or from a bottom half handler.  The
+ * exception is that it may be used during early boot while
+ * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-   void *info, bool wait);
+static inline void on_each_cpu_mask(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, mask);
+}
 
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
- * processor.
+ * processor.  May be used during early boot while early_boot_irqs_disabled is
+ * set. Use local_irq_save/restore() instead of local_irq_disable/enable().
  */
-void on_each_cpu_cond(smp_cond_func_t cond_func, smp_call_func_t func,
- void *info, bool wait);
-
-void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
-  void *info, bool wait, const struct cpumask *mask);
-
-int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+static inline void on_each_cpu_cond(smp_cond_func_t cond_func,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(cond_func, func, info, wait, cpu_online_mask);
+}
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/smp.c b/kernel/smp.c
index c8a5a1f..b6375d7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -848,55 +848,6 @@ void __init smp_init(void)
 }
 
 /*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait)
-{
-   unsigned long flags;
-
-   preempt_disable();
-   smp_call_function(func, info, wait);
-   local_irq_save(flags);
-   func(info);
-   local_irq_restore(flags);
-   preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
-/**
- * on_each_cpu_mask(): Run a function on processors specified by
- * cpumask, which may include the local processor.
- * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
- * @info: An arbitrary pointer to pass to the function.
- * @wait: If true, wait (atomically) until function has completed
- *on other CPUs.
- *
- * If @wait is true, then returns once @func has returned.
- *
- * You must not call

[tip: x86/mm] x86/mm/tlb: Do not make is_lazy dirty for no reason

2021-03-06 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 09c5272e48614a30598e759c3c7bed126d22037d
Gitweb:
https://git.kernel.org/tip/09c5272e48614a30598e759c3c7bed126d22037d
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:09 -08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:59:10 +01:00

x86/mm/tlb: Do not make is_lazy dirty for no reason

Blindly writing to is_lazy when the written value is identical to the
old value makes the cacheline dirty for no reason. Avoid such writes to
prevent unnecessary cache coherency traffic.
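
As a generic illustration of the pattern (a sketch, not the patch itself;
the helper below is made up), the store is made conditional so that a
cacheline read by other CPUs is not moved to the Modified state on every
call:

#include <linux/types.h>

static inline void set_shared_flag(bool *flag, bool value)
{
        /* Only write when the value actually changes: a redundant store
         * would still dirty the cacheline and cause coherency traffic. */
        if (*flag != value)
                *flag = value;
}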

Suggested-by: Dave Hansen 
Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-7-na...@vmware.com
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 345a0af..17ec4bf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -469,7 +469,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+   if (was_lazy)
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and


[PATCH v4] mm/userfaultfd: fix memory corruption due to writeprotect

2021-03-04 Thread Nadav Amit
From: Nadav Amit 

Userfaultfd self-test fails occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from
the TLBs in wp_page_copy(), this flush takes place after the copy has
already been performed, and therefore changes of the page are possible
between the time of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses
a problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural
PTE permission: when userfaultfd_writeprotect() unprotects memory
region, it unintentionally *clears* the RW-bit if it was already set.
Note that unprotecting a PTE that is not write-protected is a valid
use-case: the userfaultfd monitor might ask to unprotect a region that
holds both write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0cpu1cpu2

[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
[ page-fault ]
...
wp_page_copy()
 cow_user_page()
  [ copy page ]
[ write to old
  page ]
...
 set_pte_at_notify()

A similar scenario can happen:

cpu0cpu1cpu2cpu3

[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
userfaultfd_writeprotect()
[ write-unprotect ]
[ deferred TLB flush]
[ page-fault ]
wp_page_copy()
 cow_user_page()
 [ copy page ]
 ...[ write to page ]
set_pte_at_notify()

This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became
apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification")
which made wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid, during uffd-write-unprotect,
unnecessary PTE write-protection and TLB flushes.
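
For reference, the deferred-flush bookkeeping that the new check relies on
looks roughly as follows (a simplified sketch, not the actual kernel paths;
writeprotect_with_deferred_flush() and wp_fault_barrier() are made-up names
for illustration):

#include <linux/mm.h>
#include <linux/userfaultfd_k.h>
#include <asm/tlbflush.h>

/* Writer side, e.g. mwriteprotect_range() under mmap_read_lock(): */
static void writeprotect_with_deferred_flush(struct vm_area_struct *vma,
                                             unsigned long start,
                                             unsigned long end)
{
        struct mm_struct *mm = vma->vm_mm;

        inc_tlb_flush_pending(mm);      /* announce the deferred flush */
        /* ... clear the write bit in the PTEs of [start, end) ... */
        flush_tlb_range(vma, start, end);
        dec_tlb_flush_pending(mm);
}

/*
 * Fault side, do_wp_page() with this patch applied: while a flush is
 * still pending, stale writable TLB entries may exist, so flush the
 * faulting address before copying the page.
 */
static void wp_fault_barrier(struct vm_fault *vmf)
{
        if (unlikely(userfaultfd_wp(vmf->vma) &&
                     mm_tlb_flush_pending(vmf->vma->vm_mm)))
                flush_tlb_page(vmf->vma, vmf->address);
}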

Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: sta...@vger.kernel.org # 5.9+
Suggested-by: Yu Zhao 
Reviewed-by: Peter Xu 
Tested-by: Peter Xu 
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit 

---
v3->v4:
* Fix the "Fixes" tag for real [Peter Xu]
* Reviewed-by, suggested-by tags [Peter Xu]
* Adding unlikely() [Peter Xu]

v2->v3:
* Do not acquire mmap_lock for write, flush conditionally instead [Yu]
* Change the fixes tag to the patch that made the race apparent [Yu]
* Removing patch to avoid write-protect on uffd unprotect. More
  comprehensive solution to follow (and avoid the TLB flush as well).
---
 mm/memory.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9e8576a83147..79253cb3bcd5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3092,6 +3092,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_WP);
}
 
+   /*
+* Userfaultfd write-protect can defer flushes. Ensure the TLB
+* is flushed in this case before copying.
+*/
+   if (unlikely(userfaultfd_wp(vmf->vma) &&
+mm_tlb_flush_pending(vmf->vma->vm_mm)))
+   flush_tlb_page(vmf->vma, vmf->address);

Re: [PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect

2021-03-03 Thread Nadav Amit


> On Mar 3, 2021, at 11:03 AM, Peter Xu  wrote:
> 
> On Wed, Mar 03, 2021 at 01:57:02AM -0800, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Userfaultfd self-test fails occasionally, indicating a memory
>> corruption.
> 
> It's failing very constantly now for me after I got it run on a 40 cores
> system...  While indeed not easy to fail on my laptop.
> 

It fails rather constantly for me, but since nobody else reproduced it,
I was afraid to say otherwise ;-)

> 
>> Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
>> Signed-off-by: Nadav Amit 
>> 
>> ---
>> v2->v3:
>> * Do not acquire mmap_lock for write, flush conditionally instead [Yu]
>> * Change the fixes tag to the patch that made the race apparent [Yu]
> 
> Did you forget about this one?  It would still be good to point to 
> 09854ba94c6a
> just to show that 5.7/5.8 stable branches shouldn't need this patch as they're
> not prone to the tlb data curruption.  Maybe also cc stable with 5.9+?

The fixes tag is wrong, as you say. I will fix it and cc stable with 5.9+.

> 
>> * Removing patch to avoid write-protect on uffd unprotect. More
>>  comprehensive solution to follow (and avoid the TLB flush as well).
>> ---
>> mm/memory.c | 7 +++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 9e8576a83147..06da04f98936 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3092,6 +3092,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>  return handle_userfault(vmf, VM_UFFD_WP);
>>  }
>> 
>> +/*
>> + * Userfaultfd write-protect can defer flushes. Ensure the TLB
>> + * is flushed in this case before copying.
>> + */
>> +if (userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))
>> +flush_tlb_page(vmf->vma, vmf->address);
>> +
>>  vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>>  if (!vmf->page) {
>>  /*
>> --
>> 2.25.1
>> 
> 
> Thanks for being consistent on fixing this problem.
> 
> Maybe it's even better to put that into a "unlikely" to reduce the affect of
> normal do_wp_page as much as possible?  But I'll leave it to others.
> 
> If with the fixes tag modified:
> 
> Reviewed-by: Peter Xu 
> Tested-by: Peter Xu 

Thanks, I will send v4 later today.

Regards,
Nadav



signature.asc
Description: Message signed with OpenPGP


[PATCH v3] mm/userfaultfd: fix memory corruption due to writeprotect

2021-03-03 Thread Nadav Amit
From: Nadav Amit 

Userfaultfd self-test fails occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from
the TLBs in wp_page_copy(), this flush takes place after the copy has
already been performed, and therefore changes of the page are possible
between the time of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses
a problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural
PTE permission: when userfaultfd_writeprotect() unprotects memory
region, it unintentionally *clears* the RW-bit if it was already set.
Note that unprotecting a PTE that is not write-protected is a valid
use-case: the userfaultfd monitor might ask to unprotect a region that
holds both write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0cpu1cpu2

[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
[ page-fault ]
...
wp_page_copy()
 cow_user_page()
  [ copy page ]
[ write to old
  page ]
...
 set_pte_at_notify()

A similar scenario can happen:

cpu0cpu1cpu2cpu3

[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
userfaultfd_writeprotect()
[ write-unprotect ]
[ deferred TLB flush]
[ page-fault ]
wp_page_copy()
 cow_user_page()
 [ copy page ]
 ...[ write to page ]
set_pte_at_notify()

This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became
apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification")
which made wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow, since currently write-unprotect would
also unnecessarily write-protect the PTEs and flush the TLB.

Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Peter Xu 
Cc: Pavel Emelyanov 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: sta...@vger.kernel.org
Suggested-by: Yu Zhao 
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit 

---
v2->v3:
* Do not acquire mmap_lock for write, flush conditionally instead [Yu]
* Change the fixes tag to the patch that made the race apparent [Yu]
* Removing patch to avoid write-protect on uffd unprotect. More
  comprehensive solution to follow (and avoid the TLB flush as well).
---
 mm/memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9e8576a83147..06da04f98936 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3092,6 +3092,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_WP);
}
 
+   /*
+* Userfaultfd write-protect can defer flushes. Ensure the TLB
+* is flushed in this case before copying.
+*/
+   if (userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))
+   flush_tlb_page(vmf->vma, vmf->address);
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
-- 
2.25.1



[PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect

2021-03-03 Thread Nadav Amit
From: Nadav Amit 

Userfaultfd self-test fails occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from
the TLBs in wp_page_copy(), this flush takes place after the copy has
already been performed, and therefore changes of the page are possible
between the time of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses
a problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural
PTE permission: when userfaultfd_writeprotect() unprotects memory
region, it unintentionally *clears* the RW-bit if it was already set.
Note that unprotecting a PTE that is not write-protected is a valid
use-case: the userfaultfd monitor might ask to unprotect a region that
holds both write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0cpu1cpu2

[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
[ page-fault ]
...
wp_page_copy()
 cow_user_page()
  [ copy page ]
[ write to old
  page ]
...
 set_pte_at_notify()

A similar scenario can happen:

cpu0cpu1cpu2cpu3

[ Writable PTE
  cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
userfaultfd_writeprotect()
[ write-unprotect ]
[ deferred TLB flush]
[ page-fault ]
wp_page_copy()
 cow_user_page()
 [ copy page ]
 ...[ write to page ]
set_pte_at_notify()

This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became
apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification")
which made wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid, during uffd-write-unprotect,
unnecessary PTE write-protection and TLB flushes.

Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Peter Xu 
Cc: Pavel Emelyanov 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: sta...@vger.kernel.org
Suggested-by: Yu Zhao 
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit 

---
v2->v3:
* Do not acquire mmap_lock for write, flush conditionally instead [Yu]
* Change the fixes tag to the patch that made the race apparent [Yu]
* Removing patch to avoid write-protect on uffd unprotect. More
  comprehensive solution to follow (and avoid the TLB flush as well).
---
 mm/memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9e8576a83147..06da04f98936 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3092,6 +3092,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_WP);
}
 
+   /*
+* Userfaultfd write-protect can defer flushes. Ensure the TLB
+* is flushed in this case before copying.
+*/
+   if (userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))
+   flush_tlb_page(vmf->vma, vmf->address);
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
-- 
2.25.1



Re: [PATCH v3] mm/userfaultfd: fix memory corruption due to writeprotect

2021-03-03 Thread Nadav Amit

> On Mar 3, 2021, at 1:51 AM, Nadav Amit  wrote:
> 
> From: Nadav Amit 
> 
> Userfaultfd self-test fails occasionally, indicating a memory
> corruption.

Please ignore - I will resend.


signature.asc
Description: Message signed with OpenPGP


Re: [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes

2021-03-02 Thread Nadav Amit


> On Mar 2, 2021, at 2:13 PM, Peter Xu  wrote:
> 
> On Fri, Dec 25, 2020 at 01:25:27AM -0800, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> This patch-set went from v1 to RFCv2, as there is still an ongoing
>> discussion regarding the way of solving the recently found races due to
>> deferred TLB flushes. These patches are only sent for reference for now,
>> and can be applied later if no better solution is taken.
>> 
>> In a nutshell, write-protecting PTEs with deferred TLB flushes was mostly
>> performed while holding mmap_lock for write. This prevented concurrent
>> page-fault handler invocations from mistakenly assuming that a page is
>> write-protected when in fact, due to the deferred TLB flush, other CPU
>> could still write to the page. Such a write can cause a memory
>> corruption if it takes place after the page was copied (in
>> cow_user_page()), and before the PTE was flushed (by wp_page_copy()).
>> 
>> However, the userfaultfd and soft-dirty mechanisms did not take
>> mmap_lock for write, but only for read, which made such races possible.
>> Since commit 09854ba94c6a ("mm: do_wp_page() simplification") these
>> races became more likely to take place as non-COW'd pages are more
>> likely to be COW'd instead of being reused. Both of the races that
>> these patches are intended to resolve were produced on v5.10.
>> 
>> To avoid the performance overhead some alternative solutions that do not
>> require to acquire mmap_lock for write were proposed, specifically for
>> userfaultfd. So far no better solution that can be backported was
>> proposed for the soft-dirty case.
>> 
>> v1->RFCv2:
>> - Better (i.e., correct) description of the userfaultfd buggy case [Yu]
>> - Patch for the soft-dirty case
> 
> Nadav,
> 
> Do you plan to post a new version to fix the tlb corrupt issue that this 
> series
> wanted to solve?

Yes, yes. Sorry for that. Will do so later today.

Regards,
Nadav


signature.asc
Description: Message signed with OpenPGP


[tip: x86/mm] smp: Run functions concurrently in smp_call_function_many_cond()

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: b54d50640ca698383fc5b711487f303c17f4b47f
Gitweb:
https://git.kernel.org/tip/b54d50640ca698383fc5b711487f303c17f4b47f
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:04 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:37 +01:00

smp: Run functions concurrently in smp_call_function_many_cond()

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, modify smp_call_function_many_cond() to allow the callers to
provide a function that should be executed (remotely/locally), and run
the remote and local invocations concurrently. Keep the other
smp_call_function_many() semantics as they are today for backward
compatibility: in that case the called function is not executed locally.

smp_call_function_many_cond() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.
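
For illustration, a caller of the exported on_each_cpu_cond_mask() wrapper
now gets concurrent local and remote execution; the per-CPU flag and the
callbacks below are made-up examples, not part of this patch:

#include <linux/smp.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(bool, needs_work);

static void do_work(void *info)
{
        /* Runs on every selected CPU; the remote IPIs and the local call
         * are now issued concurrently rather than one after the other. */
}

static bool cpu_needs_work(int cpu, void *info)
{
        return per_cpu(needs_work, cpu);
}

static void kick_cpus(const struct cpumask *mask, void *info)
{
        /* wait == true: returns only after do_work() has run on every CPU
         * in @mask for which cpu_needs_work() returned true. */
        on_each_cpu_cond_mask(cpu_needs_work, do_work, info, true, mask);
}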

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-2-na...@vmware.com
---
 kernel/smp.c | 156 --
 1 file changed, 88 insertions(+), 68 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index aeb0adf..c8a5a1f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -608,12 +608,28 @@ call:
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
+/*
+ * Flags to be used as scf_flags argument of smp_call_function_many_cond().
+ *
+ * %SCF_WAIT:  Wait until function execution is completed
+ * %SCF_RUN_LOCAL: Run also locally if local cpu is set in cpumask
+ */
+#define SCF_WAIT   (1U << 0)
+#define SCF_RUN_LOCAL  (1U << 1)
+
 static void smp_call_function_many_cond(const struct cpumask *mask,
smp_call_func_t func, void *info,
-   bool wait, smp_cond_func_t cond_func)
+   unsigned int scf_flags,
+   smp_cond_func_t cond_func)
 {
+   int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
-   int cpu, next_cpu, this_cpu = smp_processor_id();
+   bool wait = scf_flags & SCF_WAIT;
+   bool run_remote = false;
+   bool run_local = false;
+   int nr_cpus = 0;
+
+   lockdep_assert_preemption_disabled();
 
/*
 * Can deadlock when called with interrupts disabled.
@@ -621,8 +637,9 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 * send smp call function interrupt to this cpu and as such deadlocks
 * can't happen.
 */
-   WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-&& !oops_in_progress && !early_boot_irqs_disabled);
+   if (cpu_online(this_cpu) && !oops_in_progress &&
+   !early_boot_irqs_disabled)
+   lockdep_assert_irqs_enabled();
 
/*
 * When @wait we can deadlock when we interrupt between llist_add() and
@@ -632,60 +649,65 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 */
WARN_ON_ONCE(!in_task());
 
-   /* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+   /* Check if we need local execution. */
+   if ((scf_flags & SCF_RUN_LOCAL) && cpumask_test_cpu(this_cpu, mask))
+   run_local = true;
+
+   /* Check if we need remote execution, i.e., any CPU excluding this one. 
*/
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+   if (cpu < nr_cpu_ids)
+   run_remote = true;
 
-   /* No online cpus?  We're done. */
-   if (cpu >= nr_cpu_ids)
-   return;
-
-   /* Do we have another CPU which isn't us? */
-   next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-   if (next_cpu == this_cpu)
-   next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-   /* Fastpath: do that cpu by itself. */
-   if (next_cpu >= nr_cpu_ids) {
-   if (!cond_func || cond_func(cpu, 

[tip: x86/mm] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: bc51e8e6f9c387d8dda1d8dea2b8856d0ade4101
Gitweb:
https://git.kernel.org/tip/bc51e8e6f9c387d8dda1d8dea2b8856d0ade4101
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:06 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:37 +01:00

x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows the use
of a preallocated cpumask to compute the CPUs that should be flushed.

This would later allow us not to adapt on_each_cpu_cond_mask() to
support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi() might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-4-na...@vmware.com
---
 arch/x86/mm/tlb.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index bf12371..07b6701 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -788,11 +788,13 @@ done:
nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 STATIC_NOPV void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info)
 {
@@ -813,12 +815,37 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
cpumask *cpumask,
 * up on the new contents of what used to be page tables, while
 * doing a speculative memory access.
 */
-   if (info->freed_tables)
+   if (info->freed_tables) {
smp_call_function_many(cpumask, flush_tlb_func,
   (void *)info, 1);
-   else
-   on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
-   (void *)info, 1, cpumask);
+   } else {
+   /*
+* Although we could have used on_each_cpu_cond_mask(),
+* open-coding it has performance advantages, as it eliminates
+* the need for indirect calls or retpolines. In addition, it
+* allows to use a designated cpumask for evaluating the
+* condition, instead of allocating one.
+*
+* This code works under the assumption that there are no nested
+* TLB flushes, an assumption that is already made in
+* flush_tlb_mm_range().
+*
+* cond_cpumask is logically a stack-local variable, but it is
+* more efficient to have it off the stack and not to allocate
+* it on demand. Preemption is disabled and this code is
+* non-reentrant.
+*/
+   struct cpumask *cond_cpumask = this_cpu_ptr(_tlb_mask);
+   int cpu;
+
+   cpumask_clear(cond_cpumask);
+
+   for_each_cpu(cpu, cpumask) {
+   if (tlb_is_not_lazy(cpu))
+   __cpumask_set_cpu(cpu, cond_cpumask);
+   }
+   smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
*)info, 1);
+   }
 }
 
 void flush_tlb_others(const struct cpumask *cpumask,


[tip: x86/mm] x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: f4f14f7c20440a442b4eaeb7b6f25cd0fc437e36
Gitweb:
https://git.kernel.org/tip/f4f14f7c20440a442b4eaeb7b6f25cd0fc437e36
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:05 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:37 +01:00

x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

The unification of these two functions allows them to be used in the
updated SMP infrastructure.

To do so, remove the reason argument from flush_tlb_func_local(), add
a member to struct tlb_flush_info that says which CPU initiated the
flush and act accordingly. Optimize the size of flush_tlb_info while we
are at it.

Unfortunately, this prevents us from using a constant tlb_flush_info for
arch_tlbbatch_flush(), but in a later stage we may be able to inline
tlb_flush_info into the IPI data, so it should not have an impact
eventually.

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-3-na...@vmware.com
---
 arch/x86/include/asm/tlbflush.h |  5 +-
 arch/x86/mm/tlb.c   | 81 ++--
 2 files changed, 39 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e..a7a598a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -201,8 +201,9 @@ struct flush_tlb_info {
unsigned long   start;
unsigned long   end;
u64 new_tlb_gen;
-   unsigned intstride_shift;
-   boolfreed_tables;
+   unsigned intinitiating_cpu;
+   u8  stride_shift;
+   u8  freed_tables;
 };
 
 void flush_tlb_local(void);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d..bf12371 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -439,7 +439,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 */
 
-   /* We don't want flush_tlb_func_* to run concurrently with us. */
+   /* We don't want flush_tlb_func() to run concurrently with us. */
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());
 
@@ -647,14 +647,13 @@ void initialize_tlbstate_and_flush(void)
 }
 
 /*
- * flush_tlb_func_common()'s memory ordering requirement is that any
+ * flush_tlb_func()'s memory ordering requirement is that any
  * TLB fills that happen after we flush the TLB are ordered after we
  * read active_mm's tlb_gen.  We don't need any explicit barriers
  * because all x86 flush operations are serializing and the
  * atomic64_read operation won't be reordered by the compiler.
  */
-static void flush_tlb_func_common(const struct flush_tlb_info *f,
- bool local, enum tlb_flush_reason reason)
+static void flush_tlb_func(void *info)
 {
/*
 * We have three different tlb_gen values in here.  They are:
@@ -665,14 +664,26 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * - f->new_tlb_gen: the generation that the requester of the flush
 *   wants us to catch up to.
 */
+   const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 mm_tlb_gen = atomic64_read(_mm->context.tlb_gen);
u64 local_tlb_gen = 
this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+   bool local = smp_processor_id() == f->initiating_cpu;
+   unsigned long nr_invalidate = 0;
 
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
 
+   if (!local) {
+   inc_irq_stat(irq_tlb_count);
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+
+   /* Can only happen on remote CPUs */
+   if (f->mm && f->mm != loaded_mm)
+   return;
+   }
+
if (unlikely(loaded_mm == _mm))
return;
 
@@ -700,8 +711,7 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * be handled can catch us all the way up, leaving no work for
 * the second flush.
 */
-   trace_tlb_flush(reason, 0);
-   return;
+   goto done;
}
 
WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
@@ -748,46 +758,34 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {
/* Partial flush */
-   unsigned

[tip: x86/mm] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: efa72447b0b95cd5e8b2bd7cf55ae23c716f8702
Gitweb:
https://git.kernel.org/tip/efa72447b0b95cd5e8b2bd7cf55ae23c716f8702
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:07 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:37 +01:00

x86/mm/tlb: Flush remote and local TLBs concurrently

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
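
The local-only fast path described above amounts to roughly the following
(a simplified sketch of what flush_tlb_mm_range() does in arch/x86/mm/tlb.c;
the function name is made up, and flush_tlb_func()/flush_tlb_multi() are
file-local there):

static void flush_tlb_mm_range_sketch(struct mm_struct *mm,
                                      struct flush_tlb_info *info)
{
        int cpu = get_cpu();

        if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
                /* Another CPU may cache this mm: use the (now concurrent)
                 * multi-CPU flush, which also runs locally if needed. */
                flush_tlb_multi(mm_cpumask(mm), info);
        } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
                /* Purely local: skip the SMP call machinery entirely. */
                lockdep_assert_irqs_enabled();
                local_irq_disable();
                flush_tlb_func(info);
                local_irq_enable();
        }

        put_cpu();
}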

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-5-na...@vmware.com
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h   |  4 +-
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 --
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 46 --
 arch/x86/xen/mmu_pv.c | 11 ++
 include/trace/events/xen.h|  2 +-
 10 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350..681dba8 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,8 +52,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -61,7 +61,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -164,7 +164,7 @@ check_status:
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -239,6 +239,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110..45b55e3 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -50,7 +50,7 @@ static inline void slow_down_io(void)
 void native_flush_tlb_local(void);
 void native_flush_tlb_global(void);
 void native_flush_tlb_one_user(unsigned long addr);
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
 static inline void __flush_tlb_local(void)
@@ -68,10 +68,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void __flush_tlb_others(const struct cpumask *cpumask,
+static inline void __flush_tlb_multi(const struct cpumask *cpumask,
  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index de87087..b7b35d5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -188,8 +188,8 @@ struct pv_mmu_ops 

[tip: x86/mm] x86/mm/tlb: Privatize cpu_tlbstate

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: fe978069739b59804c911fc9e9645ce768ec5b9e
Gitweb:
https://git.kernel.org/tip/fe978069739b59804c911fc9e9645ce768ec5b9e
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:08 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:37 +01:00

x86/mm/tlb: Privatize cpu_tlbstate

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.
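
The intended access pattern after the split looks like this (illustrative
sketch; loaded_mm_local() is a made-up name, and tlb_is_not_lazy() exists
in arch/x86/mm/tlb.c in a similar form):

/* Remote reads only ever touch the shared, separately-aligned part: */
static bool tlb_is_not_lazy(int cpu)
{
        return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
}

/* The hot, CPU-local state stays private and is never read by other
 * CPUs, which avoids false sharing of its cachelines: */
static struct mm_struct *loaded_mm_local(void)
{
        return this_cpu_read(cpu_tlbstate.loaded_mm);
}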

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-6-na...@vmware.com
---
 arch/x86/include/asm/tlbflush.h | 39 +---
 arch/x86/kernel/alternative.c   |  2 +-
 arch/x86/mm/init.c  |  2 +-
 arch/x86/mm/tlb.c   | 17 --
 4 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3c6681d..fa952ea 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -90,23 +90,6 @@ struct tlb_state {
u16 next_asid;
 
/*
-* We can be in one of several states:
-*
-*  - Actively using an mm.  Our CPU's bit will be set in
-*mm_cpumask(loaded_mm) and is_lazy == false;
-*
-*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
-*will not be set in mm_cpumask(_mm) and is_lazy == false.
-*
-*  - Lazily using a real mm.  loaded_mm != _mm, our bit
-*is set in mm_cpumask(loaded_mm), but is_lazy == true.
-*We're heuristically guessing that the CR3 load we
-*skipped more than makes up for the overhead added by
-*lazy mode.
-*/
-   bool is_lazy;
-
-   /*
 * If set we changed the page tables in such a way that we
 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
 * This tells us to go invalidate all the non-loaded ctxs[]
@@ -151,7 +134,27 @@ struct tlb_state {
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+   /*
+* We can be in one of several states:
+*
+*  - Actively using an mm.  Our CPU's bit will be set in
+*mm_cpumask(loaded_mm) and is_lazy == false;
+*
+*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
+*will not be set in mm_cpumask(_mm) and is_lazy == false.
+*
+*  - Lazily using a real mm.  loaded_mm != _mm, our bit
+*is set in mm_cpumask(loaded_mm), but is_lazy == true.
+*We're heuristically guessing that the CR3 load we
+*skipped more than makes up for the overhead added by
+*lazy mode.
+*/
+   bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 bool nmi_uaccess_okay(void);
 #define nmi_uaccess_okay nmi_uaccess_okay
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e4..94649f8 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -813,7 +813,7 @@ static inline temp_mm_state_t use_temporary_mm(struct 
mm_struct *mm)
 * with a stale address space WITHOUT being in lazy mode after
 * restoring the previous mm.
 */
-   if (this_cpu_read(cpu_tlbstate.is_lazy))
+   if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
leave_mm(smp_processor_id());
 
temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dd694fb..ed2e367 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1017,7 +1017,7 @@ void __init zone_sizes_init(void)
free_area_init(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = _mm,
.next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8db87cd..345a0af 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -300,7 +300,7 @@ void leave_mm(int cpu)
return;
 
/* Warn if we're not lazy. */
-   WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+   WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
switch_mm(NULL, _mm, NULL);
 }
@@ -424,7 +424,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid

[tip: x86/mm] smp: Inline on_each_cpu_cond() and on_each_cpu()

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 28344ab0a282a5ab5e4d56bfbcb2b363f4c15447
Gitweb:
https://git.kernel.org/tip/28344ab0a282a5ab5e4d56bfbcb2b363f4c15447
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:12 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 09:09:50 +01:00

smp: Inline on_each_cpu_cond() and on_each_cpu()

Simplify the code and avoid having an additional function on the stack
by inlining on_each_cpu_cond() and on_each_cpu().

Suggested-by: Peter Zijlstra 
Signed-off-by: Nadav Amit 
[ Minor edits. ]
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20210220231712.2475218-10-na...@vmware.com
---
 include/linux/smp.h | 50 ---
 kernel/smp.c| 56 +
 kernel/up.c | 38 +--
 3 files changed, 37 insertions(+), 107 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 70c6f62..84a0b48 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,30 +50,52 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 int wait);
 
+void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
+  void *info, bool wait, const struct cpumask *mask);
+
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+
 /*
  * Call a function on all processors
  */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, cpu_online_mask);
+}
 
-/*
- * Call a function on processors specified by mask, which might include
- * the local one.
+/**
+ * on_each_cpu_mask(): Run a function on processors specified by
+ * cpumask, which may include the local processor.
+ * @mask: The set of cpus to run on (only runs on online subset).
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed
+ *on other CPUs.
+ *
+ * If @wait is true, then returns once @func has returned.
+ *
+ * You must not call this function with disabled interrupts or from a
+ * hardware interrupt handler or from a bottom half handler.  The
+ * exception is that it may be used during early boot while
+ * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-   void *info, bool wait);
+static inline void on_each_cpu_mask(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, mask);
+}
 
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
- * processor.
+ * processor.  May be used during early boot while early_boot_irqs_disabled is
+ * set. Use local_irq_save/restore() instead of local_irq_disable/enable().
  */
-void on_each_cpu_cond(smp_cond_func_t cond_func, smp_call_func_t func,
- void *info, bool wait);
-
-void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
-  void *info, bool wait, const struct cpumask *mask);
-
-int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+static inline void on_each_cpu_cond(smp_cond_func_t cond_func,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(cond_func, func, info, wait, cpu_online_mask);
+}
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/smp.c b/kernel/smp.c
index c8a5a1f..b6375d7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -848,55 +848,6 @@ void __init smp_init(void)
 }
 
 /*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait)
-{
-   unsigned long flags;
-
-   preempt_disable();
-   smp_call_function(func, info, wait);
-   local_irq_save(flags);
-   func(info);
-   local_irq_restore(flags);
-   preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
-/**
- * on_each_cpu_mask(): Run a function on processors specified by
- * cpumask, which may include the local processor.
- * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
- * @info: An arbitrary pointer to pass to the function.
- * @wait: If true, wait (atomically) until function has completed
- *on other CPUs.
- *
- * If @wait is true, then returns once @func has returned.
- *
- * You must not call

[tip: x86/mm] x86/mm/tlb: Do not make is_lazy dirty for no reason

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: db73f8099a502be8ed46f6332c91754c74ac76c2
Gitweb:
https://git.kernel.org/tip/db73f8099a502be8ed46f6332c91754c74ac76c2
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:09 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:38 +01:00

x86/mm/tlb: Do not make is_lazy dirty for no reason

Blindly writing to is_lazy when the written value is identical to the
old value makes the cacheline dirty for no reason. Avoid such writes to
prevent unnecessary cache coherency traffic.

Suggested-by: Dave Hansen 
Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-7-na...@vmware.com
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 345a0af..17ec4bf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -469,7 +469,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+   if (was_lazy)
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and


[tip: x86/mm] cpumask: Mark functions as pure

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 1028a5918cbaae6b9d7f0a04b6a200b9e67aec14
Gitweb:
https://git.kernel.org/tip/1028a5918cbaae6b9d7f0a04b6a200b9e67aec14
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:10 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:38 +01:00

cpumask: Mark functions as pure

cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().
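
A generic, userspace-compilable illustration of what the __pure annotation
permits (not taken from the patch):

#include <stddef.h>

/* A pure function: its result depends only on its arguments and on the
 * memory they point to, and it has no side effects. */
__attribute__((pure))
static size_t count_bits(const unsigned long *bits, size_t n)
{
        size_t i, c = 0;

        for (i = 0; i < n; i++)
                c += __builtin_popcountl(bits[i]);
        return c;
}

void fill(const unsigned long *bits, size_t n, size_t out[2])
{
        /* Because count_bits() is pure and *bits is not written in
         * between, the compiler may evaluate the call only once. */
        out[0] = count_bits(bits, n);
        out[1] = count_bits(bits, n);
}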

Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-8-na...@vmware.com
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 383684e..c53364c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -235,7 +235,7 @@ static inline unsigned int cpumask_last(const struct 
cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -252,8 +252,8 @@ static inline unsigned int cpumask_next_zero(int n, const 
struct cpumask *srcp)
return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int __pure cpumask_next_and(int n, const struct cpumask *, const struct 
cpumask *);
+int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
   const struct cpumask *src2p);


[tip: x86/mm] x86/mm/tlb: Remove unnecessary uses of the inline keyword

2021-03-02 Thread tip-bot2 for Nadav Amit
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 327db7a160b33865e086f7fff73e08f6d8d47005
Gitweb:
https://git.kernel.org/tip/327db7a160b33865e086f7fff73e08f6d8d47005
Author:Nadav Amit 
AuthorDate:Sat, 20 Feb 2021 15:17:11 -08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Mar 2021 08:01:38 +01:00

x86/mm/tlb: Remove unnecessary uses of the inline keyword

The compiler is smart enough without these hints.

Suggested-by: Dave Hansen 
Signed-off-by: Nadav Amit 
Signed-off-by: Ingo Molnar 
Reviewed-by: Dave Hansen 
Link: https://lore.kernel.org/r/20210220231712.2475218-9-na...@vmware.com
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 17ec4bf..f4b162f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -316,7 +316,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct 
*next,
local_irq_restore(flags);
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -880,7 +880,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, 
flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
@@ -907,7 +907,7 @@ static inline struct flush_tlb_info 
*get_flush_tlb_info(struct mm_struct *mm,
return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
/* Complete reentrency prevention checks */


Re: [PATCH v6 1/9] smp: Run functions concurrently in smp_call_function_many_cond()

2021-03-01 Thread Nadav Amit


> On Mar 1, 2021, at 9:10 AM, Peter Zijlstra  wrote:
> 
> On Sat, Feb 20, 2021 at 03:17:04PM -0800, Nadav Amit wrote:
>> +/*
>> + * Choose the most efficient way to send an IPI. Note that the
>> + * number of CPUs might be zero due to concurrent changes to the
>> + * provided mask.
>> + */
>> +if (nr_cpus == 1)
>> +arch_send_call_function_single_ipi(last_cpu);
>> +else if (likely(nr_cpus > 1))
>> +arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
> 
> I just ran into conflicts with another patch set, and noticed that the
> above should probably be:
> 
>   if (nr_cpus == 1)
>   send_call_function_single_ipi(last_cpu);
>   else if (likely(nr_cpus > 1))
>   arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
> 
> Which will avoid the IPI when @last_cpu is idle.

Good point. Makes one wonder whether all this inter-core communication
(through cpu_tlbstate.is_lazy, csd->node.llist and ti->flags) is
really necessary or can be combined.

Well, that’s for later I presume.

Re: [RFC 1/6] vdso/extable: fix calculation of base

2021-02-28 Thread Nadav Amit


> On Feb 26, 2021, at 9:47 AM, Sean Christopherson  wrote:
> 
> On Fri, Feb 26, 2021, Nadav Amit wrote:
>> 
>>> On Feb 25, 2021, at 1:16 PM, Sean Christopherson  wrote:
>>> It's been literally years since I wrote this code, but I distinctly 
>>> remember the
>>> addresses being relative to the base.  I also remember testing multiple 
>>> entries,
>>> but again, that was a long time ago.
>>> 
>>> Assuming things have changed, or I was flat out wrong, the comment above the
>>> macro magic should also be updated.
>>> 
>>> /*
>>> * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
>>> * vDSO uses a dedicated handler the addresses are relative to the overall
>>> * exception table, not each individual entry.
>>> */
>> 
>> I will update the comment. I am not very familiar with pushsection stuff,
>> but the offsets were wrong.
>> 
>> Since you say you checked it, I wonder whether it can somehow be caused
>> by having exception table entries defined from multiple object files.
> 
> Oooh, I think that would do it.  Have you checked what happens if there are
> multiple object files and multiple fixups within an object file?

Good thing that you insisted...

I certainly do not know the assembly section directives well enough,
but indeed it seems (after some experiments) that referring to the
section provides different values from different objects.

So both the current (yours) and this patch (mine) are broken. I think
the easiest thing is to fall back to the kernel exception table scheme.
I checked the following with both entries in the same and different
objects and it seems to work correctly:

-- >8 --

diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index afcf5b65beef..3f395b782553 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -32,9 +32,11 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
nr_entries = image->extable_len / (sizeof(*extable));
extable = image->extable;

-   for (i = 0; i < nr_entries; i++) {
-   if (regs->ip == base + extable[i].insn) {
-   regs->ip = base + extable[i].fixup;
+   for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
+   if (regs->ip == base + extable[i].insn +
+   offsetof(struct vdso_exception_table_entry, insn)) {
+   regs->ip = base + extable[i].fixup +
+   offsetof(struct vdso_exception_table_entry, 
fixup);
regs->di = trapnr;
regs->si = error_code;
regs->dx = fault_addr;
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
index b56f6b012941..4ffe3d533148 100644
--- a/arch/x86/entry/vdso/extable.h
+++ b/arch/x86/entry/vdso/extable.h
@@ -13,8 +13,8 @@

 .macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
.pushsection __ex_table, "a"
-   .long (\from) - __ex_table
-   .long (\to) - __ex_table
+   .long (\from) - .
+   .long (\to) - .
.popsection
 .endm
 #else
--
2.25.1
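
The arithmetic behind the ".long (\from) - ." encoding can be sketched as
follows (illustrative only; the struct and field names mirror the vDSO
extable code):

struct vdso_exception_table_entry {
        int insn, fixup;
};

/* Each field stores its target relative to the field's own address, so
 * the absolute address is recovered by adding that address back in --
 * which is what the base + offsetof() terms in fixup_vdso_exception()
 * above compute. */
static unsigned long extable_insn_addr(const struct vdso_exception_table_entry *e)
{
        return (unsigned long)&e->insn + e->insn;
}

static unsigned long extable_fixup_addr(const struct vdso_exception_table_entry *e)
{
        return (unsigned long)&e->fixup + e->fixup;
}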




signature.asc
Description: Message signed with OpenPGP


Re: [RFC 1/6] vdso/extable: fix calculation of base

2021-02-26 Thread Nadav Amit


> On Feb 25, 2021, at 1:16 PM, Sean Christopherson  wrote:
> 
> On Wed, Feb 24, 2021, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Apparently, the assembly considers __ex_table as the location when the
>> pushsection directive was issued. Therefore when there is more than a
>> single entry in the vDSO exception table, the calculations of the base
>> and fixup are wrong.
>> 
>> Fix the calculations of the expected fault IP and new IP by adjusting
>> the base after each entry.
>> 
>> Cc: Andy Lutomirski 
>> Cc: Peter Zijlstra 
>> Cc: Sean Christopherson 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: Andrew Morton 
>> Cc: x...@kernel.org
>> Signed-off-by: Nadav Amit 
>> ---
>> arch/x86/entry/vdso/extable.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
>> index afcf5b65beef..c81e78636220 100644
>> --- a/arch/x86/entry/vdso/extable.c
>> +++ b/arch/x86/entry/vdso/extable.c
>> @@ -32,7 +32,7 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
>>  nr_entries = image->extable_len / (sizeof(*extable));
>>  extable = image->extable;
>> 
>> -for (i = 0; i < nr_entries; i++) {
>> +for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
> 
> It's been literally years since I wrote this code, but I distinctly remember 
> the
> addresses being relative to the base.  I also remember testing multiple 
> entries,
> but again, that was a long time ago.
> 
> Assuming things have changed, or I was flat out wrong, the comment above the
> macro magic should also be updated.
> 
> /*
> * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
> * vDSO uses a dedicated handler the addresses are relative to the overall
> * exception table, not each individual entry.
> */

I will update the comment. I am not very familiar with pushsection stuff,
but the offsets were wrong.

Since you say you checked it, I wonder whether it can somehow be caused
by having exception table entries defined from multiple object files.

Anyhow, this change follows the kernel’s (not vDSO) exception table
scheme.



signature.asc
Description: Message signed with OpenPGP


Re: [RFC 0/6] x86: prefetch_page() vDSO call

2021-02-25 Thread Nadav Amit



> On Feb 25, 2021, at 9:32 AM, Matthew Wilcox  wrote:
> 
> On Thu, Feb 25, 2021 at 04:56:50PM +0000, Nadav Amit wrote:
>> 
>>> On Feb 25, 2021, at 4:16 AM, Matthew Wilcox  wrote:
>>> 
>>> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>>>> Just as applications can use prefetch instructions to overlap
>>>> computations and memory accesses, applications may want to overlap the
>>>> page-faults and compute or overlap the I/O accesses that are required
>>>> for page-faults of different pages.
>>> 
>>> Isn't this madvise(MADV_WILLNEED)?
>> 
>> Good point that I should have mentioned. In a way prefetch_page() is a
>> combination of mincore() and MADV_WILLNEED.
>> 
>> There are 4 main differences from MADV_WILLNEED:
>> 
>> 1. Much lower invocation cost if the readahead is not needed: this allows
>> to prefetch pages more abundantly.
> 
> That seems like something that could be fixed in libc -- if we add a
> page prefetch vdso call, an application calling posix_madvise() could
> be implemented by calling this fast path.  Assuming the performance
> increase justifies this extra complexity.
> 
>> 2. Return value: return value tells you whether the page is accessible.
>> This makes it usable for coroutines, for instance. In this regard the
>> call is more similar to mincore() than MADV_WILLNEED.
> 
> I don't quite understand the programming model you're describing here.
> 
>> 3. The PTEs are mapped if the pages are already present in the
>> swap/page-cache, preventing an additional page-fault just to map them.
> 
> We could enhance madvise() to do this, no?
> 
>> 4. Avoiding heavy-weight reclamation on low memory (this may need to
>> be selective, and can be integrated with MADV_WILLNEED).
> 
> Likewise.
> 
> I don't want to add a new Linux-specific call when there's already a
> POSIX interface that communicates the exact same thing.  The return
> value seems like the only problem.

I agree that this call does not have to be exposed to the application.

I am not sure there is a lot of extra complexity now, but obviously
some evaluations are needed.




Re: [RFC 0/6] x86: prefetch_page() vDSO call

2021-02-25 Thread Nadav Amit


> On Feb 25, 2021, at 4:16 AM, Matthew Wilcox  wrote:
> 
> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>> Just as applications can use prefetch instructions to overlap
>> computations and memory accesses, applications may want to overlap the
>> page-faults and compute or overlap the I/O accesses that are required
>> for page-faults of different pages.
> 
> Isn't this madvise(MADV_WILLNEED)?

Good point that I should have mentioned. In a way, prefetch_page() is a
combination of mincore() and MADV_WILLNEED.

There are 4 main differences from MADV_WILLNEED:

1. Much lower invocation cost if the readahead is not needed: this allows
to prefetch pages more abundantly.

2. Return value: return value tells you whether the page is accessible.
This makes it usable for coroutines, for instance. In this regard the
call is more similar to mincore() than MADV_WILLNEED.

3. The PTEs are mapped if the pages are already present in the
swap/page-cache, preventing an additional page-fault just to map them.

4. Avoiding heavy-weight reclamation on low memory (this may need to
be selective, and can be integrated with MADV_WILLNEED).
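
To make difference (2) above concrete, here is a hypothetical usage
sketch; process() and yield_to_scheduler() are made-up placeholders and
not part of the series:

	if (prefetch_page(addr) == 0) {
		/* The page is accessible now - use the data directly. */
		process(addr);
	} else {
		/*
		 * I/O was started (or reclaim was deemed too expensive);
		 * switch to another coroutine and retry this address later.
		 */
		yield_to_scheduler();
	}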




Re: [RFC 0/6] x86: prefetch_page() vDSO call

2021-02-25 Thread Nadav Amit


> On Feb 25, 2021, at 12:52 AM, Nadav Amit  wrote:
> 
> 
> 
>> On Feb 25, 2021, at 12:40 AM, Peter Zijlstra  wrote:
>> 
>> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>>> From: Nadav Amit 
>>> 
>>> Just as applications can use prefetch instructions to overlap
>>> computations and memory accesses, applications may want to overlap the
>>> page-faults and compute or overlap the I/O accesses that are required
>>> for page-faults of different pages.
[ snip ]

>> Interesting, but given we've been removing explicit prefetch from some
>> parts of the kernel how useful is this in actual use? I'm thinking there
>> should at least be a real user and performance numbers with this before
>> merging.
> 
> Can you give me a reference to the “removing explicit prefetch from some
> parts of the kernel”?

Oh. I get it - you mean we removed the use of explicit memory prefetch
from the kernel code. Well, I don’t think it is really related, but yes,
performance numbers are needed.





Re: [RFC 0/6] x86: prefetch_page() vDSO call

2021-02-25 Thread Nadav Amit


> On Feb 25, 2021, at 12:40 AM, Peter Zijlstra  wrote:
> 
> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Just as applications can use prefetch instructions to overlap
>> computations and memory accesses, applications may want to overlap the
>> page-faults and compute or overlap the I/O accesses that are required
>> for page-faults of different pages.
>> 
>> Applications can use multiple threads and cores for this matter, by
>> running one thread that prefetches the data (i.e., faults in the data)
>> and another that does the compute, but this scheme is inefficient. Using
>> mincore() can tell whether a page is mapped, but might not tell whether
>> the page is in the page-cache and does not fault in the data.
>> 
>> Introduce prefetch_page() vDSO-call to prefetch, i.e. fault-in memory
>> asynchronously. The semantics of this call are: try to prefetch the page
>> at a given address and return zero if the page is accessible following
>> the call. Start I/O operations to retrieve the page if such operations
>> are required and there is no high memory pressure that might introduce
>> slowdowns.
>> 
>> Note that as usual the page might be paged-out at any point and
>> therefore, similarly to mincore(), there is no guarantee that the page
>> will be present at the time that the user application uses the data that
>> resides on the page. Nevertheless, it is expected that in the vast
>> majority of the cases this would not happen, since prefetch_page()
>> accesses the page and therefore sets the PTE access-bit (if it is
>> clear).
>> 
>> The implementation is as follows. The vDSO code accesses the data,
>> triggering a page-fault if it is not present. The handler detects, based on
>> the instruction pointer, that this is an asynchronous #PF, using the
>> recently introduced vDSO exception tables. If the page can be brought
>> without waiting (e.g., the page is already in the page-cache), the
>> kernel handles the fault and returns success (zero). If there is memory
>> pressure that prevents the proper handling of the fault (i.e., requires
>> heavy-weight reclamation) it returns a failure. Otherwise, it starts an
>> I/O to bring the page and returns failure.
>> 
>> Compilers can be extended to issue the prefetch_page() calls when
>> needed.
> 
> Interesting, but given we've been removing explicit prefetch from some
> parts of the kernel how useful is this in actual use? I'm thinking there
> should at least be a real user and performance numbers with this before
> merging.

Can you give me a reference to the “removing explicit prefetch from some
parts of the kernel”?

I will work on an llvm/gcc plugin to provide some performance numbers.
I wanted to make sure that the idea is not a complete obscenity first.





[RFC 5/6] mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

When FAULT_FLAG_RETRY_NOWAIT is set, the caller arguably wants only a
lightweight reclaim, to avoid a long reclamation that would not respect
the "NOWAIT" semantic. Honor the request accordingly in swap and
file-backed page-faults during the first try.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org
Signed-off-by: Nadav Amit 
---
 mm/memory.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 13b9cf36268f..70899c92a9e6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2679,18 +2679,31 @@ static inline bool cow_user_page(struct page *dst, 
struct page *src,
return ret;
 }
 
-static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+static gfp_t massage_page_gfp_mask(gfp_t gfp_mask, unsigned long vmf_flags)
 {
-   struct file *vm_file = vma->vm_file;
+   if (fault_flag_allow_retry_first(vmf_flags) &&
+   (vmf_flags & FAULT_FLAG_RETRY_NOWAIT))
+   gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
 
-   if (vm_file)
-   return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | 
__GFP_IO;
+   return gfp_mask;
+}
+
+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma,
+ unsigned long flags)
+{
+   struct file *vm_file = vma->vm_file;
+   gfp_t gfp_mask;
 
/*
 * Special mappings (e.g. VDSO) do not have any file so fake
 * a default GFP_KERNEL for them.
 */
-   return GFP_KERNEL;
+   if (!vm_file)
+   return GFP_KERNEL;
+
+   gfp_mask = mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+
+   return massage_page_gfp_mask(gfp_mask, flags);
 }
 
 /*
@@ -3253,6 +3266,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
  */
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
+   gfp_t gfp_mask = massage_page_gfp_mask(GFP_HIGHUSER_MOVABLE, 
vmf->flags);
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
swp_entry_t entry;
@@ -3293,8 +3307,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
-   page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
-   vmf->address);
+   page = alloc_page_vma(gfp_mask, vma, vmf->address);
if (page) {
int err;
 
@@ -3320,8 +,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swap_readpage(page, true);
}
} else {
-   page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-   vmf);
+   page = swapin_readahead(entry, gfp_mask, vmf);
swapcache = page;
}
 
@@ -4452,7 +4464,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct 
*vma,
.address = address & PAGE_MASK,
.flags = flags,
.pgoff = linear_page_index(vma, address),
-   .gfp_mask = __get_fault_gfp_mask(vma),
+   .gfp_mask = __get_fault_gfp_mask(vma, flags),
};
unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
-- 
2.25.1



[PATCH 6/6] testing/selftest: test vDSO prefetch_page()

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

Test prefetch_page() in cases of an invalid pointer, a file-mmap and
anonymous memory. Partial checks are also done with the mincore syscall
to ensure the output of prefetch_page() is consistent with mincore
(taking into account the different semantics of the two).

The tests are not fool-proof, as they rely on the behavior of the
page-cache and the page reclamation mechanism to get a major
page-fault. They should be robust in the sense that a test is skipped
if it cannot be run reliably.

There is an open question, though, of how much memory to access in the
anonymous-memory test to force the eviction of a page and trigger a
refault.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org
Signed-off-by: Nadav Amit 
---
 tools/testing/selftests/vDSO/Makefile |   2 +
 .../selftests/vDSO/vdso_test_prefetch_page.c  | 265 ++
 2 files changed, 267 insertions(+)
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_prefetch_page.c

diff --git a/tools/testing/selftests/vDSO/Makefile 
b/tools/testing/selftests/vDSO/Makefile
index d53a4d8008f9..dcd1ede8c0f7 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -11,6 +11,7 @@ ifeq ($(ARCH),$(filter $(ARCH),x86 x86_64))
 TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
 endif
 TEST_GEN_PROGS += $(OUTPUT)/vdso_test_correctness
+TEST_GEN_PROGS += $(OUTPUT)/vdso_test_prefetch_page
 
 CFLAGS := -std=gnu99
 CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables 
-fno-stack-protector
@@ -33,3 +34,4 @@ $(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
vdso_test_correctness.c \
-o $@ \
$(LDFLAGS_vdso_test_correctness)
+$(OUTPUT)/vdso_test_prefetch_page: vdso_test_prefetch_page.c parse_vdso.c
diff --git a/tools/testing/selftests/vDSO/vdso_test_prefetch_page.c 
b/tools/testing/selftests/vDSO/vdso_test_prefetch_page.c
new file mode 100644
index ..35928c3f36ca
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_prefetch_page.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vdso_test_prefetch_page.c: Test vDSO's prefetch_page())
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+#include "parse_vdso.h"
+
+const char *version = "LINUX_2.6";
+const char *name = "__vdso_prefetch_page";
+
+struct getcpu_cache;
+typedef long (*prefetch_page_t)(const void *p);
+
+#define MEM_SIZE_K (950ull)
+#define PAGE_SIZE  (4096ull)
+
+#define SKIP_MINCORE_BEFORE(1 << 0)
+#define SKIP_MINCORE_AFTER (1 << 1)
+
+static prefetch_page_t prefetch_page;
+
+static const void *ptr_align(const void *p)
+{
+   return (const void *)((unsigned long)p & ~(PAGE_SIZE - 1));
+}
+
+
+static int __test_prefetch(const void *p, bool expected_no_io,
+  const char *test_name, unsigned int skip_mincore)
+{
+   bool no_io;
+   char vec;
+   long r;
+   uint64_t start;
+
+   p = ptr_align(p);
+
+   /*
+* First, run a sanity check to use mincore() to see if the page is in
+* memory when we expect it not to be.  We can only trust mincore to
+* tell us when a page is already in memory when it should not be.
+*/
+   if (!(skip_mincore & SKIP_MINCORE_BEFORE)) {
+   if (mincore((void *)p, PAGE_SIZE, )) {
+   printf("[SKIP]\t%s: mincore failed: %s\n", test_name,
+  strerror(errno));
+   return 0;
+   }
+
+   no_io = vec & 1;
+   if (!skip_mincore && no_io && !expected_no_io) {
+   printf("[SKIP]\t%s: unexpected page state: %s\n",
+  test_name,
+  no_io ? "in memory" : "not in memory");
+   return 0;
+   }
+   }
+
+   /*
+* Check we got the expected result from prefetch page.
+*/
+   r = prefetch_page(p);
+
+   no_io = r == 0;
+   if (no_io != expected_no_io) {
+   printf("[FAIL]\t%s: prefetch_page() returned %ld\n",
+  test_name, r);
+   return KSFT_FAIL;
+   }
+
+   if (skip_mincore & SKIP_MINCORE_AFTER)
+   return 0;
+
+   /*
+* Check again using mincore that the page state is as expected.
+* A bit racy. Skip the test if mincore fails.
+*/
+   if (mincore((void *)p, PAGE_SIZE, )) {
+   printf("[SKIP]\t%s: mincore failed: %s\n", test_name,
+  str

[RFC 4/6] mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

Certain use-cases (e.g., prefetch_page()) may want to avoid polling
while a page is brought in from swap. Yet, swap_cluster_readahead()
and swap_vma_readahead() do not respect FAULT_FLAG_RETRY_NOWAIT.

Add support to respect FAULT_FLAG_RETRY_NOWAIT by not polling in these
cases.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org
Signed-off-by: Nadav Amit 
---
 mm/memory.c | 15 +--
 mm/shmem.c  |  1 +
 mm/swap_state.c | 12 +---
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..13b9cf36268f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3326,12 +3326,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
 
if (!page) {
+   /*
+* Back out if we failed to bring the page while we
+* tried to avoid I/O.
+*/
+   if (fault_flag_allow_retry_first(vmf->flags) &&
+   (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+   ret = VM_FAULT_RETRY;
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   goto out;
+   }
+
/*
 * Back out if somebody else faulted in this pte
 * while we released the pte lock.
 */
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, >ptl);
+   vmf->pte = pte_offset_map_lock(vma->vm_mm,
+   vmf->pmd, vmf->address, >ptl);
if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..b108e9ba9e89 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1525,6 +1525,7 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t 
gfp,
shmem_pseudo_vma_init(, info, index);
vmf.vma = 
vmf.address = 0;
+   vmf.flags = 0;
page = swap_cluster_readahead(swap, gfp, );
shmem_pseudo_vma_destroy();
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 751c1ef2fe0e..1e930f7ff8b3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -656,10 +656,13 @@ struct page *swap_cluster_readahead(swp_entry_t entry, 
gfp_t gfp_mask,
unsigned long mask;
struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
-   bool do_poll = true, page_allocated;
+   bool page_allocated, do_poll;
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;
 
+   do_poll = !fault_flag_allow_retry_first(vmf->flags) ||
+   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
+
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
@@ -838,7 +841,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, 
gfp_t gfp_mask,
pte_t *pte, pentry;
swp_entry_t entry;
unsigned int i;
-   bool page_allocated;
+   bool page_allocated, do_poll;
struct vma_swap_readahead ra_info = {
.win = 1,
};
@@ -873,9 +876,12 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, 
gfp_t gfp_mask,
}
blk_finish_plug();
lru_add_drain();
+
 skip:
+   do_poll = (!fault_flag_allow_retry_first(vmf->flags) ||
+   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) && ra_info.win == 1;
return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
-ra_info.win == 1);
+do_poll);
 }
 
 /**
-- 
2.25.1



[RFC 3/6] x86/vdso: introduce page_prefetch()

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

Introduce a new vDSO function: page_prefetch() which is to be used when
certain memory, which might be paged out, is expected to be used soon.
The function prefetches the page if needed. The function returns zero if
the page is accessible after the call and -1 otherwise.

page_prefetch() is intended to be very lightweight both when the page is
already present and when the page is prefetched.

The implementation leverages the new vDSO exception tables mechanism.
page_prefetch() accesses the page for read and has a corresponding vDSO
exception-table entry that indicates that a #PF might occur and that in
such a case the page should be brought in asynchronously. If a #PF indeed
occurs, the page-fault handler sets the FAULT_FLAG_RETRY_NOWAIT flag.

If the page-fault was not resolved, the page-fault handler does not
retry, and instead jumps to the new IP that is marked in the exception
table. The vDSO part then returns the corresponding return value.
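
For illustration, this is roughly how userspace would resolve and call
the new entry - a sketch based on the selftest in this series, assuming
the parse_vdso.c helpers from tools/testing/selftests/vDSO and the
symbol/version names that the selftest uses:

	#include <sys/auxv.h>
	#include "parse_vdso.h"

	typedef long (*prefetch_page_t)(const void *p);
	static prefetch_page_t prefetch_page;

	static void init_prefetch_page(void)
	{
		vdso_init_from_sysinfo_ehdr(getauxval(AT_SYSINFO_EHDR));
		prefetch_page = (prefetch_page_t)vdso_sym("LINUX_2.6",
							  "__vdso_prefetch_page");
	}

A NULL check on the resolved pointer is of course needed when the vDSO
does not provide the symbol.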

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org
Signed-off-by: Nadav Amit 
---
 arch/x86/Kconfig|  1 +
 arch/x86/entry/vdso/Makefile|  1 +
 arch/x86/entry/vdso/extable.c   | 59 +
 arch/x86/entry/vdso/vdso.lds.S  |  1 +
 arch/x86/entry/vdso/vprefetch.S | 39 ++
 arch/x86/include/asm/vdso.h | 38 +++--
 arch/x86/mm/fault.c | 11 --
 lib/vdso/Kconfig|  5 +++
 8 files changed, 136 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/entry/vdso/vprefetch.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..86a4c265e8af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -136,6 +136,7 @@ config X86
select GENERIC_TIME_VSYSCALL
select GENERIC_GETTIMEOFDAY
select GENERIC_VDSO_TIME_NS
+   select GENERIC_VDSO_PREFETCH
select GUP_GET_PTE_LOW_HIGH if X86_PAE
select HARDIRQS_SW_RESEND
select HARDLOCKUP_CHECK_TIMESTAMP   if X86_64
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 02e3e42f380b..e32ca1375b84 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -28,6 +28,7 @@ vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
 vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
 vobjs32-y += vdso32/vclock_gettime.o
 vobjs-$(CONFIG_X86_SGX)+= vsgx.o
+vobjs-$(CONFIG_GENERIC_VDSO_PREFETCH) += vprefetch.o
 
 # files to link into kernel
 obj-y  += vma.o extable.o
diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index 93fb37bd32ad..e821887112ce 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -4,36 +4,67 @@
 #include 
 #include 
 #include 
+#include "extable.h"
 
 struct vdso_exception_table_entry {
int insn, fixup;
unsigned int mask, flags;
 };
 
-bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
- unsigned long error_code, unsigned long fault_addr)
+static unsigned long
+get_vdso_exception_table_entry(const struct pt_regs *regs, int trapnr,
+  unsigned int *flags)
 {
const struct vdso_image *image = current->mm->context.vdso_image;
const struct vdso_exception_table_entry *extable;
unsigned int nr_entries, i;
unsigned long base;
+   unsigned long ip = regs->ip;
+   unsigned long vdso_base = (unsigned long)current->mm->context.vdso;
 
-   if (!current->mm->context.vdso)
-   return false;
-
-   base =  (unsigned long)current->mm->context.vdso + image->extable_base;
+   base = vdso_base + image->extable_base;
nr_entries = image->extable_len / (sizeof(*extable));
extable = image->extable;
 
for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
-   if (regs->ip == base + extable[i].insn) {
-   regs->ip = base + extable[i].fixup;
-   regs->di = trapnr;
-   regs->si = error_code;
-   regs->dx = fault_addr;
-   return true;
-   }
+   if (ip != base + extable[i].insn)
+   continue;
+
+   if (!((1u << trapnr) & extable[i].mask))
+   continue;
+
+   /* found */
+   if (flags)
+   *flags = extable[i].flags;
+   return base + extable[i].fixup;
}
 
-   return false;
+   return 0;
+}
+
+bool __fixup_vdso_exception(struct pt_regs *regs, int trapnr,
+   unsigned long error_code, unsigned long fault_addr)
+{
+   unsigned long new_ip;
+
+   new_ip = get_vdso_exception_table

[RFC 1/6] vdso/extable: fix calculation of base

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

Apparently, the assembler considers __ex_table to be the location at which
the pushsection directive was issued. Therefore, when there is more than a
single entry in the vDSO exception table, the calculations of the base
and fixup are wrong.

Fix the calculations of the expected fault IP and new IP by adjusting
the base after each entry.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org
Signed-off-by: Nadav Amit 
---
 arch/x86/entry/vdso/extable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index afcf5b65beef..c81e78636220 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -32,7 +32,7 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
nr_entries = image->extable_len / (sizeof(*extable));
extable = image->extable;
 
-   for (i = 0; i < nr_entries; i++) {
+   for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
if (regs->ip == base + extable[i].insn) {
regs->ip = base + extable[i].fixup;
regs->di = trapnr;
-- 
2.25.1



[RFC 2/6] x86/vdso: add mask and flags to extable

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

Add a "mask" field to vDSO exception tables that says which exceptions
should be handled.

Add a "flags" field to vDSO as well to provide additional information
about the exception.

The existing preprocessor macro _ASM_VDSO_EXTABLE_HANDLE for assembly is
not easy to use as it requires the user to stringify the expanded C
macro. Remove _ASM_VDSO_EXTABLE_HANDLE and use a similar scheme to
ALTERNATIVE, using assembly macros directly in assembly without wrapping
them in C macros.

Move the vsgx supported exceptions test out of the generic C code into
vsgx-specific assembly by setting vsgx supported exceptions in the mask.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org
Signed-off-by: Nadav Amit 
---
 arch/x86/entry/vdso/extable.c |  9 +
 arch/x86/entry/vdso/extable.h | 21 +
 arch/x86/entry/vdso/vsgx.S|  9 +++--
 3 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index c81e78636220..93fb37bd32ad 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -7,6 +7,7 @@
 
 struct vdso_exception_table_entry {
int insn, fixup;
+   unsigned int mask, flags;
 };
 
 bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
@@ -17,14 +18,6 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
unsigned int nr_entries, i;
unsigned long base;
 
-   /*
-* Do not attempt to fixup #DB or #BP.  It's impossible to identify
-* whether or not a #DB/#BP originated from within an SGX enclave and
-* SGX enclaves are currently the only use case for vDSO fixup.
-*/
-   if (trapnr == X86_TRAP_DB || trapnr == X86_TRAP_BP)
-   return false;
-
if (!current->mm->context.vdso)
return false;
 
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
index b56f6b012941..7ca8a0776805 100644
--- a/arch/x86/entry/vdso/extable.h
+++ b/arch/x86/entry/vdso/extable.h
@@ -2,26 +2,31 @@
 #ifndef __VDSO_EXTABLE_H
 #define __VDSO_EXTABLE_H
 
+#include 
+
+#define ASM_VDSO_ASYNC_FLAGS   (1 << 0)
+
 /*
  * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
  * vDSO uses a dedicated handler the addresses are relative to the overall
  * exception table, not each individual entry.
  */
 #ifdef __ASSEMBLY__
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to) \
-   ASM_VDSO_EXTABLE_HANDLE from to
-
-.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
+.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req mask:req flags:req
.pushsection __ex_table, "a"
.long (\from) - __ex_table
.long (\to) - __ex_table
+   .long (\mask)
+   .long (\flags)
.popsection
 .endm
 #else
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to) \
-   ".pushsection __ex_table, \"a\"\n"  \
-   ".long (" #from ") - __ex_table\n"  \
-   ".long (" #to ") - __ex_table\n"\
+#define ASM_VDSO_EXTABLE_HANDLE(from, to, mask, flags) \
+   ".pushsection __ex_table, \"a\"\n"  \
+   ".long (" #from ") - __ex_table\n"  \
+   ".long (" #to ") - __ex_table\n"\
+   ".long (" #mask ")\n"   \
+   ".long (" #flags ")\n"  \
".popsection\n"
 #endif
 
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index 86a0e94f68df..c588255af480 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "extable.h"
 
@@ -146,6 +147,10 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
 
.cfi_endproc
 
-_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
-
+/*
+ * Do not attempt to fixup #DB or #BP.  It's impossible to identify
+ * whether or not a #DB/#BP originated from within an SGX enclave.
+ */
+ASM_VDSO_EXTABLE_HANDLE .Lenclu_eenter_eresume, .Lhandle_exception,\
+   ~((1<

[RFC 0/6] x86: prefetch_page() vDSO call

2021-02-24 Thread Nadav Amit
From: Nadav Amit 

Just as applications can use prefetch instructions to overlap
computations and memory accesses, applications may want to overlap the
page-faults and compute or overlap the I/O accesses that are required
for page-faults of different pages.

Applications can use multiple threads and cores for this matter, by
running one thread that prefetches the data (i.e., faults in the data)
and another that does the compute, but this scheme is inefficient. Using
mincore() can tell whether a page is mapped, but might not tell whether
the page is in the page-cache and does not fault in the data.

Introduce prefetch_page() vDSO-call to prefetch, i.e. fault-in memory
asynchronously. The semantics of this call are: try to prefetch the page
at a given address and return zero if the page is accessible following
the call. Start I/O operations to retrieve the page if such operations
are required and there is no high memory pressure that might introduce
slowdowns.

Note that as usual the page might be paged-out at any point and
therefore, similarly to mincore(), there is no guarantee that the page
will be present at the time that the user application uses the data that
resides on the page. Nevertheless, it is expected that in the vast
majority of the cases this would not happen, since prefetch_page()
accesses the page and therefore sets the PTE access-bit (if it is
clear). 

The implementation is as follows. The vDSO code accesses the data,
triggering a page-fault if it is not present. The handler detects, based on
the instruction pointer, that this is an asynchronous #PF, using the
recently introduced vDSO exception tables. If the page can be brought
without waiting (e.g., the page is already in the page-cache), the
kernel handles the fault and returns success (zero). If there is memory
pressure that prevents the proper handling of the fault (i.e., requires
heavy-weight reclamation) it returns a failure. Otherwise, it starts an
I/O to bring the page and returns failure.

Compilers can be extended to issue the prefetch_page() calls when
needed.
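
As a hand-written illustration of the intended overlap (compute() and
the page-sized chunking are made-up placeholders, not part of the
series), a loop can prefetch the next page while working on the current
one:

	for (i = 0; i < nr_pages; i++) {
		/* Hint: start bringing in the next page; ignore the result. */
		if (i + 1 < nr_pages)
			prefetch_page(buf + (i + 1) * PAGE_SIZE);

		compute(buf + i * PAGE_SIZE);
	}

Since the call is only a hint, a failure just means that the later
access may still take a (possibly major) page-fault, as it would have
without the prefetch.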

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Andrew Morton 
Cc: x...@kernel.org

Nadav Amit (6):
  vdso/extable: fix calculation of base
  x86/vdso: add mask and flags to extable
  x86/vdso: introduce page_prefetch()
  mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT
  mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT
  testing/selftest: test vDSO prefetch_page()

 arch/x86/Kconfig  |   1 +
 arch/x86/entry/vdso/Makefile  |   1 +
 arch/x86/entry/vdso/extable.c |  70 +++--
 arch/x86/entry/vdso/extable.h |  21 +-
 arch/x86/entry/vdso/vdso.lds.S|   1 +
 arch/x86/entry/vdso/vprefetch.S   |  39 +++
 arch/x86/entry/vdso/vsgx.S|   9 +-
 arch/x86/include/asm/vdso.h   |  38 ++-
 arch/x86/mm/fault.c   |  11 +-
 lib/vdso/Kconfig  |   5 +
 mm/memory.c   |  47 +++-
 mm/shmem.c|   1 +
 mm/swap_state.c   |  12 +-
 tools/testing/selftests/vDSO/Makefile |   2 +
 .../selftests/vDSO/vdso_test_prefetch_page.c  | 265 ++
 15 files changed, 470 insertions(+), 53 deletions(-)
 create mode 100644 arch/x86/entry/vdso/vprefetch.S
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_prefetch_page.c

-- 
2.25.1



[PATCH v6 9/9] smp: inline on_each_cpu_cond() and on_each_cpu()

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

Simplify the code and avoid having an additional function call frame on
the stack by inlining on_each_cpu_cond() and on_each_cpu().

Cc: Andy Lutomirski 
Cc: Thomas Gleixner 
Suggested-by: Peter Zijlstra 
Signed-off-by: Nadav Amit 
---
 include/linux/smp.h | 50 
 kernel/smp.c| 56 -
 2 files changed, 36 insertions(+), 70 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 70c6f6284dcf..84a0b4828f66 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,30 +50,52 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 int wait);
 
+void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
+  void *info, bool wait, const struct cpumask *mask);
+
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+
 /*
  * Call a function on all processors
  */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, cpu_online_mask);
+}
 
-/*
- * Call a function on processors specified by mask, which might include
- * the local one.
+/**
+ * on_each_cpu_mask(): Run a function on processors specified by
+ * cpumask, which may include the local processor.
+ * @mask: The set of cpus to run on (only runs on online subset).
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed
+ *on other CPUs.
+ *
+ * If @wait is true, then returns once @func has returned.
+ *
+ * You must not call this function with disabled interrupts or from a
+ * hardware interrupt handler or from a bottom half handler.  The
+ * exception is that it may be used during early boot while
+ * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-   void *info, bool wait);
+static inline void on_each_cpu_mask(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, mask);
+}
 
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
- * processor.
+ * processor.  May be used during early boot while early_boot_irqs_disabled is
+ * set. Use local_irq_save/restore() instead of local_irq_disable/enable().
  */
-void on_each_cpu_cond(smp_cond_func_t cond_func, smp_call_func_t func,
- void *info, bool wait);
-
-void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
-  void *info, bool wait, const struct cpumask *mask);
-
-int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+static inline void on_each_cpu_cond(smp_cond_func_t cond_func,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(cond_func, func, info, wait, cpu_online_mask);
+}
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/smp.c b/kernel/smp.c
index c8a5a1facc1a..b6375d775e93 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -847,55 +847,6 @@ void __init smp_init(void)
smp_cpus_done(setup_max_cpus);
 }
 
-/*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait)
-{
-   unsigned long flags;
-
-   preempt_disable();
-   smp_call_function(func, info, wait);
-   local_irq_save(flags);
-   func(info);
-   local_irq_restore(flags);
-   preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
-/**
- * on_each_cpu_mask(): Run a function on processors specified by
- * cpumask, which may include the local processor.
- * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
- * @info: An arbitrary pointer to pass to the function.
- * @wait: If true, wait (atomically) until function has completed
- *on other CPUs.
- *
- * If @wait is true, then returns once @func has returned.
- *
- * You must not call this function with disabled interrupts or from a
- * hardware interrupt handler or from a bottom half handler.  The
- * exception is that it may be used during early boot while
- * early_boot_irqs_disabled is set.
- */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-   void *info, bool wait)
-{
-   unsigned int scf_flags;
-
-   scf_flags = SCF_RUN_LOCAL;
-   if (wait)
-   scf_flags |= SCF_WAIT

[PATCH v6 8/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

The compiler is smart enough without these hints.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 17ec4bfeee67..f4b162f273f5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -316,7 +316,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct 
*next,
local_irq_restore(flags);
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -880,7 +880,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, 
flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
@@ -907,7 +907,7 @@ static inline struct flush_tlb_info 
*get_flush_tlb_info(struct mm_struct *mm,
return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
/* Complete reentrency prevention checks */
-- 
2.25.1



[PATCH v6 7/9] cpumask: Mark functions as pure

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().
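
For reference, this is the kind of folding that __pure permits - a
generic example, not taken from the patch: two calls with the same
arguments and no intervening stores may be collapsed into one:

	int a = cpumask_next_and(-1, mask, cpu_online_mask);
	int b = cpumask_next_and(-1, mask, cpu_online_mask);
	/* With __pure, the compiler may reuse 'a' instead of calling again. */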

Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 383684e30f12..c53364c4296d 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -235,7 +235,7 @@ static inline unsigned int cpumask_last(const struct 
cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -252,8 +252,8 @@ static inline unsigned int cpumask_next_zero(int n, const 
struct cpumask *srcp)
return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int __pure cpumask_next_and(int n, const struct cpumask *, const struct 
cpumask *);
+int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
   const struct cpumask *src2p);
-- 
2.25.1



[PATCH v6 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

Blindly writing to is_lazy when the written value is identical to the
old value needlessly makes the cacheline dirty. Avoid such writes to
prevent unnecessary cache-coherency traffic.
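
The idiom is simply to read before writing; a generic sketch of the
pattern (the actual change in the diff below uses the value that was
already read into was_lazy):

	/* Only dirty the cacheline if the value actually changes. */
	if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);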

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 345a0aff5de4..17ec4bfeee67 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -469,7 +469,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+   if (was_lazy)
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and
-- 
2.25.1



[PATCH v6 5/9] x86/mm/tlb: Privatize cpu_tlbstate

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 

---
v5 -> v6:
* Fixed warning due to mismatch in
  {DEFINE|DECLARE}_PER_CPU_{SHARED_}ALIGNED
---
 arch/x86/include/asm/tlbflush.h | 39 ++---
 arch/x86/kernel/alternative.c   |  2 +-
 arch/x86/mm/init.c  |  2 +-
 arch/x86/mm/tlb.c   | 17 --
 4 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3c6681def912..fa952eadbc2e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -89,23 +89,6 @@ struct tlb_state {
u16 loaded_mm_asid;
u16 next_asid;
 
-   /*
-* We can be in one of several states:
-*
-*  - Actively using an mm.  Our CPU's bit will be set in
-*mm_cpumask(loaded_mm) and is_lazy == false;
-*
-*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
-*will not be set in mm_cpumask(_mm) and is_lazy == false.
-*
-*  - Lazily using a real mm.  loaded_mm != _mm, our bit
-*is set in mm_cpumask(loaded_mm), but is_lazy == true.
-*We're heuristically guessing that the CR3 load we
-*skipped more than makes up for the overhead added by
-*lazy mode.
-*/
-   bool is_lazy;
-
/*
 * If set we changed the page tables in such a way that we
 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -151,7 +134,27 @@ struct tlb_state {
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+   /*
+* We can be in one of several states:
+*
+*  - Actively using an mm.  Our CPU's bit will be set in
+*mm_cpumask(loaded_mm) and is_lazy == false;
+*
+*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
+*will not be set in mm_cpumask(_mm) and is_lazy == false.
+*
+*  - Lazily using a real mm.  loaded_mm != _mm, our bit
+*is set in mm_cpumask(loaded_mm), but is_lazy == true.
+*We're heuristically guessing that the CR3 load we
+*skipped more than makes up for the overhead added by
+*lazy mode.
+*/
+   bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 bool nmi_uaccess_okay(void);
 #define nmi_uaccess_okay nmi_uaccess_okay
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d..94649f86d653 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -813,7 +813,7 @@ static inline temp_mm_state_t use_temporary_mm(struct 
mm_struct *mm)
 * with a stale address space WITHOUT being in lazy mode after
 * restoring the previous mm.
 */
-   if (this_cpu_read(cpu_tlbstate.is_lazy))
+   if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
leave_mm(smp_processor_id());
 
temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dd694fb93916..ed2e36748758 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1017,7 +1017,7 @@ void __init zone_sizes_init(void)
free_area_init(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = _mm,
.next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8db87cd92e6b..345a0aff5de4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -300,7 +300,7 @@ void leave_mm(int cpu)
return;
 
/* Warn if we're not lazy. */
-   WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+   WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
switch_mm(NULL, _mm, NULL);
 }
@@ -424,7 +424,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-   bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+   bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -469,7 +469,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, str

[PATCH v6 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check whether only a local TLB
flush is needed, to restore the lost performance in this common case.
This requires checking mm_cpumask() one more time, but unless this mask
is updated very frequently, it should not impact performance negatively.
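
A simplified sketch of that fast-path check, omitting the
flush_tlb_info setup and teardown that flush_tlb_mm_range() performs:

	int cpu = get_cpu();

	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
		/* At least one other CPU may use the mm: concurrent flush. */
		flush_tlb_multi(mm_cpumask(mm), info);
	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
		/* Only the local TLB needs flushing: skip the SMP machinery. */
		local_irq_disable();
		flush_tlb_func((void *)info);
		local_irq_enable();
	}

	put_cpu();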

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 

---
v5->v6:
* Use on_each_cpu_mask() instead of on_each_cpu_cond_mask() [PeterZ]
* Use cond_cpumask when needed instead of cpumask
* Rename remaining instance of native_flush_tlb_others()
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  4 +--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 46 +--
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350c1fb0..681dba8de4f2 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,8 +52,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -61,7 +61,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -164,7 +164,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -239,6 +239,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..45b55e3e0630 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -50,7 +50,7 @@ static inline void slow_down_io(void)
 void native_flush_tlb_local(void);
 void native_flush_tlb_global(void);
 void native_flush_tlb_one_user(unsigned long addr);
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
 static inline void __flush_tlb_local(void)
@@ -68,10 +68,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void __flush_tlb_others(const struct cpumask *cpumask,
+static inline void __flush_tlb_multi(const struct cpumask *cpumask,
  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/p

[PATCH v6 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and on CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows the use
of a preallocated cpumask to compute the CPUs that should be flushed.

This will later allow us to avoid adapting on_each_cpu_cond_mask() to
support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi() might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index bf12371db6c4..07b6701a540a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -788,11 +788,13 @@ static void flush_tlb_func(void *info)
nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 STATIC_NOPV void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info)
 {
@@ -813,12 +815,37 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
cpumask *cpumask,
 * up on the new contents of what used to be page tables, while
 * doing a speculative memory access.
 */
-   if (info->freed_tables)
+   if (info->freed_tables) {
smp_call_function_many(cpumask, flush_tlb_func,
   (void *)info, 1);
-   else
-   on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
-   (void *)info, 1, cpumask);
+   } else {
+   /*
+* Although we could have used on_each_cpu_cond_mask(),
+* open-coding it has performance advantages, as it eliminates
+* the need for indirect calls or retpolines. In addition, it
+* allows to use a designated cpumask for evaluating the
+* condition, instead of allocating one.
+*
+* This code works under the assumption that there are no nested
+* TLB flushes, an assumption that is already made in
+* flush_tlb_mm_range().
+*
+* cond_cpumask is logically a stack-local variable, but it is
+* more efficient to have it off the stack and not to allocate
+* it on demand. Preemption is disabled and this code is
+* non-reentrant.
+*/
+   struct cpumask *cond_cpumask = this_cpu_ptr(_tlb_mask);
+   int cpu;
+
+   cpumask_clear(cond_cpumask);
+
+   for_each_cpu(cpu, cpumask) {
+   if (tlb_is_not_lazy(cpu))
+   __cpumask_set_cpu(cpu, cond_cpumask);
+   }
+   smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
*)info, 1);
+   }
 }
 
 void flush_tlb_others(const struct cpumask *cpumask,
-- 
2.25.1



[PATCH v6 2/9] x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

The unification of these two functions allows them to be used in the
updated SMP infrastructure.

To do so, remove the reason argument from flush_tlb_func_local(), add
a member to struct flush_tlb_info that says which CPU initiated the
flush, and act accordingly. Optimize the size of flush_tlb_info while
we are at it.

Unfortunately, this prevents us from using a constant flush_tlb_info
for arch_tlbbatch_flush(), but at a later stage we may be able to
inline flush_tlb_info into the IPI data, so it should not have an
impact eventually.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h |  5 +-
 arch/x86/mm/tlb.c   | 81 +++--
 2 files changed, 39 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a7a598af116d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -201,8 +201,9 @@ struct flush_tlb_info {
unsigned long   start;
unsigned long   end;
u64 new_tlb_gen;
-   unsigned intstride_shift;
-   boolfreed_tables;
+   unsigned intinitiating_cpu;
+   u8  stride_shift;
+   u8  freed_tables;
 };
 
 void flush_tlb_local(void);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d57f55..bf12371db6c4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -439,7 +439,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 */
 
-   /* We don't want flush_tlb_func_* to run concurrently with us. */
+   /* We don't want flush_tlb_func() to run concurrently with us. */
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());
 
@@ -647,14 +647,13 @@ void initialize_tlbstate_and_flush(void)
 }
 
 /*
- * flush_tlb_func_common()'s memory ordering requirement is that any
+ * flush_tlb_func()'s memory ordering requirement is that any
  * TLB fills that happen after we flush the TLB are ordered after we
  * read active_mm's tlb_gen.  We don't need any explicit barriers
  * because all x86 flush operations are serializing and the
  * atomic64_read operation won't be reordered by the compiler.
  */
-static void flush_tlb_func_common(const struct flush_tlb_info *f,
- bool local, enum tlb_flush_reason reason)
+static void flush_tlb_func(void *info)
 {
/*
 * We have three different tlb_gen values in here.  They are:
@@ -665,14 +664,26 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * - f->new_tlb_gen: the generation that the requester of the flush
 *   wants us to catch up to.
 */
+   const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 mm_tlb_gen = atomic64_read(_mm->context.tlb_gen);
u64 local_tlb_gen = 
this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+   bool local = smp_processor_id() == f->initiating_cpu;
+   unsigned long nr_invalidate = 0;
 
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
 
+   if (!local) {
+   inc_irq_stat(irq_tlb_count);
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+
+   /* Can only happen on remote CPUs */
+   if (f->mm && f->mm != loaded_mm)
+   return;
+   }
+
if (unlikely(loaded_mm == _mm))
return;
 
@@ -700,8 +711,7 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * be handled can catch us all the way up, leaving no work for
 * the second flush.
 */
-   trace_tlb_flush(reason, 0);
-   return;
+   goto done;
}
 
WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
@@ -748,46 +758,34 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {
/* Partial flush */
-   unsigned long nr_invalidate = (f->end - f->start) >> 
f->stride_shift;
unsigned long addr = f->start;
 
+   nr_invalidate = (f->end - f->start) >> f->stride_shift;
+
while (addr < f->end) {
flush_tlb_one_user(addr);
addr += 1UL << f->stride_shift;

[PATCH v6 1/9] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, modify smp_call_function_many_cond() to allow the callers to
provide a function that should be executed (remotely/locally), and run
the remote and local executions concurrently. Keep the other
smp_call_function_many() semantics as they are today for backward
compatibility: in that case the called function is not executed locally.

smp_call_function_many_cond() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For a
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.
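
For callers, the flags end up being used as in the on_each_cpu_mask()
wrapper later in the series - a sketch; preemption must be disabled
around the call, as the new lockdep assertion checks:

	unsigned int scf_flags = SCF_RUN_LOCAL;

	if (wait)
		scf_flags |= SCF_WAIT;

	preempt_disable();
	smp_call_function_many_cond(mask, func, info, scf_flags, cond_func);
	preempt_enable();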

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 

---
v5 -> v6:
* on_each_cpu_cond_mask() was missing preempt_disable/enable() [PeterZ]
* use multiplication instead of condition [PeterZ]
* assert preempt disabled on smp_call_function_many_cond()
* Break 80-char lines (Christoph)
---
 kernel/smp.c | 156 +--
 1 file changed, 88 insertions(+), 68 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index aeb0adfa0606..c8a5a1facc1a 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -608,12 +608,28 @@ int smp_call_function_any(const struct cpumask *mask,
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
+/*
+ * Flags to be used as scf_flags argument of smp_call_function_many_cond().
+ *
+ * %SCF_WAIT:  Wait until function execution is completed
+ * %SCF_RUN_LOCAL: Run also locally if local cpu is set in cpumask
+ */
+#define SCF_WAIT   (1U << 0)
+#define SCF_RUN_LOCAL  (1U << 1)
+
 static void smp_call_function_many_cond(const struct cpumask *mask,
smp_call_func_t func, void *info,
-   bool wait, smp_cond_func_t cond_func)
+   unsigned int scf_flags,
+   smp_cond_func_t cond_func)
 {
+   int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
-   int cpu, next_cpu, this_cpu = smp_processor_id();
+   bool wait = scf_flags & SCF_WAIT;
+   bool run_remote = false;
+   bool run_local = false;
+   int nr_cpus = 0;
+
+   lockdep_assert_preemption_disabled();
 
/*
 * Can deadlock when called with interrupts disabled.
@@ -621,8 +637,9 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 * send smp call function interrupt to this cpu and as such deadlocks
 * can't happen.
 */
-   WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-&& !oops_in_progress && !early_boot_irqs_disabled);
+   if (cpu_online(this_cpu) && !oops_in_progress &&
+   !early_boot_irqs_disabled)
+   lockdep_assert_irqs_enabled();
 
/*
 * When @wait we can deadlock when we interrupt between llist_add() and
@@ -632,60 +649,65 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 */
WARN_ON_ONCE(!in_task());
 
-   /* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+   /* Check if we need local execution. */
+   if ((scf_flags & SCF_RUN_LOCAL) && cpumask_test_cpu(this_cpu, mask))
+   run_local = true;
+
+   /* Check if we need remote execution, i.e., any CPU excluding this one. 
*/
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+   if (cpu < nr_cpu_ids)
+   run_remote = true;
 
-   /* No online cpus?  We're done. */
-   if (cpu >= nr_cpu_ids)
-   return;
-
-   /* Do we have another CPU which isn't us? */
-   next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-   if (next_cpu == this_cpu)
-   next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-   /* Fastpath: do that cpu by itself. */
-   if (next_cpu >= nr_cpu_ids) {
-   if (!cond_func || cond_func(cpu, info))
-   smp_call_function_single(cpu, func, info, wait);
-   return;
-   }
-

[PATCH v6 0/9] x86/tlb: Concurrent TLB flushes

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

The series improves TLB shootdown by flushing the local TLB concurrently
with remote TLBs, overlapping the IPI delivery time with the local
flush. Performance numbers can be found in the previous version [1].

v5 was rebased on 5.11 (a long time after v4), and had some bugs and
embarrassing build errors. Peter Zijlstra and Christoph Hellwig had some
comments as well. These issues were addressed (excluding one 82-character
line that I left). Based on their feedback, an additional patch was also
added to reuse on_each_cpu_cond_mask() code and avoid unnecessary calls
by inlining.

KernelCI showed RCU stalls on arm64, which I could not figure out from
the kernel splat. If this issue persists, I would appreciate it if
someone could assist in debugging, or at least provide the output when
running the kernel with CONFIG_CSD_LOCK_WAIT_DEBUG=Y.

[1] https://lore.kernel.org/lkml/20190823224153.15223-1-na...@vmware.com/

v5 -> v6:
* Address build warnings due to rebase mistakes
* Reuse code from on_each_cpu_cond_mask() and inline [PeterZ]
* Fix some style issues [Hellwig]

v4 -> v5:
* Rebase on 5.11
* Move concurrent smp logic to smp_call_function_many_cond() 
* Remove SGI-UV patch which is not needed anymore

v3 -> v4:
* Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
* Prevent preemption on_each_cpu(). It is not needed, but it prevents
  concerns. [Peter/tglx]
* Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many_cond()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove unnecessary uses of the inline keyword
  smp: inline on_each_cpu_cond() and on_each_cpu()

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  48 ---
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/alternative.c |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 176 +--
 arch/x86/xen/mmu_pv.c |  11 +-
 include/linux/cpumask.h   |   6 +-
 include/linux/smp.h   |  50 +--
 include/trace/events/xen.h|   2 +-
 kernel/smp.c  | 196 +++---
 15 files changed, 278 insertions(+), 250 deletions(-)

-- 
2.25.1



Re: [PATCH v5 1/8] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-18 Thread Nadav Amit
> On Feb 18, 2021, at 12:09 AM, Christoph Hellwig  wrote:
> 
> On Tue, Feb 09, 2021 at 02:16:46PM -0800, Nadav Amit wrote:
>> +/*
>> + * Flags to be used as scf_flags argument of smp_call_function_many_cond().
>> + */
>> +#define SCF_WAIT   (1U << 0)   /* Wait until function execution 
>> completed */
>> +#define SCF_RUN_LOCAL   (1U << 1)   /* Run also locally if local 
>> cpu is set in cpumask */
> 
> Can you move the comments on top of the defines to avoid the crazy
> long lines?
> 
>> +if (cpu_online(this_cpu) && !oops_in_progress && 
>> !early_boot_irqs_disabled)
> 
> Another pointlessly overly long line, with various more following.
> 
>> EXPORT_SYMBOL(on_each_cpu_cond_mask);
> 
> This isn't used by any modular code, so maybe throw in a patch to drop
> the export?

I prefer to export on_each_cpu_cond_mask() and instead turn the users
(on_each_cpu(), on_each_cpu_mask() and on_each_cpu_cond()) into inline
functions in smp.h. Otherwise, the call-chain becomes longer for no reason.
Let me know if you object.

So I will add something like:

-- >8 --

Author: Nadav Amit 
Date:   Tue Feb 16 11:04:30 2021 -0800

    smp: inline on_each_cpu_cond() and on_each_cpu_cond_mask()

Suggested-by: Peter Zijlstra 
Signed-off-by: Nadav Amit 

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 70c6f6284dcf..84a0b4828f66 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,30 +50,52 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 int wait);
 
+void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
+  void *info, bool wait, const struct cpumask *mask);
+
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+
 /*
  * Call a function on all processors
  */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, cpu_online_mask);
+}
 
-/*
- * Call a function on processors specified by mask, which might include
- * the local one.
+/**
+ * on_each_cpu_mask(): Run a function on processors specified by
+ * cpumask, which may include the local processor.
+ * @mask: The set of cpus to run on (only runs on online subset).
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed
+ *on other CPUs.
+ *
+ * If @wait is true, then returns once @func has returned.
+ *
+ * You must not call this function with disabled interrupts or from a
+ * hardware interrupt handler or from a bottom half handler.  The
+ * exception is that it may be used during early boot while
+ * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-   void *info, bool wait);
+static inline void on_each_cpu_mask(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(NULL, func, info, wait, mask);
+}
 
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
- * processor.
+ * processor.  May be used during early boot while early_boot_irqs_disabled is
+ * set. Use local_irq_save/restore() instead of local_irq_disable/enable().
  */
-void on_each_cpu_cond(smp_cond_func_t cond_func, smp_call_func_t func,
- void *info, bool wait);
-
-void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
-  void *info, bool wait, const struct cpumask *mask);
-
-int smp_call_function_single_async(int cpu, call_single_data_t *csd);
+static inline void on_each_cpu_cond(smp_cond_func_t cond_func,
+   smp_call_func_t func, void *info, bool wait)
+{
+   on_each_cpu_cond_mask(cond_func, func, info, wait, cpu_online_mask);
+}
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/smp.c b/kernel/smp.c
index 629f1f7b80db..a75f3d1dd1b7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -843,55 +843,6 @@ void __init smp_init(void)
smp_cpus_done(setup_max_cpus);
 }
 
-/*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait)
-{
-   unsigned long flags;
-
-   preempt_disable();
-   smp_call_function(func, info, wait);
-   local_irq_save(flags);
-   func(info);
-   local_irq_restore(flags);
-   preempt_enable();
-}
-EXPORT_SYMBOL(on_each

Re: [PATCH v5 3/8] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2021-02-18 Thread Nadav Amit
> On Feb 18, 2021, at 12:16 AM, Christoph Hellwig  wrote:
> 
> On Tue, Feb 09, 2021 at 02:16:48PM -0800, Nadav Amit wrote:
>> +/*
>> + * Although we could have used on_each_cpu_cond_mask(),
>> + * open-coding it has performance advantages, as it eliminates
>> + * the need for indirect calls or retpolines. In addition, it
>> + * allows to use a designated cpumask for evaluating the
>> + * condition, instead of allocating one.
>> + *
>> + * This code works under the assumption that there are no nested
>> + * TLB flushes, an assumption that is already made in
>> + * flush_tlb_mm_range().
>> + *
>> + * cond_cpumask is logically a stack-local variable, but it is
>> + * more efficient to have it off the stack and not to allocate
>> + * it on demand. Preemption is disabled and this code is
>> + * non-reentrant.
>> + */
>> +struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
>> +int cpu;
>> +
>> +cpumask_clear(cond_cpumask);
>> +
>> +for_each_cpu(cpu, cpumask) {
>> +if (tlb_is_not_lazy(cpu))
>> +__cpumask_set_cpu(cpu, cond_cpumask);
>> +}
>> +smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
>> *)info, 1);
> 
> No need for the cast here, which would also avoid the pointlessly
> overly long line.

Actually, there is - to remove the const qualifier. You might argue it is
ugly, but that's also how it is done right now.

In general, thanks for the feedback (I will reply after I have addressed
it). I do have a general question - I thought it was decided that
clarity should be preferred over following the 80-column limit. Please let
me know if I misunderstood.

Re: [PATCH v5 1/8] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 10:59 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 16, 2021 at 06:53:09PM +0000, Nadav Amit wrote:
>>> On Feb 16, 2021, at 8:32 AM, Peter Zijlstra  wrote:
> 
>>> I'm not sure I can explain it yet. It did get me looking at
>>> on_each_cpu() and it appears that wants to be converted too, something
>>> like the below perhaps.
>> 
>> Looks like a good cleanup, but I cannot say I understand the problem and how
>> it would solve it. Err...
> 
> Yeah, me neither. Bit of a mystery so far.

This stall seems to be real. Intuitively I presumed preemption was
mistakenly enabled, but it does not seem so.

Any chance you can build the kernel with “CONFIG_CSD_LOCK_WAIT_DEBUG=Y” and
rerun it? Perhaps that output will tell us more.



Local execution of ipi_sync_rq_state() on sync_runqueues_membarrier_state()

2021-02-16 Thread Nadav Amit
Hello Mathieu,

While trying to find some unrelated bug, something in
sync_runqueues_membarrier_state() caught my eye:


  static int sync_runqueues_membarrier_state(struct mm_struct *mm)
  {
	if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
this_cpu_write(runqueues.membarrier_state, membarrier_state);

/*
 * For single mm user, we can simply issue a memory barrier
 * after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED in the
 * mm and in the current runqueue to guarantee that no memory
 * access following registration is reordered before
 * registration. 
 */
smp_mb();
return 0;
}

 [ snip ]

smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);


And ipi_sync_rq_state() does:

this_cpu_write(runqueues.membarrier_state,
		       atomic_read(&mm->membarrier_state));


So my question: are you aware that smp_call_function_many() would not run
ipi_sync_rq_state() on the local CPU? Is that the intention of the code?
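
For context, the usual pattern when the local CPU must also run the
handler looks roughly like the sketch below (my illustration, not a
quote of the membarrier code); on_each_cpu_mask() wraps the same idea:

	/*
	 * Illustrative sketch: smp_call_function_many() only IPIs the
	 * *other* CPUs, so a caller that also wants the handler to run
	 * locally has to invoke it by hand (or use on_each_cpu_mask()).
	 */
	preempt_disable();
	smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
	if (cpumask_test_cpu(smp_processor_id(), tmpmask))
		ipi_sync_rq_state(mm);		/* run locally as well */
	preempt_enable();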

Thanks,
Nadav


Re: [PATCH] drivers: vmw_balloon: remove dentry pointer for debugfs

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 7:12 AM, Greg Kroah-Hartman  
> wrote:
> 
> There is no need to keep the dentry pointer around for the created
> debugfs file, as it is only needed when removing it from the system.
> When it is to be removed, ask debugfs itself for the pointer, to save on
> storage and make things a bit simpler.
> 
> Cc: Nadav Amit 
> Cc: "VMware, Inc." 
> Cc: Arnd Bergmann 
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman 
> ---

Thanks for the cleanup.

Acked-by: Nadav Amit 



Re: [PATCH v5 4/8] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 4:10 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 09, 2021 at 02:16:49PM -0800, Nadav Amit wrote:
>> @@ -816,8 +821,8 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
>> cpumask *cpumask,
>>   * doing a speculative memory access.
>>   */
>>  if (info->freed_tables) {
>> -smp_call_function_many(cpumask, flush_tlb_func,
>> -   (void *)info, 1);
>> +on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
>> +  cpumask);
>>  } else {
>>  /*
>>   * Although we could have used on_each_cpu_cond_mask(),
>> @@ -844,14 +849,15 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
>> cpumask *cpumask,
>>  if (tlb_is_not_lazy(cpu))
>>  __cpumask_set_cpu(cpu, cond_cpumask);
>>  }
>> -smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
>> *)info, 1);
>> +on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
>> +  cpumask);
>>  }
>> }
> 
> Surely on_each_cpu_mask() is more appropriate? There the compiler can do
> the NULL propagation because it's on the same TU.
> 
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -821,8 +821,7 @@ STATIC_NOPV void native_flush_tlb_multi(
>* doing a speculative memory access.
>*/
>   if (info->freed_tables) {
> - on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
> -   cpumask);
> + on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
>   } else {
>   /*
>* Although we could have used on_each_cpu_cond_mask(),
> @@ -849,8 +848,7 @@ STATIC_NOPV void native_flush_tlb_multi(
>   if (tlb_is_not_lazy(cpu))
>   __cpumask_set_cpu(cpu, cond_cpumask);
>   }
> - on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
> -   cpumask);
> + on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
>   }
> }

Indeed, and there is actually an additional bug - I used cpumask in the
second on_each_cpu_cond_mask() instead of cond_cpumask.



Re: [PATCH v5 1/8] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 10:59 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 16, 2021 at 06:53:09PM +0000, Nadav Amit wrote:
>>> On Feb 16, 2021, at 8:32 AM, Peter Zijlstra  wrote:
> 
>>> I'm not sure I can explain it yet. It did get me looking at
>>> on_each_cpu() and it appears that wants to be converted too, something
>>> like the below perhaps.
>> 
>> Looks like a good cleanup, but I cannot say I understand the problem and how
>> it would solve it. Err...
> 
> Yeah, me neither. Bit of a mystery so far.

I’ll try to see whether I can figure it out. Perhaps there is an
assumption somewhere about the ordering between the local and remote
function invocations.

Regardless, would you want me to have on_each_cpu() as inline or to keep it
in smp.c?

Re: [PATCH v5 1/8] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 8:32 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 09, 2021 at 02:16:46PM -0800, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Currently, on_each_cpu() and similar functions do not exploit the
>> potential of concurrency: the function is first executed remotely and
>> only then it is executed locally. Functions such as TLB flush can take
>> considerable time, so this provides an opportunity for performance
>> optimization.
>> 
>> To do so, modify smp_call_function_many_cond(), to allows the callers to
>> provide a function that should be executed (remotely/locally), and run
>> them concurrently. Keep other smp_call_function_many() semantic as it is
>> today for backward compatibility: the called function is not executed in
>> this case locally.
>> 
>> smp_call_function_many_cond() does not use the optimized version for a
>> single remote target that smp_call_function_single() implements. For
>> synchronous function call, smp_call_function_single() keeps a
>> call_single_data (which is used for synchronization) on the stack.
>> Interestingly, it seems that not using this optimization provides
>> greater performance improvements (greater speedup with a single remote
>> target than with multiple ones). Presumably, holding data structures
>> that are intended for synchronization on the stack can introduce
>> overheads due to TLB misses and false-sharing when the stack is used for
>> other purposes.
>> 
>> Reviewed-by: Dave Hansen 
>> Cc: Peter Zijlstra 
>> Cc: Rik van Riel 
>> Cc: Thomas Gleixner 
>> Cc: Andy Lutomirski 
>> Cc: Josh Poimboeuf 
>> Signed-off-by: Nadav Amit 
> 
> Kernel-CI is giving me a regression that's most likely this patch:
> 
>  
> https://kernelci.org/test/case/id/602bdd621c979f83faaddcc6/
> 
> I'm not sure I can explain it yet. It did get me looking at
> on_each_cpu() and it appears that wants to be converted too, something
> like the below perhaps.

Looks like a good cleanup, but I cannot say I understand the problem and how
it would solve it. Err...

Re: [PATCH v5 1/8] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 4:04 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 09, 2021 at 02:16:46PM -0800, Nadav Amit wrote:
>> @@ -894,17 +911,12 @@ EXPORT_SYMBOL(on_each_cpu_mask);
>> void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
>> void *info, bool wait, const struct cpumask *mask)
>> {
>> -int cpu = get_cpu();
>> +unsigned int scf_flags = SCF_RUN_LOCAL;
>> 
>> -smp_call_function_many_cond(mask, func, info, wait, cond_func);
>> -if (cpumask_test_cpu(cpu, mask) && cond_func(cpu, info)) {
>> -unsigned long flags;
>> +if (wait)
>> +scf_flags |= SCF_WAIT;
>> 
>> -local_irq_save(flags);
>> -func(info);
>> -local_irq_restore(flags);
>> -}
>> -put_cpu();
>> +smp_call_function_many_cond(mask, func, info, scf_flags, cond_func);
>> }
>> EXPORT_SYMBOL(on_each_cpu_cond_mask);
> 
> You lost the preempt_disable() there, I've added it back:
> 
> ---
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -920,7 +920,9 @@ void on_each_cpu_cond_mask(smp_cond_func
>   if (wait)
>   scf_flags |= SCF_WAIT;
> 
> + preempt_disable();
>   smp_call_function_many_cond(mask, func, info, scf_flags, cond_func);
> + preempt_enable();
> }
> EXPORT_SYMBOL(on_each_cpu_cond_mask);

Indeed. I will add lockdep_assert_preemption_disabled() to
smp_call_function_many_cond() to prevent this mistake from reoccurring.



[PATCH v5 3/8] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows the use
of a preallocated cpumask for computing the CPUs that should be flushed.

This would later allow us not to adapt on_each_cpu_cond_mask() to
support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi(), might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index bf12371db6c4..07b6701a540a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -788,11 +788,13 @@ static void flush_tlb_func(void *info)
nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 STATIC_NOPV void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info)
 {
@@ -813,12 +815,37 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
cpumask *cpumask,
 * up on the new contents of what used to be page tables, while
 * doing a speculative memory access.
 */
-   if (info->freed_tables)
+   if (info->freed_tables) {
smp_call_function_many(cpumask, flush_tlb_func,
   (void *)info, 1);
-   else
-   on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
-   (void *)info, 1, cpumask);
+   } else {
+   /*
+* Although we could have used on_each_cpu_cond_mask(),
+* open-coding it has performance advantages, as it eliminates
+* the need for indirect calls or retpolines. In addition, it
+* allows to use a designated cpumask for evaluating the
+* condition, instead of allocating one.
+*
+* This code works under the assumption that there are no nested
+* TLB flushes, an assumption that is already made in
+* flush_tlb_mm_range().
+*
+* cond_cpumask is logically a stack-local variable, but it is
+* more efficient to have it off the stack and not to allocate
+* it on demand. Preemption is disabled and this code is
+* non-reentrant.
+*/
+   struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+   int cpu;
+
+   cpumask_clear(cond_cpumask);
+
+   for_each_cpu(cpu, cpumask) {
+   if (tlb_is_not_lazy(cpu))
+   __cpumask_set_cpu(cpu, cond_cpumask);
+   }
+   smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
*)info, 1);
+   }
 }
 
 void flush_tlb_others(const struct cpumask *cpumask,
-- 
2.25.1



[PATCH v5 6/8] x86/mm/tlb: Do not make is_lazy dirty for no reason

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

Blindly writing to is_lazy when the written value is identical to the
old value makes the cacheline dirty for no reason. Avoid such writes to
prevent unnecessary cache coherency traffic.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e0271e0f84ea..98d212518f67 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -469,7 +469,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+   if (was_lazy)
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and
-- 
2.25.1



[PATCH v5 7/8] cpumask: Mark functions as pure

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().
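
As an aside, a small sketch (mine, not from the patch) of the kind of
transformation the attribute permits: __pure promises no side effects,
so the compiler may compute the result once and reuse it:

/*
 * Illustrative sketch only.  With cpumask_next_and() marked __pure,
 * the compiler is allowed to evaluate it once and reuse the value for
 * both uses below, since the attribute guarantees no side effects.
 */
static int example_next_cpu(int cpu, const struct cpumask *a,
			    const struct cpumask *b)
{
	if (cpumask_next_and(cpu, a, b) >= nr_cpu_ids)
		return -1;
	return cpumask_next_and(cpu, a, b);	/* may be folded with the call above */
}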

Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 383684e30f12..e86b7d027cfb 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -235,7 +235,7 @@ static inline unsigned int cpumask_last(const struct 
cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -252,8 +252,8 @@ static inline unsigned int cpumask_next_zero(int n, const 
struct cpumask *srcp)
return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+__pure int cpumask_next_and(int n, const struct cpumask *, const struct 
cpumask *);
+__pure int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
   const struct cpumask *src2p);
-- 
2.25.1



[PATCH v5 8/8] x86/mm/tlb: Remove unnecessary uses of the inline keyword

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

The compiler is smart enough without these hints.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 98d212518f67..4cc28c624d1f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -316,7 +316,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct 
*next,
local_irq_restore(flags);
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -882,7 +882,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, 
flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
@@ -909,7 +909,7 @@ static inline struct flush_tlb_info 
*get_flush_tlb_info(struct mm_struct *mm,
return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
/* Complete reentrency prevention checks */
-- 
2.25.1



[PATCH v5 5/8] x86/mm/tlb: Privatize cpu_tlbstate

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h | 39 ++---
 arch/x86/kernel/alternative.c   |  2 +-
 arch/x86/mm/init.c  |  2 +-
 arch/x86/mm/tlb.c   | 17 --
 4 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3c6681def912..fa952eadbc2e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -89,23 +89,6 @@ struct tlb_state {
u16 loaded_mm_asid;
u16 next_asid;
 
-   /*
-* We can be in one of several states:
-*
-*  - Actively using an mm.  Our CPU's bit will be set in
-*mm_cpumask(loaded_mm) and is_lazy == false;
-*
-*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
-*will not be set in mm_cpumask(_mm) and is_lazy == false.
-*
-*  - Lazily using a real mm.  loaded_mm != _mm, our bit
-*is set in mm_cpumask(loaded_mm), but is_lazy == true.
-*We're heuristically guessing that the CR3 load we
-*skipped more than makes up for the overhead added by
-*lazy mode.
-*/
-   bool is_lazy;
-
/*
 * If set we changed the page tables in such a way that we
 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -151,7 +134,27 @@ struct tlb_state {
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+   /*
+* We can be in one of several states:
+*
+*  - Actively using an mm.  Our CPU's bit will be set in
+*mm_cpumask(loaded_mm) and is_lazy == false;
+*
+*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
+*will not be set in mm_cpumask(_mm) and is_lazy == false.
+*
+*  - Lazily using a real mm.  loaded_mm != _mm, our bit
+*is set in mm_cpumask(loaded_mm), but is_lazy == true.
+*We're heuristically guessing that the CR3 load we
+*skipped more than makes up for the overhead added by
+*lazy mode.
+*/
+   bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 bool nmi_uaccess_okay(void);
 #define nmi_uaccess_okay nmi_uaccess_okay
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d..94649f86d653 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -813,7 +813,7 @@ static inline temp_mm_state_t use_temporary_mm(struct 
mm_struct *mm)
 * with a stale address space WITHOUT being in lazy mode after
 * restoring the previous mm.
 */
-   if (this_cpu_read(cpu_tlbstate.is_lazy))
+   if (this_cpu_read(cpu_tlbstate_shared.is_lazy))
leave_mm(smp_processor_id());
 
temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e26f5c5c6565..5afa8bdd2021 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1008,7 +1008,7 @@ void __init zone_sizes_init(void)
free_area_init(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
	.loaded_mm = &init_mm,
.next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 78fcbd58716e..e0271e0f84ea 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -300,7 +300,7 @@ void leave_mm(int cpu)
return;
 
/* Warn if we're not lazy. */
-   WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+   WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
	switch_mm(NULL, &init_mm, NULL);
 }
@@ -424,7 +424,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-   bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+   bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -469,7 +469,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write

[PATCH v5 4/8] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
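
Conceptually, the added check looks roughly like the sketch below (a
simplified illustration of the idea, not the literal hunk from this
patch):

	/*
	 * Simplified sketch: if no CPU other than the current one has
	 * the mm loaded, call flush_tlb_func() directly and skip the
	 * SMP call machinery (and its indirect branches) entirely.
	 */
	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
		flush_tlb_multi(mm_cpumask(mm), info);	/* remote + local, concurrently */
	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func((void *)info);		/* local-only fast path */
		local_irq_enable();
	}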

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  4 +--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 --
 arch/x86/mm/tlb.c | 49 +--
 arch/x86/xen/mmu_pv.c | 11 +++---
 include/trace/events/xen.h|  2 +-
 9 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350c1fb0..681dba8de4f2 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,8 +52,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -61,7 +61,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -164,7 +164,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -239,6 +239,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index f8dce11d2bc1..515e49204c8b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -50,7 +50,7 @@ static inline void slow_down_io(void)
 void native_flush_tlb_local(void);
 void native_flush_tlb_global(void);
 void native_flush_tlb_one_user(unsigned long addr);
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
 static inline void __flush_tlb_local(void)
@@ -68,10 +68,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void __flush_tlb_others(const struct cpumask *cpumask,
+static inline void __flush_tlb_multi(const struct cpumask *cpumask,
  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index b6b02b7c19cc..541fe7193526 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -201,8 +201,8 @@ struct pv_mmu_ops {
voi

[PATCH v5 0/8] x86/tlb: Concurrent TLB flushes

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

This is a respin of a rebased version of an old series, which I did not
follow, as I was preoccupied with personal issues (sorry).

The series improves TLB shootdown by flushing the local TLB concurrently
with remote TLBs, overlapping the IPI delivery time with the local
flush. Performance numbers can be found in the previous version [1].

The patches are essentially the same, but rebasing on the last version
required some changes. I left the reviewed-by tags - if anyone considers
it inappropriate, please let me know (and you have my apology).

[1] https://lore.kernel.org/lkml/20190823224153.15223-1-na...@vmware.com/

v4 -> v5:
* Rebase on 5.11
* Move concurrent smp logic to smp_call_function_many_cond() 
* Remove SGI-UV patch which is not needed anymore

v3 -> v4:
* Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
* Prevent preemption on_each_cpu(). It is not needed, but it prevents
  concerns. [Peter/tglx]
* Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (8):
  smp: Run functions concurrently in smp_call_function_many_cond()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  48 +++
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/alternative.c |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 177 +++---
 arch/x86/xen/mmu_pv.c |  11 +-
 include/linux/cpumask.h   |   6 +-
 include/trace/events/xen.h|   2 +-
 kernel/smp.c  | 148 +++--
 13 files changed, 242 insertions(+), 187 deletions(-)

-- 
2.25.1



[PATCH v5 1/8] smp: Run functions concurrently in smp_call_function_many_cond()

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, modify smp_call_function_many_cond() to allow the callers to
provide a function that should be executed (remotely/locally), and run
the remote and local invocations concurrently. Keep the other
smp_call_function_many() semantics as they are today for backward
compatibility: the called function is not executed locally in this
case.

smp_call_function_many_cond() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 kernel/smp.c | 148 ---
 1 file changed, 80 insertions(+), 68 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 1b6070bf97bb..c308130f3bc9 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -604,12 +604,23 @@ int smp_call_function_any(const struct cpumask *mask,
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
+/*
+ * Flags to be used as scf_flags argument of smp_call_function_many_cond().
+ */
+#define SCF_WAIT   (1U << 0)   /* Wait until function execution 
completed */
+#define SCF_RUN_LOCAL  (1U << 1)   /* Run also locally if local cpu is set 
in cpumask */
+
 static void smp_call_function_many_cond(const struct cpumask *mask,
smp_call_func_t func, void *info,
-   bool wait, smp_cond_func_t cond_func)
+   unsigned int scf_flags,
+   smp_cond_func_t cond_func)
 {
+   int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
-   int cpu, next_cpu, this_cpu = smp_processor_id();
+   bool wait = scf_flags & SCF_WAIT;
+   bool run_remote = false;
+   bool run_local = false;
+   int nr_cpus = 0;
 
/*
 * Can deadlock when called with interrupts disabled.
@@ -617,8 +628,8 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 * send smp call function interrupt to this cpu and as such deadlocks
 * can't happen.
 */
-   WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-&& !oops_in_progress && !early_boot_irqs_disabled);
+   if (cpu_online(this_cpu) && !oops_in_progress && 
!early_boot_irqs_disabled)
+   lockdep_assert_irqs_enabled();
 
/*
 * When @wait we can deadlock when we interrupt between llist_add() and
@@ -628,60 +639,65 @@ static void smp_call_function_many_cond(const struct 
cpumask *mask,
 */
WARN_ON_ONCE(!in_task());
 
-   /* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+   /* Check if we need local execution. */
+   if ((scf_flags & SCF_RUN_LOCAL) && cpumask_test_cpu(this_cpu, mask))
+   run_local = true;
+
+   /* Check if we need remote execution, i.e., any CPU excluding this one. 
*/
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+   if (cpu < nr_cpu_ids)
+   run_remote = true;
 
-   /* No online cpus?  We're done. */
-   if (cpu >= nr_cpu_ids)
-   return;
-
-   /* Do we have another CPU which isn't us? */
-   next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-   if (next_cpu == this_cpu)
-   next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-   /* Fastpath: do that cpu by itself. */
-   if (next_cpu >= nr_cpu_ids) {
-   if (!cond_func || cond_func(cpu, info))
-   smp_call_function_single(cpu, func, info, wait);
-   return;
-   }
-
-   cfd = this_cpu_ptr(&cfd_data);
-
-   cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-   __cpumask_clear_cpu(this_cpu, cfd->cpumask);
+   if (run_remote) {
+   cfd = this_cpu_ptr(&cfd_data);
+   cpumask_and(cfd->cpumask, mask, cpu_online_mask);
+   __cpumask_clear_cpu(this_

[PATCH v5 2/8] x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

The unification of these two functions allows using them in the updated
SMP infrastructure.

To do so, remove the reason argument from flush_tlb_func_local(), add
a member to struct flush_tlb_info that says which CPU initiated the
flush and act accordingly. Optimize the size of flush_tlb_info while we
are at it.

Unfortunately, this prevents us from using a constant flush_tlb_info for
arch_tlbbatch_flush(), but at a later stage we may be able to inline
flush_tlb_info into the IPI data, so it should not have an impact
eventually.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h |  5 +-
 arch/x86/mm/tlb.c   | 81 +++--
 2 files changed, 39 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a7a598af116d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -201,8 +201,9 @@ struct flush_tlb_info {
unsigned long   start;
unsigned long   end;
u64 new_tlb_gen;
-   unsigned intstride_shift;
-   boolfreed_tables;
+   unsigned intinitiating_cpu;
+   u8  stride_shift;
+   u8  freed_tables;
 };
 
 void flush_tlb_local(void);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d57f55..bf12371db6c4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -439,7 +439,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 */
 
-   /* We don't want flush_tlb_func_* to run concurrently with us. */
+   /* We don't want flush_tlb_func() to run concurrently with us. */
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());
 
@@ -647,14 +647,13 @@ void initialize_tlbstate_and_flush(void)
 }
 
 /*
- * flush_tlb_func_common()'s memory ordering requirement is that any
+ * flush_tlb_func()'s memory ordering requirement is that any
  * TLB fills that happen after we flush the TLB are ordered after we
  * read active_mm's tlb_gen.  We don't need any explicit barriers
  * because all x86 flush operations are serializing and the
  * atomic64_read operation won't be reordered by the compiler.
  */
-static void flush_tlb_func_common(const struct flush_tlb_info *f,
- bool local, enum tlb_flush_reason reason)
+static void flush_tlb_func(void *info)
 {
/*
 * We have three different tlb_gen values in here.  They are:
@@ -665,14 +664,26 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * - f->new_tlb_gen: the generation that the requester of the flush
 *   wants us to catch up to.
 */
+   const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
u64 local_tlb_gen = 
this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+   bool local = smp_processor_id() == f->initiating_cpu;
+   unsigned long nr_invalidate = 0;
 
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
 
+   if (!local) {
+   inc_irq_stat(irq_tlb_count);
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+
+   /* Can only happen on remote CPUs */
+   if (f->mm && f->mm != loaded_mm)
+   return;
+   }
+
	if (unlikely(loaded_mm == &init_mm))
return;
 
@@ -700,8 +711,7 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * be handled can catch us all the way up, leaving no work for
 * the second flush.
 */
-   trace_tlb_flush(reason, 0);
-   return;
+   goto done;
}
 
WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
@@ -748,46 +758,34 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {
/* Partial flush */
-   unsigned long nr_invalidate = (f->end - f->start) >> 
f->stride_shift;
unsigned long addr = f->start;
 
+   nr_invalidate = (f->end - f->start) >> f->stride_shift;
+
while (addr < f->end) {
flush_tlb_one_user(addr);
addr += 1UL << f->stride_shift;

Re: [RFC 01/20] mm/tlb: fix fullmm semantics

2021-02-03 Thread Nadav Amit
> On Feb 3, 2021, at 1:44 AM, Will Deacon  wrote:
> 
> On Tue, Feb 02, 2021 at 01:35:38PM -0800, Nadav Amit wrote:
>>> On Feb 2, 2021, at 3:00 AM, Peter Zijlstra  wrote:
>>> 
>>> On Tue, Feb 02, 2021 at 01:32:36AM -0800, Nadav Amit wrote:
>>>>> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra  wrote:
>>>>> 
>>>>> 
>>>>> https://lkml.kernel.org/r/20210127235347.1402-1-w...@kernel.org
>>>> 
>>>> I have seen this series, and applied my patches on it.
>>>> 
>>>> Despite Will’s patches, there were still inconsistencies between fullmm
>>>> and need_flush_all.
>>>> 
>>>> Am I missing something?
>>> 
>>> I wasn't aware you were on top. I'll look again.
>> 
>> Looking on arm64’s tlb_flush() makes me think that there is currently a bug
>> that this patch fixes. Arm64’s tlb_flush() does:
>> 
>>   /*
>>* If we're tearing down the address space then we only care about
>>* invalidating the walk-cache, since the ASID allocator won't
>>* reallocate our ASID without invalidating the entire TLB.
>>*/
>>   if (tlb->fullmm) {
>>   if (!last_level)
>>   flush_tlb_mm(tlb->mm);
>>   return;
>>   } 
>> 
>> But currently tlb_mmu_finish() can mistakenly set fullmm incorrectly (if
>> mm_tlb_flush_nested() is true), which might skip the TLB flush.
> 
> But in that case isn't 'freed_tables' set to 1, so 'last_level' will be
> false and we'll do the flush in the code above?

Indeed. You are right. So no rush.

Re: [RFC 01/20] mm/tlb: fix fullmm semantics

2021-02-02 Thread Nadav Amit
> On Feb 2, 2021, at 3:00 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 02, 2021 at 01:32:36AM -0800, Nadav Amit wrote:
>>> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra  wrote:
>>> 
>>> 
>>> https://lkml.kernel.org/r/20210127235347.1402-1-w...@kernel.org
>> 
>> I have seen this series, and applied my patches on it.
>> 
>> Despite Will’s patches, there were still inconsistencies between fullmm
>> and need_flush_all.
>> 
>> Am I missing something?
> 
> I wasn't aware you were on top. I'll look again.

Looking on arm64’s tlb_flush() makes me think that there is currently a bug
that this patch fixes. Arm64’s tlb_flush() does:

   /*
* If we're tearing down the address space then we only care about
* invalidating the walk-cache, since the ASID allocator won't
* reallocate our ASID without invalidating the entire TLB.
*/
   if (tlb->fullmm) {
   if (!last_level)
   flush_tlb_mm(tlb->mm);
   return;
   } 

But currently tlb_mmu_finish() can mistakenly set fullmm incorrectly (if
mm_tlb_flush_nested() is true), which might skip the TLB flush.

Lucky for us, arm64 flushes each VMA separately (which as we discussed
separately may not be necessary), so the only PTEs that might not be flushed
are PTEs that are updated concurrently by another thread that also defer
their flushes. It therefore seems that the implications are more on the
correctness of certain syscalls (e.g., madvise(DONT_NEED)) without
implications on security or memory corruptions.

Let me know if you want me to send this patch separately with an updated
commit log for faster inclusion.

Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity

2021-02-02 Thread Nadav Amit
> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski  wrote:
> 
> 
>> On Feb 1, 2021, at 2:04 PM, Nadav Amit  wrote:
>> 
>> Andy’s comments managed to make me realize this code is wrong. We must
>> call inc_mm_tlb_gen(mm) every time.
>> 
>> Otherwise, a CPU that saw the old tlb_gen and updated it in its local
>> cpu_tlbstate on a context-switch. If the process was not running when the
>> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later
>> switch_mm_irqs_off() back to the process will not flush the local TLB.
>> 
>> I need to think if there is a better solution. Multiple calls to
>> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
>> instead of one that is specific to the ranges, once the flush actually takes
>> place. On x86 it’s practically a non-issue, since anyhow any update of more
>> than 33-entries or so would cause a full TLB flush, but this is still ugly.
> 
> What if we had a per-mm ring buffer of flushes?  When starting a flush, we 
> would stick the range in the ring buffer and, when flushing, we would read 
> the ring buffer to catch up.  This would mostly replace the flush_tlb_info 
> struct, and it would let us process multiple partial flushes together.

I wanted to sleep on it, and went back and forth on whether it is the right
direction, hence the late response.

I think that what you say make sense. I think that I even tried to do once
something similar for some reason, but my memory plays tricks on me.

So tell me what you think of this ring-based solution. As you said, you
keep a per-mm ring of flush_tlb_info. When you queue an entry, you do
something like:

#define RING_ENTRY_INVALID (0)

  gen = inc_mm_tlb_gen(mm);
   struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
   spin_lock(&mm->ring_lock);
   WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
   smp_wmb();
   info->start = start;
   info->end = end;
   info->stride_shift = stride_shift;
   info->freed_tables = freed_tables;
   smp_store_release(&info->new_tlb_gen, gen);
   spin_unlock(&mm->ring_lock);
  
When you flush you use the entry generation as a sequence lock. On overflow
of the ring (i.e., sequence number mismatch) you perform a full flush:

  for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
		struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];

		// detect overflow and invalid entries
		if (smp_load_acquire(&info->new_tlb_gen) != gen)
			goto full_flush;

		start = min(start, info->start);
		end = max(end, info->end);
		stride_shift = min(stride_shift, info->stride_shift);
		freed_tables |= info->freed_tables;
smp_rmb();

// seqlock-like check that the information was not updated 
if (READ_ONCE(info->new_tlb_gen) != gen)
goto full_flush;
  }

On x86 I suspect that performing a full TLB flush would anyhow be the best
thing to do if there is more than a single entry. I am also not sure that it
makes sense to check the ring from flush_tlb_func_common() (i.e., in each
IPI handler) as it might cause cache thrashing.

Instead it may be better to do so from flush_tlb_mm_range(), when the
flushes are initiated, and use an aggregated flush_tlb_info for the flush.

It may also be better to have the ring arch-independent, so it would
resemble more of mmu_gather (the parts about the TLB flush information,
without the freed pages stuff).

We can detect deferred TLB flushes either by storing “deferred_gen” in the
page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed
to tlb_gen, and checking for an overlap. I think page-tables would be the
most efficient/scalable, but perhaps going over the ring would make the
logic easier to understand.
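
To make the overlap check concrete, a rough sketch (the helper name and
the ring fields are hypothetical, following the scheme sketched above):

/*
 * Rough sketch only: mm_has_pending_flush(), tlb_gen_completed and the
 * ring layout are hypothetical names from the scheme sketched above.
 * Returns true if a not-yet-completed deferred flush may cover
 * [start, end).
 */
static bool mm_has_pending_flush(struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	u64 gen;

	for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
		struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];

		/* Overwritten or in-flight entry: assume it overlaps. */
		if (smp_load_acquire(&info->new_tlb_gen) != gen)
			return true;
		if (info->start < end && info->end > start)
			return true;
	}
	return false;
}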

Makes sense? Thoughts?

Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-02 Thread Nadav Amit
> On Feb 2, 2021, at 1:31 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
>> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
>> each VMA. I suspect it is not intentional.
> 
> ARM is one of those that look at the VM_EXEC bit to explicitly flush
> ITLB IIRC, so it has to.

Hmm… I don’t think Arm is doing that. At least arm64 does not use the
default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
this path):

static inline void tlb_flush(struct mmu_gather *tlb)
{
struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
bool last_level = !tlb->freed_tables;
unsigned long stride = tlb_get_unmap_size(tlb);
int tlb_level = tlb_get_level(tlb);

/*
 * If we're tearing down the address space then we only care about
 * invalidating the walk-cache, since the ASID allocator won't
 * reallocate our ASID without invalidating the entire TLB.
 */
if (tlb->mm_exiting) {
if (!last_level)
flush_tlb_mm(tlb->mm);
return;
}   

__flush_tlb_range(&vma, tlb->start, tlb->end, stride,
  last_level, tlb_level);
}

Re: [RFC 01/20] mm/tlb: fix fullmm semantics

2021-02-02 Thread Nadav Amit
> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra  wrote:
> 
> 
> https://lkml.kernel.org/r/20210127235347.1402-1-w...@kernel.org

I have seen this series, and applied my patches on it.

Despite Will’s patches, there were still inconsistencies between fullmm
and need_flush_all.

Am I missing something?

Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-01 Thread Nadav Amit
> On Feb 1, 2021, at 10:41 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
>> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
>> How about:
>> 
>>  CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH
> 
> Yes please, have to have descriptive names.

Point taken. I will fix it.

> 
> I didn't quite see why this was much of an improvement though. Maybe 
> follow up patches take advantage of it? I didn't see how they all fit 
> together.

They do, but I realized, as I said in other emails, that I have a serious bug
in the deferred invalidation scheme.

Having said that, I think there is an advantage of having an explicit config
option instead of relying on whether tlb_end_vma is defined. For instance,
Arm does not define tlb_end_vma, and consequently it flushes the TLB after
each VMA. I suspect it is not intentional.



Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()

2021-02-01 Thread Nadav Amit
> On Feb 1, 2021, at 5:19 AM, Peter Zijlstra  wrote:
> 
> On Sat, Jan 30, 2021 at 04:11:25PM -0800, Nadav Amit wrote:
>> +#define tlb_start_ptes(tlb) \
>> +do {\
>> +struct mmu_gather *_tlb = (tlb);\
>> +\
>> +flush_tlb_batched_pending(_tlb->mm);\
>> +} while (0)
>> +
>> +static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
> 
>>  tlb_change_page_size(tlb, PAGE_SIZE);
>>  orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
>> -flush_tlb_batched_pending(mm);
>> +tlb_start_ptes(tlb);
>>  arch_enter_lazy_mmu_mode();
>>  for (; addr < end; pte++, addr += PAGE_SIZE) {
>>  ptent = *pte;
>> @@ -468,6 +468,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>  }
>> 
>>  arch_leave_lazy_mmu_mode();
>> +tlb_end_ptes(tlb);
>>  pte_unmap_unlock(orig_pte, ptl);
>>  if (pageout)
>>  reclaim_pages(_list);
> 
> I don't like how you're dubbling up on arch_*_lazy_mmu_mode(). It seems
> to me those should be folded into tlb_{start,end}_ptes().
> 
> Alternatively, the even more work approach would be to, add an optional
> @tlb argument to pte_offset_map_lock()/pte_unmap_unlock() and friends.

Not too fond of the “even more work” approach. I still have debts I need to
pay to the kernel community on old patches that didn’t make it through.

I will fold arch_*_lazy_mmu_mode() as you suggested. Admittedly, I do not
understand this arch_*_lazy_mmu_mode() very well - I would have assumed
they would be needed only when PTEs are established, and in other cases
the arch code will hook directly to the TLB flushing interface.

However, based on the code, it seems that powerpc does not even flush PTEs
that are established (only removed/demoted). Probably I am missing
something. I will just blindly fold it.

Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity

2021-02-01 Thread Nadav Amit
> On Jan 30, 2021, at 4:11 PM, Nadav Amit  wrote:
> 
> From: Nadav Amit 
> 
> Currently, deferred TLB flushes are detected in the mm granularity: if
> there is any deferred TLB flush in the entire address space due to NUMA
> migration, pte_accessible() in x86 would return true, and
> ptep_clear_flush() would require a TLB flush. This would happen even if
> the PTE resides in a completely different vma.

[ snip ]

> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
> +{
> + struct mm_struct *mm = tlb->mm;
> + u64 mm_gen;
> +
> + /*
> +  * Any change of PTE before calling __track_deferred_tlb_flush() must be
> +  * performed using RMW atomic operation that provides a memory barriers,
> +  * such as ptep_modify_prot_start().  The barrier ensure the PTEs are
> +  * written before the current generation is read, synchronizing
> +  * (implicitly) with flush_tlb_mm_range().
> +  */
> + smp_mb__after_atomic();
> +
> + mm_gen = atomic64_read(&mm->tlb_gen);
> +
> + /*
> +  * This condition checks for both first deferred TLB flush and for other
> +  * TLB pending or executed TLB flushes after the last table that we
> +  * updated. In the latter case, we are going to skip a generation, which
> +  * would lead to a full TLB flush. This should therefore not cause
> +  * correctness issues, and should not induce overheads, since anyhow in
> +  * TLB storms it is better to perform full TLB flush.
> +  */
> + if (mm_gen != tlb->defer_gen) {
> + VM_BUG_ON(mm_gen < tlb->defer_gen);
> +
> + tlb->defer_gen = inc_mm_tlb_gen(mm);
> + }
> +}

Andy’s comments managed to make me realize this code is wrong. We must
call inc_mm_tlb_gen(mm) every time.

Otherwise, a CPU that saw the old tlb_gen might have already recorded it in
its local cpu_tlbstate on a context-switch. If the process was not running
when the TLB flush was issued, no IPI will be sent to that CPU. Therefore, a
later switch_mm_irqs_off() back to the process will not flush the local TLB.

I need to think if there is a better solution. Multiple calls to
inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
instead of one that is specific to the ranges, once the flush actually takes
place. On x86 it’s practically a non-issue, since anyhow any update of more
than 33 entries or so would cause a full TLB flush, but this is still ugly.



Re: [RFC 01/20] mm/tlb: fix fullmm semantics

2021-01-31 Thread Nadav Amit
> On Jan 30, 2021, at 6:57 PM, Andy Lutomirski  wrote:
> 
> On Sat, Jan 30, 2021 at 5:19 PM Nadav Amit  wrote:
>>> On Jan 30, 2021, at 5:02 PM, Andy Lutomirski  wrote:
>>> 
>>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>>>> From: Nadav Amit 
>>>> 
>>>> fullmm in mmu_gather is supposed to indicate that the mm is torn-down
>>>> (e.g., on process exit) and can therefore allow certain optimizations.
>>>> However, tlb_finish_mmu() sets fullmm, when in fact it want to say that
>>>> the TLB should be fully flushed.
>>> 
>>> Maybe also rename fullmm?
>> 
>> Possible. How about mm_torn_down?
> 
> Sure.  Or mm_exiting, perhaps?

mm_exiting indeed sounds better.


Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()

2021-01-31 Thread Nadav Amit
> On Jan 31, 2021, at 2:07 AM, Damian Tometzki  wrote:
> 
> On Sat, 30. Jan 16:11, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Introduce tlb_start_ptes() and tlb_end_ptes() which would be called
>> before and after PTEs are updated and TLB flushes are deferred. This
>> will be later be used for fine granualrity deferred TLB flushing
>> detection.
>> 
>> In the meanwhile, move flush_tlb_batched_pending() into
>> tlb_start_ptes(). It was not called from mapping_dirty_helpers by
>> wp_pte() and clean_record_pte(), which might be a bug.
>> 
>> No additional functional change is intended.
>> 
>> Signed-off-by: Nadav Amit 
>> Cc: Andrea Arcangeli 
>> Cc: Andrew Morton 
>> Cc: Andy Lutomirski 
>> Cc: Dave Hansen 
>> Cc: Peter Zijlstra 
>> Cc: Thomas Gleixner 
>> Cc: Will Deacon 
>> Cc: Yu Zhao 
>> Cc: Nick Piggin 
>> Cc: x...@kernel.org
>> ---
>> fs/proc/task_mmu.c |  2 ++
>> include/asm-generic/tlb.h  | 18 ++
>> mm/madvise.c   |  6 --
>> mm/mapping_dirty_helpers.c | 15 +--
>> mm/memory.c|  2 ++
>> mm/mprotect.c  |  3 ++-
>> 6 files changed, 41 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 4cd048ffa0f6..d0cce961fa5c 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1168,6 +1168,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned 
>> long addr,
>>  return 0;
>> 
>>  pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +tlb_start_ptes(>tlb);
>>  for (; addr != end; pte++, addr += PAGE_SIZE) {
>>  ptent = *pte;
>> 
>> @@ -1190,6 +1191,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned 
>> long addr,
>>  tlb_flush_pte_range(>tlb, addr, PAGE_SIZE);
>>  ClearPageReferenced(page);
>>  }
>> +tlb_end_ptes(>tlb);
>>  pte_unmap_unlock(pte - 1, ptl);
>>  cond_resched();
>>  return 0;
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 041be2ef4426..10690763090a 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -58,6 +58,11 @@
>>  *Defaults to flushing at tlb_end_vma() to reset the range; helps when
>>  *there's large holes between the VMAs.
>>  *
>> + *  - tlb_start_ptes() / tlb_end_ptes; makr the start / end of PTEs change.
> 
> Hello Nadav,
> 
> short nid makr/mark

Thanks! I will fix it.




Re: [RFC 08/20] mm: store completed TLB generation

2021-01-31 Thread Nadav Amit
> On Jan 31, 2021, at 12:32 PM, Andy Lutomirski  wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> From: Nadav Amit 
>> 
>> To detect deferred TLB flushes in fine granularity, we need to keep
>> track on the completed TLB flush generation for each mm.
>> 
>> Add logic to track for each mm the tlb_gen_completed, which tracks the
>> completed TLB generation. It is the arch responsibility to call
>> mark_mm_tlb_gen_done() whenever a TLB flush is completed.
>> 
>> Start the generation numbers from 1 instead of 0. This would allow later
>> to detect whether flushes of a certain generation were completed.
> 
> Can you elaborate on how this helps?

I guess it should have gone to patch 15.

The relevant code it interacts with is in read_defer_tlb_flush_gen(). It
allows a single check to detect an “outdated” deferred TLB gen. Initially
tlb->defer_gen is zero. We are going to do inc_mm_tlb_gen() both on the
first time we defer TLB entries and whenever we see mm_gen is newer than
tlb->defer_gen:

+   mm_gen = atomic64_read(&mm->tlb_gen);
+
+   /*
+* This condition checks for both first deferred TLB flush and for other
+* TLB pending or executed TLB flushes after the last table that we
+* updated. In the latter case, we are going to skip a generation, which
+* would lead to a full TLB flush. This should therefore not cause
+* correctness issues, and should not induce overheads, since anyhow in
+* TLB storms it is better to perform full TLB flush.
+*/
+   if (mm_gen != tlb->defer_gen) {
+   VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+   tlb->defer_gen = inc_mm_tlb_gen(mm);
+   }


> 
> I think you should document that tlb_gen_completed only means that no
> outdated TLB entries will be observably used.  In the x86
> implementation it's possible for older TLB entries to still exist,
> unused, in TLBs of cpus running other mms.

You mean entries that would be flushed later, during switch_mm_irqs_off(), right? I
think that overall my comments need some work. Yes.

> How does this work with arch_tlbbatch_flush()?

completed_gen is not updated by arch_tlbbatch_flush(), since I couldn’t find
a way to combine them. completed_gen might not catch up with tlb_gen in this
case until another TLB flush takes place. I do not see correctness issue,
but it might result in redundant TLB flush.

>> Signed-off-by: Nadav Amit 
>> Cc: Andrea Arcangeli 
>> Cc: Andrew Morton 
>> Cc: Andy Lutomirski 
>> Cc: Dave Hansen 
>> Cc: Peter Zijlstra 
>> Cc: Thomas Gleixner 
>> Cc: Will Deacon 
>> Cc: Yu Zhao 
>> Cc: Nick Piggin 
>> Cc: x...@kernel.org
>> ---
>> arch/x86/mm/tlb.c | 10 ++
>> include/asm-generic/tlb.h | 33 +
>> include/linux/mm_types.h  | 15 ++-
>> 3 files changed, 57 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 7ab21430be41..d17b5575531e 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -14,6 +14,7 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> 
>> #include "mm_internal.h"
>> 
>> @@ -915,6 +916,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned 
>> long start,
>>if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>>flush_tlb_others(mm_cpumask(mm), info);
>> 
>> +   /* Update the completed generation */
>> +   mark_mm_tlb_gen_done(mm, new_tlb_gen);
>> +
>>put_flush_tlb_info();
>>put_cpu();
>> }
>> @@ -1147,6 +1151,12 @@ void arch_tlbbatch_flush(struct 
>> arch_tlbflush_unmap_batch *batch)
>> 
>>cpumask_clear(&batch->cpumask);
>> 
>> +   /*
>> +* We cannot call mark_mm_tlb_gen_done() since we do not know which
>> +* mm's should be flushed. This may lead to some unwarranted TLB
>> +* flushes, but not to correction problems.
>> +*/
>> +
>>put_cpu();
>> }
>> 
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 517c89398c83..427bfcc6cdec 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -513,6 +513,39 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, 
>> struct vm_area_struct *vm
>> }
>> #endif
>> 
>> +#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
>> +
>> +/*
>> + * Helper function to update a generation to have a new value, as long as 
>> new
>> + * value is greater or equal to 

Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Nadav Amit
> On Jan 31, 2021, at 4:10 AM, Andrew Cooper  wrote:
> 
> On 31/01/2021 01:07, Andy Lutomirski wrote:
>> Adding Andrew Cooper, who has a distressingly extensive understanding
>> of the x86 PTE magic.
> 
> Pretty sure it is all learning things the hard way...
> 
>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 632d5a677d3f..b7473d2c9a1f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather 
>>> *tlb,
>>>ptent = pte_mkwrite(ptent);
>>>}
>>>ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>>> ptent);
>>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> +   if (pte_may_need_flush(oldpte, ptent))
>>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> 
> You're choosing to avoid the flush, based on A/D bits read ahead of the
> actual modification of the PTE.
> 
> In this example, another thread can write into the range (sets A and D),
> and get a suitable TLB entry which goes unflushed while the rest of the
> kernel thinks the memory is write-protected and clean.
> 
> The only safe way to do this is to use XCHG/etc to modify the PTE, and
> base flush calculations on the results.  Atomic operations are ordered
> with A/D updates from pagewalks on other CPUs, even on AMD where A
> updates are explicitly not ordered with regular memory reads, for
> performance reasons.

Thanks Andrew for the feedback, but I think the patch does it exactly in
this safe manner that you describe (at least on native x86, but I see a
similar path elsewhere as well):

oldpte = ptep_modify_prot_start()
-> __ptep_modify_prot_start()
-> ptep_get_and_clear
-> native_ptep_get_and_clear()
-> xchg()

Note that the xchg() will clear the PTE (i.e., making it non-present), and
no further updates of A/D are possible until ptep_modify_prot_commit() is
called.

On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
we should be safe.

I guess you are right and pte_may_need_flush() deserves a comment to
clarify that oldpte must be obtained by an atomic operation to ensure no A/D
bits are lost (as you say).

Yet, I do not see a correctness problem. Am I missing something?



Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-31 Thread Nadav Amit
> On Jan 30, 2021, at 11:57 PM, Nadav Amit  wrote:
> 
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin  wrote:
>> 
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit 
>>> 
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>> 
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>> 
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>>  ongoing deferred TLB flush and flushing the entire range eventually
>>>  (e.g., change_protection_range()).
>>> 
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>> 
>>> 4. Batching per-table flushes (move_ptes()).
>>> 
>>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>>  flushing when (try_to_unmap_one() on x86).
>>> 
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>> 
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> in a later time.
>>> 
>>> Admittedly, I do not understand how things are not broken today, which
>>> frightens me to make further batching before getting things in order.
>>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>>> for each page-table (but not in greater granularity). Can't
>>> ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>> 
>> Because it's holding the page table lock which stops page_mkclean from 
>> cleaning the page. Or am I misunderstanding the question?
> 
> Thanks. I understood this part. Looking again at the code, I now understand
> my confusion: I forgot that the reverse mapping is removed after the PTE is
> zapped.
> 
> Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
> by performing set_page_dirty() for the batched pages when needed in
> tlb_finish_mmu() [ i.e., by marking for each batched page whether
> set_page_dirty() should be issued for that page while collecting them ].

Correcting myself (I hope): no, we cannot do so, since the buffers might be
removed from the page at that point.



Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-31 Thread Nadav Amit
> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit 
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
>> 
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>> 
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> in a later time.
>> 
>> Admittedly, I do not understand how things are not broken today, which
>> frightens me to make further batching before getting things in order.
>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>> for each page-table (but not in greater granularity). Can't
>> ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
> 
> Because it's holding the page table lock which stops page_mkclean from 
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now understand
my confusion: I forgot that the reverse mapping is removed after the PTE is
zapped.

Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
by performing set_page_dirty() for the batched pages when needed in
tlb_finish_mmu() [ i.e., by marking for each batched page whether
set_page_dirty() should be issued for that page while collecting them ].

> I'll go through the patches a bit more closely when they all come 
> through. Sparc and powerpc of course need the arch lazy mode to get 
> per-page/pte information for operations that are not freeing pages, 
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when 
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also holds
the pages. 
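
Roughly along these lines (a sketch with made-up names, not a proposal for
the actual mmu_gather layout):

#include <stddef.h>

struct page;				/* stand-in for the kernel's struct page */

struct tlb_flush_gather {		/* flush information only */
	unsigned long start, end;
	unsigned int freed_tables : 1;
	unsigned int cleared_ptes : 1;
};

struct full_mmu_gather {		/* flush information plus pages to free */
	struct tlb_flush_gather flush;
	struct page *pages[64];
	size_t nr_pages;
};

int main(void)
{
	/* paths that never free pages could keep only the small one on stack */
	struct tlb_flush_gather small = { .start = 0, .end = 0 };
	struct full_mmu_gather full = { .flush = small };

	(void)full;
	return 0;
}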

>> This patch-set therefore performs the following changes:
>> 
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>   instead of {inc|dec}_tlb_flush_pending().
>> 
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>> 
>> 3. Cleans up mmu_gather to be less arch-dependant.
>> 
>> 4. Uses mm's generations to track in finer granularity, either per-VMA
>>   or per page-table, whether a pending mmu_gather operation is
>>   outstanding. This should allow to avoid some TLB flushes when KSM or
>>   memory reclamation takes place while another operation such as
>>   munmap() or mprotect() is running.
>> 
>> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>>   broken to track in a bitmap which CPUs have outstanding TLB flushes
>>   instead of having a flag.
> 
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to create, so I did not include them in
the “fixes” section. I will restructure the patch-set according to the
feedback.

Thanks,
Nadav

Re: [RFC 01/20] mm/tlb: fix fullmm semantics

2021-01-30 Thread Nadav Amit
> On Jan 30, 2021, at 5:02 PM, Andy Lutomirski  wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> From: Nadav Amit 
>> 
>> fullmm in mmu_gather is supposed to indicate that the mm is torn-down
>> (e.g., on process exit) and can therefore allow certain optimizations.
>> However, tlb_finish_mmu() sets fullmm, when in fact it want to say that
>> the TLB should be fully flushed.
> 
> Maybe also rename fullmm?

Possible. How about mm_torn_down?

I should have also changed the comment in tlb_finish_mmu().


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-30 Thread Nadav Amit
> On Jan 30, 2021, at 5:07 PM, Andy Lutomirski  wrote:
> 
> Adding Andrew Cooper, who has a distressingly extensive understanding
> of the x86 PTE magic.
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> From: Nadav Amit 
>> 
>> Currently, using mprotect() to unprotect a memory region or uffd to
>> unprotect a memory region causes a TLB flush. At least on x86, as
>> protection is promoted, no TLB flush is needed.
>> 
>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>> flush is needed based on the old PTE and the new one. Implement an x86
>> pte_may_need_flush().
>> 
>> For x86, besides the simple logic that PTE protection promotion or
>> changes of software bits does require a flush, also add logic that
>> considers the dirty-bit. If the dirty-bit is clear and write-protect is
>> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
>> on write, and if the bit is clear, the PTE is reread.
>> 
>> Signed-off-by: Nadav Amit 
>> Cc: Andrea Arcangeli 
>> Cc: Andrew Morton 
>> Cc: Andy Lutomirski 
>> Cc: Dave Hansen 
>> Cc: Peter Zijlstra 
>> Cc: Thomas Gleixner 
>> Cc: Will Deacon 
>> Cc: Yu Zhao 
>> Cc: Nick Piggin 
>> Cc: x...@kernel.org
>> ---
>> arch/x86/include/asm/tlbflush.h | 44 +
>> include/asm-generic/tlb.h   |  4 +++
>> mm/mprotect.c   |  3 ++-
>> 3 files changed, 50 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h 
>> b/arch/x86/include/asm/tlbflush.h
>> index 8c87a2e0b660..a617dc0a9b06 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
>> arch_tlbflush_unmap_batch *batch,
>> 
>> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> 
>> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
>> +{
>> +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
>> +_PAGE_SOFTW3 | _PAGE_ACCESSED;
> 
> Why is accessed ignored?  Surely clearing the accessed bit needs a
> flush if the old PTE is present.

I am just following the current scheme in the kernel (x86):

int ptep_clear_flush_young(struct vm_area_struct *vma,
   unsigned long address, pte_t *ptep)
{
/*
 * On x86 CPUs, clearing the accessed bit without a TLB flush
 * doesn't cause data corruption. [ It could cause incorrect
 * page aging and the (mistaken) reclaim of hot pages, but the
 * chance of that should be relatively low. ]
 *
 * So as a performance optimization don't flush the TLB when
 * clearing the accessed bit, it will eventually be flushed by
 * a context switch or a VM operation anyway. [ In the rare
 * event of it not getting flushed for a long time the delay
 * shouldn't really matter because there's no real memory
 * pressure for swapout to react to. ]
 */
return ptep_test_and_clear_young(vma, address, ptep);
}


> 
>> +   const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
>> +   pteval_t oldval = pte_val(oldpte);
>> +   pteval_t newval = pte_val(newpte);
>> +   pteval_t diff = oldval ^ newval;
>> +   pteval_t disable_mask = 0;
>> +
>> +   if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
>> +   disable_mask = _PAGE_NX;
>> +
>> +   /* new is non-present: need only if old is present */
>> +   if (pte_none(newpte))
>> +   return !pte_none(oldpte);
>> +
>> +   /*
>> +* If, excluding the ignored bits, only RW and dirty are cleared and 
>> the
>> +* old PTE does not have the dirty-bit set, we can avoid a flush. 
>> This
>> +* is possible since x86 architecture set the dirty bit atomically 
>> while
> 
> s/set/sets/
> 
>> +* it caches the PTE in the TLB.
>> +*
>> +* The condition considers any change to RW and dirty as not 
>> requiring
>> +* flush if the old PTE is not dirty or not writable for 
>> simplification
>> +* of the code and to consider (unlikely) cases of changing 
>> dirty-bit of
>> +* write-protected PTE.
>> +*/
>> +   if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
>> +   (!(pte_dirty(oldpte) || !pte_write(oldpte
>> +   return false;
> 
> This 

Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Nadav Amit
> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski  wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> From: Nadav Amit 
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
> 
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.


[RFC 18/20] mm: make mm_cpumask() volatile

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

mm_cpumask() is volatile: a bit might be turned on or off at any given
moment, and it is not protected by any lock. While the kernel coding
guidelines are very prohibitive against the use of volatile, not marking
mm_cpumask() as volatile seems wrong.

Cpumask and bitmap manipulation functions may work fine, as they are
allowed to use either the new or old value. Apparently they do, as no
bugs were reported. However, the fact that mm_cpumask() is not volatile
might lead to theoretical bugs due to compiler optimizations.

For example, cpumask_next() uses _find_next_bit(). A compiler might add
to _find_next_bit() invented loads that would cause __ffs() to run on a
different value than the one read before. Consequently, if something
like that happens, the result might be a CPU that was set in neither the
old nor the new mask. I could not find what might go wrong in such a
case, but it seems like an improper result.
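
To illustrate the concern, here is a small user-space sketch (hypothetical
helpers, not the kernel's _find_next_bit()); only the second variant pins a
single snapshot of the word:

#include <limits.h>

static unsigned long first_bit_plain(const unsigned long *word)
{
	/*
	 * Two source-level loads of *word: the compiler may keep them as two
	 * actual loads, so the ctz may run on a different value than the one
	 * that was tested if *word changes concurrently.
	 */
	if (*word)
		return __builtin_ctzl(*word);
	return sizeof(unsigned long) * CHAR_BIT;
}

static unsigned long first_bit_volatile(const volatile unsigned long *word)
{
	/* a single volatile load; test and ctz use the same snapshot */
	unsigned long v = *word;

	if (v)
		return __builtin_ctzl(v);
	return sizeof(unsigned long) * CHAR_BIT;
}

int main(void)
{
	unsigned long w = 0x40;
	volatile unsigned long vw = 0x40;

	return (first_bit_plain(&w) == 6 && first_bit_volatile(&vw) == 6) ? 0 : 1;
}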

Mark the mm_cpumask() result as volatile and propagate the "volatile"
qualifier according to the compiler's shouts.

Signed-off-by: Nadav Amit 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: x...@kernel.org
---
 arch/arm/include/asm/bitops.h |  4 ++--
 arch/x86/hyperv/mmu.c |  2 +-
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/tlbflush.h   |  2 +-
 arch/x86/mm/tlb.c |  4 ++--
 arch/x86/xen/mmu_pv.c |  2 +-
 include/asm-generic/bitops/find.h |  8 
 include/linux/bitmap.h| 16 +++
 include/linux/cpumask.h   | 28 +--
 include/linux/mm_types.h  |  4 ++--
 include/linux/smp.h   |  6 +++---
 kernel/smp.c  |  8 
 lib/bitmap.c  |  8 
 lib/cpumask.c |  8 
 lib/find_bit.c| 10 +-
 15 files changed, 56 insertions(+), 56 deletions(-)

diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
index c92e42a5c8f7..c8690c0ff15a 100644
--- a/arch/arm/include/asm/bitops.h
+++ b/arch/arm/include/asm/bitops.h
@@ -162,8 +162,8 @@ extern int _test_and_change_bit(int nr, volatile unsigned 
long * p);
  */
 extern int _find_first_zero_bit_le(const unsigned long *p, unsigned size);
 extern int _find_next_zero_bit_le(const unsigned long *p, int size, int 
offset);
-extern int _find_first_bit_le(const unsigned long *p, unsigned size);
-extern int _find_next_bit_le(const unsigned long *p, int size, int offset);
+extern int _find_first_bit_le(const volatile unsigned long *p, unsigned size);
+extern int _find_next_bit_le(const volatile unsigned long *p, int size, int 
offset);
 
 /*
  * Big endian assembly bitops.  nr = 0 -> byte 3 bit 0.
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350c1fb0..76ce8a0f19ef 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,7 +52,7 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+static void hyperv_flush_tlb_others(const volatile struct cpumask *cpus,
const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index b6b02b7c19cc..35b5696aedc7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -201,7 +201,7 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
+   void (*flush_tlb_others)(const volatile struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 296a00545056..a4e7c90d11a8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -208,7 +208,7 @@ struct flush_tlb_info {
 void flush_tlb_local(void);
 void flush_tlb_one_user(unsigned long addr);
 void flush_tlb_one_kernel(unsigned long addr);
-void flush_tlb_others(const struct cpumask *cpumask,
+void flush_tlb_others(const volatile struct cpumask *cpumask,
  const struct flush_tlb_info *info);
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 48f4b56fc4a7..ba85d6bb4988 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -796,7 +796,7 @@ static bool tlb_is_not_lazy(int cpu, void *data)
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
-STATIC_NOPV void native_flush_tlb_others(const s

[RFC 16/20] mm/tlb: per-page table generation tracking

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

Detecting deferred TLB flushes per-VMA has two drawbacks:

1. It requires an atomic cmpxchg to record mm's TLB generation at the
time of the last TLB flush, as two deferred TLB flushes on the same VMA
can race.

2. It might be too coarse-grained for large VMAs.

On 64-bit architectures that have available space in page-struct, we can
resolve these two drawbacks by recording the TLB generation at the time
of the last deferred flush in the page-struct of the page-table whose TLB
flushes were deferred.

Introduce a new CONFIG_PER_TABLE_DEFERRED_FLUSHES config option. When it
is enabled, record the deferred TLB flush generation in page-struct,
protected by the page-table lock.

Signed-off-by: Nadav Amit 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: x...@kernel.org
---
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/pgtable.h | 23 ++--
 fs/proc/task_mmu.c |  6 ++--
 include/asm-generic/tlb.h  | 65 ++
 include/linux/mm.h | 13 +++
 include/linux/mm_types.h   | 22 
 init/Kconfig   |  7 
 mm/huge_memory.c   |  2 +-
 mm/mapping_dirty_helpers.c |  4 +--
 mm/mprotect.c  |  2 +-
 10 files changed, 113 insertions(+), 32 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d56b0f5cb00c..dfc6ee9dbe9c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -250,6 +250,7 @@ config X86
select X86_FEATURE_NAMESif PROC_FS
select PROC_PID_ARCH_STATUS if PROC_FS
select MAPPING_DIRTY_HELPERS
+   select PER_TABLE_DEFERRED_FLUSHES   if X86_64
imply IMA_SECURE_AND_OR_TRUSTED_BOOTif EFI
 
 config INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a0e069c15dbc..b380a849be90 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -774,17 +774,18 @@ static inline int pte_devmap(pte_t a)
 }
 #endif
 
-#define pte_accessible pte_accessible
-static inline bool pte_accessible(struct vm_area_struct *vma, pte_t *a)
-{
-   if (pte_flags(*a) & _PAGE_PRESENT)
-   return true;
-
-   if ((pte_flags(*a) & _PAGE_PROTNONE) && pte_tlb_flush_pending(vma, a))
-   return true;
-
-   return false;
-}
+#define pte_accessible(vma, a) \
+   ({  \
+   pte_t *_a = (a);\
+   bool r = false; \
+   \
+   if (pte_flags(*_a) & _PAGE_PRESENT) \
+   r = true;   \
+   else\
+   r = ((pte_flags(*_a) & _PAGE_PROTNONE) &&   \
+pte_tlb_flush_pending((vma), _a)); \
+   r;  \
+   })
 
 static inline int pmd_present(pmd_t pmd)
 {
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d0cce961fa5c..00e116feb62c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1157,7 +1157,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long 
addr,
/* Clear accessed and referenced bits. */
pmdp_test_and_clear_young(vma, addr, pmd);
test_and_clear_page_young(page);
-   tlb_flush_pmd_range(>tlb, addr, HPAGE_PMD_SIZE);
+   tlb_flush_pmd_range(>tlb, pmd, addr, HPAGE_PMD_SIZE);
ClearPageReferenced(page);
 out:
spin_unlock(ptl);
@@ -1174,7 +1174,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long 
addr,
 
if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
clear_soft_dirty(vma, addr, pte);
-   tlb_flush_pte_range(>tlb, addr, PAGE_SIZE);
+   tlb_flush_pte_range(>tlb, pte, addr, PAGE_SIZE);
continue;
}
 
@@ -1188,7 +1188,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long 
addr,
/* Clear accessed and referenced bits. */
ptep_test_and_clear_young(vma, addr, pte);
test_and_clear_page_young(page);
-   tlb_flush_pte_range(>tlb, addr, PAGE_SIZE);
+   tlb_flush_pte_range(>tlb, pte, addr, PAGE_SIZE);
ClearPageReferenced(page);
}
tlb_end_ptes(>tlb);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f25d2d955076..74dbb56d816d 100644
--- a/include/asm-

[RFC 19/20] lib/cpumask: introduce cpumask_atomic_or()

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

Introduce cpumask_atomic_or() and bitmap_atomic_or() to allow OR
operations to be performed atomically on cpumasks. This will be used
by the next patch.

To be more efficient, skip atomic operations when no changes are needed.

Signed-off-by: Nadav Amit 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: x...@kernel.org
---
 include/linux/bitmap.h  |  5 +
 include/linux/cpumask.h | 12 
 lib/bitmap.c| 25 +
 3 files changed, 42 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 769b7a98e12f..c9a9b784b244 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,6 +76,7 @@
  *  bitmap_to_arr32(buf, src, nbits)Copy nbits from buf to u32[] 
dst
  *  bitmap_get_value8(map, start)   Get 8bit value from map at 
start
  *  bitmap_set_value8(map, value, start)Set 8bit value to map at start
+ *  bitmap_atomic_or(dst, src, nbits)  *dst |= *src (atomically)
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -577,6 +578,10 @@ static inline void bitmap_set_value8(unsigned long *map, 
unsigned long value,
map[index] |= value << offset;
 }
 
+extern void bitmap_atomic_or(volatile unsigned long *dst,
+   const volatile unsigned long *bitmap, unsigned int bits);
+
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 3d7e418aa113..0567d73a0192 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -699,6 +699,18 @@ static inline unsigned int cpumask_size(void)
return BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long);
 }
 
+/**
+ * cpumask_atomic_or - *dstp |= *srcp (*dstp is set atomically)
+ * @dstp: the cpumask result (and source which is or'd)
+ * @srcp: the source input
+ */
+static inline void cpumask_atomic_or(volatile struct cpumask *dstp,
+const volatile struct cpumask *srcp)
+{
+   bitmap_atomic_or(cpumask_bits(dstp), cpumask_bits(srcp),
+nr_cpumask_bits);
+}
+
 /*
  * cpumask_var_t: struct cpumask for stack usage.
  *
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 6df7b13727d3..50f1842ff891 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -1310,3 +1310,28 @@ void bitmap_to_arr32(u32 *buf, const unsigned long 
*bitmap, unsigned int nbits)
 EXPORT_SYMBOL(bitmap_to_arr32);
 
 #endif
+
+void bitmap_atomic_or(volatile unsigned long *dst,
+ const volatile unsigned long *bitmap, unsigned int bits)
+{
+   unsigned int k;
+   unsigned int nr = BITS_TO_LONGS(bits);
+
+   for (k = 0; k < nr; k++) {
+   unsigned long src = bitmap[k];
+
+   /*
+* Skip atomic operations when no bits are changed. Do not use
+* bitmap[k] directly to avoid redundant loads as bitmap
+* variable is volatile.
+*/
+   if (!(src & ~dst[k]))
+   continue;
+
+   if (BITS_PER_LONG == 64)
+   atomic64_or(src, (atomic64_t *)&dst[k]);
+   else
+   atomic_or(src, (atomic_t *)&dst[k]);
+   }
+}
+EXPORT_SYMBOL(bitmap_atomic_or);
-- 
2.25.1



[RFC 20/20] mm/rmap: avoid potential races

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

flush_tlb_batched_pending() appears to have a theoretical race:
tlb_flush_batched is being cleared after the TLB flush, and if in
between another core calls set_tlb_ubc_flush_pending() and sets the
pending TLB flush indication, this indication might be lost. Holding the
page-table lock when SPLIT_LOCK is set cannot eliminate this race.

The current batched TLB invalidation scheme therefore does not seem
viable or easily repairable.

Introduce a new scheme, in which a cpumask is maintained for pending
batched TLB flushes. When a full TLB flush is performed, clear the
corresponding bit on the CPU that performs the TLB flush.

This scheme is only suitable for architectures that use IPIs for TLB
shootdowns. As x86 is the only architecture that currently uses batched
TLB flushes, this is not an issue.

Signed-off-by: Nadav Amit 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: x...@kernel.org
---
 arch/x86/include/asm/tlbbatch.h | 15 
 arch/x86/include/asm/tlbflush.h |  2 +-
 arch/x86/mm/tlb.c   | 18 ++-
 include/linux/mm.h  |  7 ++
 include/linux/mm_types_task.h   | 13 ---
 mm/rmap.c   | 41 -
 6 files changed, 40 insertions(+), 56 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
deleted file mode 100644
index 1ad56eb3e8a8..
--- a/arch/x86/include/asm/tlbbatch.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ARCH_X86_TLBBATCH_H
-#define _ARCH_X86_TLBBATCH_H
-
-#include 
-
-struct arch_tlbflush_unmap_batch {
-   /*
-* Each bit set is a CPU that potentially has a TLB entry for one of
-* the PFNs being flushed..
-*/
-   struct cpumask cpumask;
-};
-
-#endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a4e7c90d11a8..0e681a565b78 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,7 +240,7 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_flush(void);
 
 static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ba85d6bb4988..f7304d45e6b9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -760,8 +760,15 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, 
nr_invalidate);
trace_tlb_flush(reason, nr_invalidate);
} else {
+   int cpu = smp_processor_id();
+
/* Full flush. */
flush_tlb_local();
+
+   /* If there are batched TLB flushes, mark they are done */
+   if (cpumask_test_cpu(cpu, &tlb_flush_batched_cpumask))
+   cpumask_clear_cpu(cpu, &tlb_flush_batched_cpumask);
+
if (local)
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
trace_tlb_flush(reason, TLB_FLUSH_ALL);
@@ -1143,21 +1150,20 @@ static const struct flush_tlb_info full_flush_tlb_info 
= {
.end = TLB_FLUSH_ALL,
 };
 
-void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+void arch_tlbbatch_flush(void)
 {
int cpu = get_cpu();
 
-   if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+   if (cpumask_test_cpu(cpu, &tlb_flush_batched_cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
local_irq_enable();
}

-   if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-   flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
-   cpumask_clear(&batch->cpumask);
+   if (cpumask_any_but(&tlb_flush_batched_cpumask, cpu) < nr_cpu_ids)
+   flush_tlb_others(&tlb_flush_batched_cpumask,
+&full_flush_tlb_info);
 
/*
 * We cannot call mark_mm_tlb_gen_done() since we do not know which
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a8a5bf82bd03..e4985cf6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3197,5 +3197,12 @@ unsigned long wp_shared_mapping_range(struct 
address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+extern volatile cpumask_t tlb_flush_batched_cpumask;
+void tlb_batch_init(void);
+#else
+static inline void tlb_batch_init(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff 

[RFC 17/20] mm/tlb: updated completed deferred TLB flush conditionally

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

If all the deferred TLB flushes were completed, there is no need to
update the completed TLB generation. This update requires an atomic cmpxchg,
so we would like to skip it.

To do so, save for each mm the last TLB generation in which TLB flushes
were deferred. While saving this information requires another atomic
cmpxchg, assume that deferred TLB flushes are less frequent than TLB
flushes.

Signed-off-by: Nadav Amit 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: x...@kernel.org
---
 include/asm-generic/tlb.h | 23 ++-
 include/linux/mm_types.h  |  5 +
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 74dbb56d816d..a41af03fbede 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -536,6 +536,14 @@ static inline void tlb_update_generation(atomic64_t *gen, 
u64 new_gen)
 
 static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 {
+   /*
+* If all the deferred TLB generations were completed, we can skip
+* the update of tlb_gen_completed and save a few cycles on cmpxchg.
+*/
+   if (atomic64_read(&mm->tlb_gen_deferred) ==
+   atomic64_read(&mm->tlb_gen_completed))
+   return;
+
/*
 * Update the completed generation to the new generation if the new
 * generation is greater than the previous one.
@@ -546,7 +554,7 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct 
*mm, u64 gen)
 static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
 {
struct mm_struct *mm = tlb->mm;
-   u64 mm_gen;
+   u64 mm_gen, new_gen;
 
/*
 * Any change of PTE before calling __track_deferred_tlb_flush() must be
@@ -567,11 +575,16 @@ static inline void read_defer_tlb_flush_gen(struct 
mmu_gather *tlb)
 * correctness issues, and should not induce overheads, since anyhow in
 * TLB storms it is better to perform full TLB flush.
 */
-   if (mm_gen != tlb->defer_gen) {
-   VM_BUG_ON(mm_gen < tlb->defer_gen);
+   if (mm_gen == tlb->defer_gen)
+   return;
 
-   tlb->defer_gen = inc_mm_tlb_gen(mm);
-   }
+   VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+   new_gen = inc_mm_tlb_gen(mm);
+   tlb->defer_gen = new_gen;
+
+   /* Update mm->tlb_gen_deferred */
+   tlb_update_generation(&mm->tlb_gen_deferred, new_gen);
 }
 
 #ifndef CONFIG_PER_TABLE_DEFERRED_FLUSHES
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cae9e8bbf8e6..4122a9b8b56f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -578,6 +578,11 @@ struct mm_struct {
 */
atomic64_t tlb_gen;
 
+   /*
+* The last TLB generation which was deferred.
+*/
+   atomic64_t tlb_gen_deferred;
+
/*
 * TLB generation which is guarnateed to be flushed, including
 * all the PTE changes that were performed before tlb_gen was
-- 
2.25.1


