[PATCH v2] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-14 Thread Longpeng(Mike)
The translation caches may preserve obsolete data when the
mapping size is changed. Consider the following sequence, which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
  while (1) {
   (a)DMA MAP   0,0xa
   (b)DMA UNMAP 0,0xa
   (c)DMA MAP   0,0xc000
 * DMA read of IOVA 0 may fail here (not-present)
 * if the problem occurs.
   (d)DMA UNMAP 0,0xc000
  }

The page table (only the entries covering IOVA 0) after (a) is:
 PML4: 0x19db5c1003   entry:0x899bdcd2f000
  PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
   PDE: 0x1a30a72003  entry:0x89b39cacb000
PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
 PML4: 0x19db5c1003   entry:0x899bdcd2f000
  PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
   PDE: 0x1a30a72003  entry:0x89b39cacb000
PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
 PML4: 0x19db5c1003   entry:0x899bdcd2f000
  PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
   PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is still present, it won't be
flushed even though the iommu driver flushes the cache on unmap,
so the obsolete data may be preserved in the cache, which
causes a wrong translation in the end.

However, we can see the PDE entry is finally switched to a
2M-superpage mapping, but it does not transform
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
 dma_pte_free_pagetable
   Set the PDE entry to ZERO
 Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid the obsolete info being preserved.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Joerg Roedel 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage")
Cc:  # v3.0+
Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
v1 -> v2:
  - add Joerg
  - reconstruct the solution base on the Baolu's suggestion
---
 drivers/iommu/intel/iommu.c | 52 +
 1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..881c9f2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2289,6 +2289,41 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
return level;
 }
 
+/*
+ * Ensure that old small page tables are removed to make room for superpage(s).
+ * We're going to add new large pages, so make sure we don't remove their parent
+ * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared.
+ */
+static void switch_to_super_page(struct dmar_domain *domain,
+unsigned long start_pfn,
+unsigned long end_pfn, int level)
+{
+   unsigned long lvl_pages = lvl_to_nr_pages(level);
+   struct dma_pte *pte = NULL;
+   int i;
+
+   while (start_pfn <= end_pfn) {
+   if (!pte)
+   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+
+   if (dma_pte_present(pte)) {
+   dma_pte_free_pagetable(domain, start_pfn,
+  start_pfn + lvl_pages - 1,
+  level + 1);
+
+   for_each_domain_iommu(i, domain)
+   iommu_flush_iotlb_psi(g_iommus[i], domain,
+ start_pfn, lvl_pages,
+ 0, 0);
+   }
+
+   pte++;
+   start_pfn += lvl_pages;
+   if (first_pte_in_page(pte))
+   pte = NULL;
+   }
+}
+
 static int
 __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 unsigned long phys_pfn, unsigned long nr_pages, int prot)
@@ -2329,22 +2364,11 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
return -ENOMEM;
/* It is large page*/
if (largepage_lvl > 1) {
-   unsigned long nr_superpages, end_pfn;
+   unsigned long end_pfn;
 
pteval |= DMA_PTE_LARGE_PAGE;
-   lvl_pages = lvl_to_nr_pages(largepage_lvl);
-
-   nr_superpages = nr_pages / lvl_pages;
-   end_pfn = iov_pfn + nr_superpages * lvl_pages - 1;
-
-   /*
-* Ensure that old small page tables are
-* removed to make room for superpage(s).
-* We're adding new large 

RE: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-08 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Baolu,

> -Original Message-
> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> Sent: Thursday, April 8, 2021 12:32 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; io...@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org
> Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav
> Amit ; Alex Williamson ;
> Kevin Tian ; Gonglei (Arei) ;
> sta...@vger.kernel.org
> Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating 
> superpage
> 
> Hi Longpeng,
> 
> On 4/7/21 2:35 PM, Longpeng (Mike, Cloud Infrastructure Service Product
> Dept.) wrote:
> > Hi Baolu,
> >
> >> -Original Message-
> >> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> >> Sent: Friday, April 2, 2021 12:44 PM
> >> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >> ; io...@lists.linux-foundation.org;
> >> linux-kernel@vger.kernel.org
> >> Cc: baolu...@linux.intel.com; David Woodhouse ;
> >> Nadav Amit ; Alex Williamson
> >> ; Kevin Tian ;
> >> Gonglei (Arei) ; sta...@vger.kernel.org
> >> Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating
> >> superpage
> >>
> >> Hi Longpeng,
> >>
> >> On 4/1/21 3:18 PM, Longpeng(Mike) wrote:
> >>> diff --git a/drivers/iommu/intel/iommu.c
> >>> b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644
> >>> --- a/drivers/iommu/intel/iommu.c
> >>> +++ b/drivers/iommu/intel/iommu.c
> >>> @@ -2342,9 +2342,20 @@ static inline int
> >>> hardware_largepage_caps(struct
> >> dmar_domain *domain,
> >>>* removed to make room for 
> >>> superpage(s).
> >>>* We're adding new large pages, so 
> >>> make sure
> >>>* we don't remove their parent tables.
> >>> +  *
> >>> +  * We also need to flush the iotlb before 
> >>> creating
> >>> +  * superpage to ensure it does not perserves any
> >>> +  * obsolete info.
> >>>*/
> >>> - dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
> >>> -largepage_lvl + 1);
> >>> + if (dma_pte_present(pte)) {
> >>
> >> The dma_pte_free_pagetable() clears a batch of PTEs. So checking
> >> current PTE is insufficient. How about removing this check and always
> >> performing cache invalidation?
> >>
> >
> > Um...the PTE here may be present (e.g. 4K mapping --> superpage mapping)
> > or NOT-present (e.g. create a totally new superpage mapping), but we only
> > need to call free_pagetable and flush_iotlb in the former case, right ?
> 
> But this code covers multiple PTEs and perhaps crosses the page boundary.
> 
> How about moving this code into a separated function and check PTE presence
> there. A sample code could look like below: [compiled but not tested!]
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index
> d334f5b4e382..0e04d450c38a 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -2300,6 +2300,41 @@ static inline int hardware_largepage_caps(struct
> dmar_domain *domain,
>  return level;
>   }
> 
> +/*
> + * Ensure that old small page tables are removed to make room for
> superpage(s).
> + * We're going to add new large pages, so make sure we don't remove
> their parent
> + * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared.
> + */
> +static void switch_to_super_page(struct dmar_domain *domain,
> +unsigned long start_pfn,
> +unsigned long end_pfn, int level) {

Maybe "swith_to" will lead people to think "remove old and then setup new", so 
how about something like "remove_room_for_super_page" or 
"prepare_for_super_page" ?

> +   unsigned long lvl_pages = lvl_to_nr_pages(level);
> +   struct dma_pte *pte = NULL;
> +   int i;
> +
> +   while (start_pfn <= end_pfn) {

start_pfn < end_pfn ?

> +   if (!pte)
> +   pte = pfn_to_dma_pte(domain, start_pfn, &level);
> +
> +   if (dma_pte_present(pte)) {
> +   dma_pte_free_pagetable(domain, start_pfn,
> + 

RE: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-07 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Baolu,

> -Original Message-
> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> Sent: Friday, April 2, 2021 12:44 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; io...@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org
> Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav
> Amit ; Alex Williamson ;
> Kevin Tian ; Gonglei (Arei) ;
> sta...@vger.kernel.org
> Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating 
> superpage
> 
> Hi Longpeng,
> 
> On 4/1/21 3:18 PM, Longpeng(Mike) wrote:
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index ee09323..cbcb434 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct
> dmar_domain *domain,
> >  * removed to make room for superpage(s).
> >  * We're adding new large pages, so make sure
> >  * we don't remove their parent tables.
> > +*
> > +* We also need to flush the iotlb before 
> > creating
> > +* superpage to ensure it does not perserves any
> > +* obsolete info.
> >  */
> > -   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
> > -  largepage_lvl + 1);
> > +   if (dma_pte_present(pte)) {
> 
> The dma_pte_free_pagetable() clears a batch of PTEs. So checking current PTE 
> is
> insufficient. How about removing this check and always performing cache
> invalidation?
> 

Um... the PTE here may be present (e.g. 4K mapping --> superpage mapping) or
NOT-present (e.g. creating a totally new superpage mapping), but we only need
to call free_pagetable and flush_iotlb in the former case, right?

> > +   int i;
> > +
> > +   dma_pte_free_pagetable(domain, iov_pfn, 
> > end_pfn,
> > +  largepage_lvl + 
> > 1);
> > +   for_each_domain_iommu(i, domain)
> > +   
> > iommu_flush_iotlb_psi(g_iommus[i], domain,
> > + iov_pfn, 
> > nr_pages, 0, 0);
> > +
> 
> Best regards,
> baolu


Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Baolu,

On 2021/4/2 11:06, Lu Baolu wrote:
> Hi Longpeng,
> 
> On 4/1/21 3:18 PM, Longpeng(Mike) wrote:
>> The translation caches may preserve obsolete data when the
>> mapping size is changed, suppose the following sequence which
>> can reveal the problem with high probability.
>>
>> 1.mmap(4GB,MAP_HUGETLB)
>> 2.
>>    while (1) {
>>     (a)    DMA MAP   0,0xa
>>     (b)    DMA UNMAP 0,0xa
>>     (c)    DMA MAP   0,0xc000
>>   * DMA read IOVA 0 may failure here (Not present)
>>   * if the problem occurs.
>>     (d)    DMA UNMAP 0,0xc000
>>    }
>>
>> The page table(only focus on IOVA 0) after (a) is:
>>   PML4: 0x19db5c1003   entry:0x899bdcd2f000
>>    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
>>     PDE: 0x1a30a72003  entry:0x89b39cacb000
>>  PTE: 0x21d200803  entry:0x89b3b0a72000
>>
>> The page table after (b) is:
>>   PML4: 0x19db5c1003   entry:0x899bdcd2f000
>>    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
>>     PDE: 0x1a30a72003  entry:0x89b39cacb000
>>  PTE: 0x0  entry:0x89b3b0a72000
>>
>> The page table after (c) is:
>>   PML4: 0x19db5c1003   entry:0x899bdcd2f000
>>    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
>>     PDE: 0x21d200883   entry:0x89b39cacb000 (*)
>>
>> Because the PDE entry after (b) is present, it won't be
>> flushed even if the iommu driver flush cache when unmap,
>> so the obsolete data may be preserved in cache, which
>> would cause the wrong translation at end.
>>
>> However, we can see the PDE entry is finally switch to
>> 2M-superpage mapping, but it does not transform
>> to 0x21d200883 directly:
>>
>> 1. PDE: 0x1a30a72003
>> 2. __domain_mapping
>>   dma_pte_free_pagetable
>>     Set the PDE entry to ZERO
>>   Set the PDE entry to 0x21d200883
>>
>> So we must flush the cache after the entry switch to ZERO
>> to avoid the obsolete info be preserved.
>>
>> Cc: David Woodhouse 
>> Cc: Lu Baolu 
>> Cc: Nadav Amit 
>> Cc: Alex Williamson 
>> Cc: Kevin Tian 
>> Cc: Gonglei (Arei) 
>>
>> Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating
>> superpage")
>> Cc:  # v3.0+
>> Link:
>> https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
>>
>> Suggested-by: Lu Baolu 
>> Signed-off-by: Longpeng(Mike) 
>> ---
>>   drivers/iommu/intel/iommu.c | 15 +--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index ee09323..cbcb434 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct
>> dmar_domain *domain,
>>    * removed to make room for superpage(s).
>>    * We're adding new large pages, so make sure
>>    * we don't remove their parent tables.
>> + *
>> + * We also need to flush the iotlb before creating
>> + * superpage to ensure it does not perserves any
>> + * obsolete info.
>>    */
>> -    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
>> -   largepage_lvl + 1);
>> +    if (dma_pte_present(pte)) {
>> +    int i;
>> +
>> +    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
>> +   largepage_lvl + 1);
>> +    for_each_domain_iommu(i, domain)
>> +    iommu_flush_iotlb_psi(g_iommus[i], domain,
>> +  iov_pfn, nr_pages, 0, 0);
> 
> Thanks for patch!
> 
> How about making the flushed page size accurate? For example,
> 
> @@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>                                 dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
>                                                        largepage_lvl + 1);
>                                 for_each_domain_iommu(i, domain)
> -                                       iommu_flush_iotlb_psi(g_iommus[i], domain,
> -                                                             iov_pfn, nr_pages, 0, 0);
> +                                       iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn,
> +                                                             ALIGN_DOWN(nr_pages, lvl_pages), 0, 0);
> 
Yes, that makes sense.

Maybe another alternative is 'end_pfn - iov_pfn + 1'; it's readable because we
free the page table with (iov_pfn, end_pfn) above. Which one do you prefer?
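To double-check, a small sketch (mine, not from the patch; the variable meanings
follow __domain_mapping()) showing that the two candidate flush sizes are the
same value:

	unsigned long lvl_pages = lvl_to_nr_pages(largepage_lvl);
	unsigned long nr_superpages = nr_pages / lvl_pages;
	unsigned long end_pfn = iov_pfn + nr_superpages * lvl_pages - 1;

	/* Option 1: round nr_pages down to a whole number of superpages. */
	unsigned long flush_a = ALIGN_DOWN(nr_pages, lvl_pages);
	/* Option 2: reuse the range that was just freed above. */
	unsigned long flush_b = end_pfn - iov_pfn + 1;
	/* Both equal nr_superpages * lvl_pages. */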

> 
>> +    }
>>   } else {
>>   pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;
>>   }
>>
> 
> Best regards,
> baolu
> .


[PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Longpeng(Mike)
The translation caches may preserve obsolete data when the
mapping size is changed. Consider the following sequence, which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
  while (1) {
   (a)DMA MAP   0,0xa
   (b)DMA UNMAP 0,0xa
   (c)DMA MAP   0,0xc000
 * DMA read of IOVA 0 may fail here (not-present)
 * if the problem occurs.
   (d)DMA UNMAP 0,0xc000
  }

The page table (only the entries covering IOVA 0) after (a) is:
 PML4: 0x19db5c1003   entry:0x899bdcd2f000
  PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
   PDE: 0x1a30a72003  entry:0x89b39cacb000
PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
 PML4: 0x19db5c1003   entry:0x899bdcd2f000
  PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
   PDE: 0x1a30a72003  entry:0x89b39cacb000
PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
 PML4: 0x19db5c1003   entry:0x899bdcd2f000
  PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
   PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is still present, it won't be
flushed even though the iommu driver flushes the cache on unmap,
so the obsolete data may be preserved in the cache, which
causes a wrong translation in the end.

However, we can see the PDE entry is finally switched to a
2M-superpage mapping, but it does not transform
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
 dma_pte_free_pagetable
   Set the PDE entry to ZERO
 Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid the obsolete info being preserved.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage")
Cc:  # v3.0+
Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
 drivers/iommu/intel/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+* We also need to flush the iotlb before creating
+* superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {
+   int i;
+
+   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+  largepage_lvl + 1);
+   for_each_domain_iommu(i, domain)
+   iommu_flush_iotlb_psi(g_iommus[i], domain,
+ iov_pfn, nr_pages, 0, 0);
+   }
} else {
pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;
}
-- 
1.8.3.1



RE: A problem of Intel IOMMU hardware ?

2021-03-21 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> Sent: Monday, March 22, 2021 7:51 AM
> To: 'Nadav Amit' 
> Cc: Tian, Kevin ; chenjiashang
> ; David Woodhouse ;
> io...@lists.linux-foundation.org; LKML ;
> alex.william...@redhat.com; Gonglei (Arei) ;
> w...@kernel.org; 'Lu Baolu' ; 'Joerg Roedel'
> 
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> Hi Nadav,
> 
> > -Original Message-
> > From: Nadav Amit [mailto:nadav.a...@gmail.com]
> > Sent: Friday, March 19, 2021 12:46 AM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > 
> > Cc: Tian, Kevin ; chenjiashang
> > ; David Woodhouse ;
> > io...@lists.linux-foundation.org; LKML ;
> > alex.william...@redhat.com; Gonglei (Arei) ;
> > w...@kernel.org
> > Subject: Re: A problem of Intel IOMMU hardware ?
> >
> >
> >
> > > On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure
> > > Service
> > Product Dept.)  wrote:
> > >
> > >
> > >
> > >> -Original Message-
> > >> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > >> Sent: Thursday, March 18, 2021 4:56 PM
> > >> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >> ; Nadav Amit 
> > >> Cc: chenjiashang ; David Woodhouse
> > >> ; io...@lists.linux-foundation.org; LKML
> > >> ; alex.william...@redhat.com; Gonglei
> > >> (Arei) ; w...@kernel.org
> > >> Subject: RE: A problem of Intel IOMMU hardware ?
> > >>
> > >>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >>> 
> > >>>
> > >>>> -Original Message-
> > >>>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > >>>> Sent: Thursday, March 18, 2021 4:27 PM
> > >>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >>>> ; Nadav Amit 
> > >>>> Cc: chenjiashang ; David Woodhouse
> > >>>> ; io...@lists.linux-foundation.org; LKML
> > >>>> ; alex.william...@redhat.com;
> > >>>> Gonglei
> > >>> (Arei)
> > >>>> ; w...@kernel.org
> > >>>> Subject: RE: A problem of Intel IOMMU hardware ?
> > >>>>
> > >>>>> From: iommu  On Behalf
> > >>>>> Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >>>>>
> > >>>>>> 2. Consider ensuring that the problem is not somehow related to
> > >>>>>> queued invalidations. Try to use __iommu_flush_iotlb() instead
> > >>>>>> of
> > >>>> qi_flush_iotlb().
> > >>>>>>
> > >>>>>
> > >>>>> I tried to force to use __iommu_flush_iotlb(), but maybe
> > >>>>> something wrong, the system crashed, so I prefer to lower the
> > >>>>> priority of this
> > >>> operation.
> > >>>>>
> > >>>>
> > >>>> The VT-d spec clearly says that register-based invalidation can
> > >>>> be used only
> > >>> when
> > >>>> queued-invalidations are not enabled. Intel-IOMMU driver doesn't
> > >>>> provide
> > >>> an
> > >>>> option to disable queued-invalidation though, when the hardware
> > >>>> is
> > >>> capable. If you
> > >>>> really want to try, tweak the code in intel_iommu_init_qi.
> > >>>>
> > >>>
> > >>> Hi Kevin,
> > >>>
> > >>> Thanks to point out this. Do you have any ideas about this problem ?
> > >>> I tried to descript the problem much clear in my reply to Alex,
> > >>> hope you could have a look if you're interested.
> > >>>
> > >>
> > >> btw I saw you used 4.18 kernel in this test. What about latest kernel?
> > >>
> > >
> > > Not test yet. It's hard to upgrade kernel in our environment.
> > >
> > >> Also one way to separate sw/hw bug is to trace the low level
> > >> interface (e.g.,
> > >> qi_flush_iotlb) which actually sends invalidation descriptors to
> > >> the IOMMU hardware. Check the window between b) and c) and see
> > >> whether the software does the right thing as expected the

RE: A problem of Intel IOMMU hardware ?

2021-03-21 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Nadav,

> -Original Message-
> From: Nadav Amit [mailto:nadav.a...@gmail.com]
> Sent: Friday, March 19, 2021 12:46 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> Cc: Tian, Kevin ; chenjiashang
> ; David Woodhouse ;
> io...@lists.linux-foundation.org; LKML ;
> alex.william...@redhat.com; Gonglei (Arei) ;
> w...@kernel.org
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> 
> 
> > On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.)  wrote:
> >
> >
> >
> >> -Original Message-
> >> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> >> Sent: Thursday, March 18, 2021 4:56 PM
> >> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >> ; Nadav Amit 
> >> Cc: chenjiashang ; David Woodhouse
> >> ; io...@lists.linux-foundation.org; LKML
> >> ; alex.william...@redhat.com; Gonglei
> >> (Arei) ; w...@kernel.org
> >> Subject: RE: A problem of Intel IOMMU hardware ?
> >>
> >>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >>> 
> >>>
> >>>> -Original Message-
> >>>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> >>>> Sent: Thursday, March 18, 2021 4:27 PM
> >>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >>>> ; Nadav Amit 
> >>>> Cc: chenjiashang ; David Woodhouse
> >>>> ; io...@lists.linux-foundation.org; LKML
> >>>> ; alex.william...@redhat.com; Gonglei
> >>> (Arei)
> >>>> ; w...@kernel.org
> >>>> Subject: RE: A problem of Intel IOMMU hardware ?
> >>>>
> >>>>> From: iommu  On Behalf
> >>>>> Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >>>>>
> >>>>>> 2. Consider ensuring that the problem is not somehow related to
> >>>>>> queued invalidations. Try to use __iommu_flush_iotlb() instead of
> >>>> qi_flush_iotlb().
> >>>>>>
> >>>>>
> >>>>> I tried to force to use __iommu_flush_iotlb(), but maybe something
> >>>>> wrong, the system crashed, so I prefer to lower the priority of
> >>>>> this
> >>> operation.
> >>>>>
> >>>>
> >>>> The VT-d spec clearly says that register-based invalidation can be
> >>>> used only
> >>> when
> >>>> queued-invalidations are not enabled. Intel-IOMMU driver doesn't
> >>>> provide
> >>> an
> >>>> option to disable queued-invalidation though, when the hardware is
> >>> capable. If you
> >>>> really want to try, tweak the code in intel_iommu_init_qi.
> >>>>
> >>>
> >>> Hi Kevin,
> >>>
> >>> Thanks to point out this. Do you have any ideas about this problem ?
> >>> I tried to descript the problem much clear in my reply to Alex, hope
> >>> you could have a look if you're interested.
> >>>
> >>
> >> btw I saw you used 4.18 kernel in this test. What about latest kernel?
> >>
> >
> > Not test yet. It's hard to upgrade kernel in our environment.
> >
> >> Also one way to separate sw/hw bug is to trace the low level
> >> interface (e.g.,
> >> qi_flush_iotlb) which actually sends invalidation descriptors to the
> >> IOMMU hardware. Check the window between b) and c) and see whether
> >> the software does the right thing as expected there.
> >>
> >
> > We add some log in iommu driver these days, the software seems fine.
> > But we didn't look inside the qi_submit_sync yet, I'll try it tonight.
> 
> So here is my guess:
> 
> Intel probably used as a basis for the IOTLB an implementation of some other
> (regular) TLB design.
> 
> Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):
> 
> "Software wishing to prevent this uncertainty should not write to a
> paging-structure entry in a way that would change, for any linear address, 
> both the
> page size and either the page frame, access rights, or other attributes.”
> 
> 
> Now the aforementioned uncertainty is a bit different (multiple
> *valid* translations of a single address). Yet, perhaps this is yet another 
> thing that
> might happen.
> 
> From a brief look on the handling of MMU (not IOMMU) hugepages in Linux, 
> indeed
> the PMD is first 

RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Thursday, March 18, 2021 4:56 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; Nadav Amit 
> Cc: chenjiashang ; David Woodhouse
> ; io...@lists.linux-foundation.org; LKML
> ; alex.william...@redhat.com; Gonglei (Arei)
> ; w...@kernel.org
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> > From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > 
> >
> > > -Original Message-
> > > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > > Sent: Thursday, March 18, 2021 4:27 PM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > ; Nadav Amit 
> > > Cc: chenjiashang ; David Woodhouse
> > > ; io...@lists.linux-foundation.org; LKML
> > > ; alex.william...@redhat.com; Gonglei
> > (Arei)
> > > ; w...@kernel.org
> > > Subject: RE: A problem of Intel IOMMU hardware ?
> > >
> > > > From: iommu  On Behalf
> > > > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > >
> > > > > 2. Consider ensuring that the problem is not somehow related to
> > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead
> > > > > of
> > > qi_flush_iotlb().
> > > > >
> > > >
> > > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > > wrong, the system crashed, so I prefer to lower the priority of
> > > > this
> > operation.
> > > >
> > >
> > > The VT-d spec clearly says that register-based invalidation can be
> > > used only
> > when
> > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't
> > > provide
> > an
> > > option to disable queued-invalidation though, when the hardware is
> > capable. If you
> > > really want to try, tweak the code in intel_iommu_init_qi.
> > >
> >
> > Hi Kevin,
> >
> > Thanks to point out this. Do you have any ideas about this problem ? I
> > tried to descript the problem much clear in my reply to Alex, hope you
> > could have a look if you're interested.
> >
> 
> btw I saw you used 4.18 kernel in this test. What about latest kernel?
> 

Not tested yet. It's hard to upgrade the kernel in our environment.

> Also one way to separate sw/hw bug is to trace the low level interface (e.g.,
> qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU
> hardware. Check the window between b) and c) and see whether the software does
> the right thing as expected there.
> 

We added some logs in the iommu driver these days; the software seems fine. But we
didn't look inside qi_submit_sync yet, I'll try it tonight.

> Thanks
> Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Thursday, March 18, 2021 4:43 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; Nadav Amit 
> Cc: chenjiashang ; David Woodhouse
> ; io...@lists.linux-foundation.org; LKML
> ; alex.william...@redhat.com; Gonglei (Arei)
> ; w...@kernel.org
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> > From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > 
> >
> >
> > > -Original Message-
> > > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > > Sent: Thursday, March 18, 2021 4:27 PM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > ; Nadav Amit 
> > > Cc: chenjiashang ; David Woodhouse
> > > ; io...@lists.linux-foundation.org; LKML
> > > ; alex.william...@redhat.com; Gonglei
> > (Arei)
> > > ; w...@kernel.org
> > > Subject: RE: A problem of Intel IOMMU hardware ?
> > >
> > > > From: iommu  On Behalf
> > > > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > >
> > > > > 2. Consider ensuring that the problem is not somehow related to
> > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead
> > > > > of
> > > qi_flush_iotlb().
> > > > >
> > > >
> > > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > > wrong, the system crashed, so I prefer to lower the priority of
> > > > this
> > operation.
> > > >
> > >
> > > The VT-d spec clearly says that register-based invalidation can be
> > > used only
> > when
> > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't
> > > provide
> > an
> > > option to disable queued-invalidation though, when the hardware is
> > capable. If you
> > > really want to try, tweak the code in intel_iommu_init_qi.
> > >
> >
> > Hi Kevin,
> >
> > Thanks to point out this. Do you have any ideas about this problem ? I
> > tried to descript the problem much clear in my reply to Alex, hope you
> > could have a look if you're interested.
> >
> 
> I agree with Nadav. Looks this implies some stale paging structure cache 
> entry (e.g.
> PMD) is not invalidated properly. It's better if Baolu can reproduce this 
> problem in
> his local environment and then do more debug to identify whether it's a 
> software or
> hardware defect.
> 
> btw what is the device under test? Does it support ATS?
> 

The device is our offload card; it does not support the ATS capability.

> Thanks
> Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Thursday, March 18, 2021 4:27 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; Nadav Amit 
> Cc: chenjiashang ; David Woodhouse
> ; io...@lists.linux-foundation.org; LKML
> ; alex.william...@redhat.com; Gonglei (Arei)
> ; w...@kernel.org
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> > From: iommu  On Behalf Of
> > Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >
> > > 2. Consider ensuring that the problem is not somehow related to
> > > queued invalidations. Try to use __iommu_flush_iotlb() instead of
> qi_flush_iotlb().
> > >
> >
> > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > wrong, the system crashed, so I prefer to lower the priority of this 
> > operation.
> >
> 
> The VT-d spec clearly says that register-based invalidation can be used only 
> when
> queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide an
> option to disable queued-invalidation though, when the hardware is capable. 
> If you
> really want to try, tweak the code in intel_iommu_init_qi.
> 

Hi Kevin,

Thanks for pointing this out. Do you have any ideas about this problem? I tried
to describe the problem more clearly in my reply to Alex; hope you could have
a look if you're interested.

> Thanks
> Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Nadav,

> -Original Message-
> From: Nadav Amit [mailto:nadav.a...@gmail.com]
> Sent: Thursday, March 18, 2021 2:13 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> Cc: David Woodhouse ; Lu Baolu
> ; Joerg Roedel ; w...@kernel.org;
> alex.william...@redhat.com; chenjiashang ;
> io...@lists.linux-foundation.org; Gonglei (Arei) ;
> LKML 
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> 
> 
> > On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.)  wrote:
> >
> > Hi Nadav,
> >
> >> -Original Message-
> >> From: Nadav Amit [mailto:nadav.a...@gmail.com]
> >>>  reproduce the problem with high probability (~50%).
> >>
> >> I saw Lu replied, and he is much more knowledgable than I am (I was
> >> just intrigued by your email).
> >>
> >> However, if I were you I would try also to remove some
> >> “optimizations” to look for the root-cause (e.g., use domain specific
> invalidations instead of page-specific).
> >>
> >
> > Good suggestion! But we did it these days, we tried to use global 
> > invalidations as
> follow:
> > iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > DMA_TLB_DSI_FLUSH);
> > But can not resolve the problem.
> >
> >> The first thing that comes to my mind is the invalidation hint (ih)
> >> in iommu_flush_iotlb_psi(). I would remove it to see whether you get
> >> the failure without it.
> >
> > We also notice the IH, but the IH is always ZERO in our case, as the spec 
> > says:
> > '''
> > Paging-structure-cache entries caching second-level mappings
> > associated with the specified domain-id and the
> > second-level-input-address range are invalidated, if the Invalidation
> > Hint
> > (IH) field is Clear.
> > '''
> >
> > It seems the software is everything fine, so we've no choice but to suspect 
> > the
> hardware.
> 
> Ok, I am pretty much out of ideas. I have two more suggestions, but they are 
> much
> less likely to help. Yet, they can further help to rule out software bugs:
> 
> 1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE()
> to prevent split-write, which might potentially cause “invalid” (partially
> cleared) PTE to be stored in the TLB. Having said that, the subsequent IOTLB 
> flush
> should have prevented the problem.
> 

Yes, using WRITE_ONCE is much safer; however, I just tested the following code,
and it didn't resolve my problem.

static inline void dma_clear_pte(struct dma_pte *pte)
{
WRITE_ONCE(pte->val, 0ULL);
}
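
For comparison, the existing helper (as I understand it) is just a plain store,
which the compiler is in principle free to split into multiple writes:

static inline void dma_clear_pte(struct dma_pte *pte)
{
	pte->val = 0;
}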

> 2. Consider ensuring that the problem is not somehow related to queued
> invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb().
> 

I tried to force the use of __iommu_flush_iotlb(), but maybe something went wrong,
the system crashed, so I prefer to lower the priority of this operation.

> Regards,
> Nadav


RE: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi guys,

I provide more information, please see below

> -Original Message-
> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> Sent: Thursday, March 18, 2021 10:59 AM
> To: Alex Williamson 
> Cc: baolu...@linux.intel.com; Longpeng (Mike, Cloud Infrastructure Service 
> Product
> Dept.) ; dw...@infradead.org; j...@8bytes.org;
> w...@kernel.org; io...@lists.linux-foundation.org; LKML
> ; Gonglei (Arei) ;
> chenjiashang 
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> Hi Alex,
> 
> On 3/17/21 11:18 PM, Alex Williamson wrote:
> >>>   {MAP,   0x0, 0xc000}, - (b)
> >>>   use GDB to pause at here, and then DMA read
> >>> IOVA=0,
> >> IOVA 0 seems to be a special one. Have you verified with other
> >> addresses than IOVA 0?
> > It is???  That would be a problem.
> >
> 
> No problem from hardware point of view as far as I can see. Just thought about
> software might handle it specially.
> 

We simplified the reproducer; the following map/unmap sequence can also
reproduce the problem.

1. use 2M hugetlbfs to mmap 4G memory

2. run the while loop:
While (1) {
DMA MAP (0, 0xa) - - - - - - - - - - - - - -(a)
DMA UNMAP (0, 0xa) - - - - - - - - - - - (b)
  Operation-1 : dump DMAR table
DMA MAP (0, 0xc000) - - - - - - - - - - -(c)
  Operation-2 :
 use GDB to pause at here, then DMA read IOVA=0,
 sometimes DMA success (as expected),
 but sometimes DMA error (report not-present).
  Operation-3 : dump DMAR table
  Operation-4 (when DMA error) : please see below
DMA UNMAP (0, 0xc000) - - - - - - - - -(d)
}

The DMAR table of Operation-1 is (only show the entries about IOVA 0):

PML4: 0x  1a34fbb003
  PDPE: 0x  1a34fbb003
   PDE: 0x  1a34fbf003
PTE: 0x   0

And the table of Operation-3 is:

PML4: 0x  1a34fbb003
  PDPE: 0x  1a34fbb003
   PDE: 0x   15ec00883 < - - 2M superpage

So we can see IOVA 0 is mapped, but the DMA read reports an error:

dmar_fault: 131757 callbacks suppressed
DRHD: handling fault status reg 402
[DMA Read] Request device [86:05.6] fault addr 0 [fault reason 06] PTE Read 
access is not set
[DMA Read] Request device [86:05.6] fault addr 0 [fault reason 06] PTE Read 
access is not set
DRHD: handling fault status reg 600
DRHD: handling fault status reg 602
[DMA Read] Request device [86:05.6] fault addr 0 [fault reason 06] PTE Read 
access is not set
[DMA Read] Request device [86:05.6] fault addr 0 [fault reason 06] PTE Read 
access is not set
[DMA Read] Request device [86:05.6] fault addr 0 [fault reason 06] PTE Read 
access is not set

NOTE, the magical thing happens... (*Operation-4*) we write the PTE
of Operation-1 from 0 to 0x3, which means Read/Write is allowed, and then
we trigger the DMA read again; it succeeds and returns the data of HPA 0 !!

Why would modifying the older page table make any difference? As we
have discussed previously, the cache flush part of the driver is correct:
it calls flush_iotlb after (b) and no flush is needed after (c). But the result
of the experiment shows the older page table or older caches are still
effective.
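
For reference, the "dump DMAR table" operations above just walk the domain's
second-level page table for IOVA 0, roughly like the sketch below (an
illustration based on the intel-iommu helpers, not the exact debug patch we
used):

static void dump_pgtable_for_iova(struct dmar_domain *domain, unsigned long pfn)
{
	struct dma_pte *pte = domain->pgd;
	int level = agaw_to_level(domain->agaw);

	while (level > 0) {
		unsigned int offset = pfn_level_offset(pfn, level);

		pte = &pte[offset];
		pr_info("level %d: entry %px val 0x%llx\n",
			level, pte, (unsigned long long)pte->val);

		if (!dma_pte_present(pte) || dma_pte_superpage(pte))
			break;
		/* descend to the next level table */
		pte = phys_to_virt(dma_pte_addr(pte));
		level--;
	}
}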

Any ideas ?

> Best regards,
> baolu


RE: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Baolu,

> -Original Message-
> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> Sent: Wednesday, March 17, 2021 1:17 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; dw...@infradead.org; j...@8bytes.org;
> w...@kernel.org; alex.william...@redhat.com
> Cc: baolu...@linux.intel.com; io...@lists.linux-foundation.org; LKML
> ; Gonglei (Arei) ;
> chenjiashang 
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> Hi Longpeng,
> 
> On 3/17/21 11:16 AM, Longpeng (Mike, Cloud Infrastructure Service Product 
> Dept.)
> wrote:
> > Hi guys,
> >
> > We find the Intel iommu cache (i.e. iotlb) maybe works wrong in a
> > special situation, it would cause DMA fails or get wrong data.
> >
> > The reproducer (based on Alex's vfio testsuite[1]) is in attachment,
> > it can reproduce the problem with high probability (~50%).
> >
> > The machine we used is:
> > processor   : 47
> > vendor_id   : GenuineIntel
> > cpu family  : 6
> > model   : 85
> > model name  : Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
> > stepping: 4
> > microcode   : 0x269
> >
> > And the iommu capability reported is:
> > ver 1:0 cap 8d2078c106f0466 ecap f020df (caching mode = 0 ,
> > page-selective invalidation = 1)
> >
> > (The problem is also on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz'
> > and
> > 'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz')
> >
> > We run the reproducer on Linux 4.18 and it works as follow:
> >
> > Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page
> > mapping)
> 
> I don't understand 2M-hugetlb here means exactly. The IOMMU hardware
> supports both 2M and 1G super page. The mapping physical memory is 4G.
> Why couldn't it use 1G super page?
> 

We use hugetlbfs (it supports both 1G and 2M, but we choose 2M in our case) to
request the memory in userspace:
vaddr = (unsigned long)mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | *MAP_HUGETLB*, 0, 0);

Yep, the IOMMU supports both 2M and 1G superpages; we just haven't tested the 1G
case yet, because our production systems use 2M hugetlbfs pages.

> > Step 2. DMA Map 4G memory
> > Step 3.
> >  while (1) {
> >  {UNMAP, 0x0, 0xa},  (a)
> >  {UNMAP, 0xc, 0xbff4},
> 
> Have these two ranges been mapped before? Does the IOMMU driver complains
> when you trying to unmap a range which has never been mapped? The IOMMU
> driver implicitly assumes that mapping and unmapping are paired.
> 

Of course yes, please see Step 2: we DMA mapped all the memory (4G) before the
while loop. The driver never complained during MAP and UNMAP operations.

> >  {MAP,   0x0, 0xc000}, - (b)
> >  use GDB to pause at here, and then DMA read IOVA=0,
> 
> IOVA 0 seems to be a special one. Have you verified with other addresses than
> IOVA 0?
> 

Yes, we also tested IOVA=0x1000; it has the problem too.

But one of the differences between (0x0, 0xa) and (0x0, 0xc000) is that the
former can only use 4K mappings in the DMA page table while the latter uses 2M
mappings. Is it possible that the hardware cache management does something
wrong in this case?

> >  sometimes DMA success (as expected),
> >  but sometimes DMA error (report not-present).
> >  {UNMAP, 0x0, 0xc000}, - (c)
> >  {MAP,   0x0, 0xa},
> >  {MAP,   0xc, 0xbff4},
> >  }
> >
> > The DMA read operations sholud success between (b) and (c), it should
> > NOT report not-present at least!
> >
> > After analysis the problem, we think maybe it's caused by the Intel iommu 
> > iotlb.
> > It seems the DMA Remapping hardware still uses the IOTLB or other caches of
> (a).
> >
> > When do DMA unmap at (a), the iotlb will be flush:
> >  intel_iommu_unmap
> >  domain_unmap
> >  iommu_flush_iotlb_psi
> >
> > When do DMA map at (b), no need to flush the iotlb according to the
> > capability of this iommu:
> >  intel_iommu_map
> >  domain_pfn_mapping
> >  domain_mapping
> >  __mapping_notify_one
> >  if (cap_caching_mode(iommu->cap)) // FALSE
> >  iommu_flush_iotlb_psi
> 
> That's true. The iotlb flushing is not needed in case of PTE been changed from
> non-present to present unless caching mode.
> 

Yes, I also think the driver code is correct. But it's so confusing that the
problem disappears if we force it to flush here.

> > But the problem will disappear if we FORCE flush here. So we suspect
> > the iommu hardware.
> >
> > Do you have any suggestion ?
> 
> Best regards,
> baolu


RE: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Nadav,

> -Original Message-
> From: Nadav Amit [mailto:nadav.a...@gmail.com]
> Sent: Wednesday, March 17, 2021 1:46 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> Cc: David Woodhouse ; Lu Baolu
> ; Joerg Roedel ; w...@kernel.org;
> alex.william...@redhat.com; chenjiashang ;
> io...@lists.linux-foundation.org; Gonglei (Arei) ;
> LKML 
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> 
> 
> > On Mar 16, 2021, at 8:16 PM, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.)  wrote:
> >
> > Hi guys,
> >
> > We find the Intel iommu cache (i.e. iotlb) maybe works wrong in a
> > special situation, it would cause DMA fails or get wrong data.
> >
> > The reproducer (based on Alex's vfio testsuite[1]) is in attachment,
> > it can reproduce the problem with high probability (~50%).
> 
> I saw Lu replied, and he is much more knowledgable than I am (I was just 
> intrigued
> by your email).
> 
> However, if I were you I would try also to remove some “optimizations” to 
> look for
> the root-cause (e.g., use domain specific invalidations instead of 
> page-specific).
> 

Good suggestion! We already tried that these days; we used global invalidations
as follows:
iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
but it could not resolve the problem.

> The first thing that comes to my mind is the invalidation hint (ih) in
> iommu_flush_iotlb_psi(). I would remove it to see whether you get the failure
> without it.

We also noticed the IH, but the IH is always ZERO in our case, as the spec says:
'''
Paging-structure-cache entries caching second-level mappings associated with 
the specified
domain-id and the second-level-input-address range are invalidated, if the 
Invalidation Hint
(IH) field is Clear.
'''

It seems the software side is all fine, so we have no choice but to suspect the
hardware.


A problem of Intel IOMMU hardware ?

2021-03-16 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi guys,

We found that the Intel iommu cache (i.e. iotlb) may work incorrectly in a special
situation, causing DMA failures or wrong data.

The reproducer (based on Alex's vfio test suite [1]) is attached; it can
reproduce the problem with high probability (~50%).

The machine we used is:
processor   : 47
vendor_id   : GenuineIntel
cpu family  : 6
model   : 85
model name  : Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
stepping: 4
microcode   : 0x269

And the iommu capability reported is:
ver 1:0 cap 8d2078c106f0466 ecap f020df
(caching mode = 0 , page-selective invalidation = 1)

(The problem is also on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz' and
'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz')

We run the reproducer on Linux 4.18 and it works as follows:

Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page mapping)
Step 2. DMA Map 4G memory
Step 3.
while (1) {
{UNMAP, 0x0, 0xa},  (a)
{UNMAP, 0xc, 0xbff4},
{MAP,   0x0, 0xc000}, - (b)
use GDB to pause at here, and then DMA read IOVA=0,
sometimes DMA success (as expected),
but sometimes DMA error (report not-present).
{UNMAP, 0x0, 0xc000}, - (c)
{MAP,   0x0, 0xa},
{MAP,   0xc, 0xbff4},
}
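
For reference, the attached reproducer issues these MAP/UNMAP operations through
the VFIO type1 API, roughly like the sketch below (a simplification, not the
attachment itself; container/group setup, the hugetlb mmap and error handling
are omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void dma_map(int container, uint64_t iova, uint64_t size, uint64_t vaddr)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = vaddr,
		.iova  = iova,
		.size  = size,
	};

	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}

static void dma_unmap(int container, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = iova,
		.size  = size,
	};

	ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}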

The DMA read operations should succeed between (b) and (c); at the very least
they should NOT report not-present!

After analyzing the problem, we think it may be caused by the Intel iommu iotlb.
It seems the DMA Remapping hardware still uses the IOTLB or other caches from (a).

When do DMA unmap at (a), the iotlb will be flush:
intel_iommu_unmap
domain_unmap
iommu_flush_iotlb_psi

When do DMA map at (b), no need to flush the iotlb according to the capability
of this iommu:
intel_iommu_map
domain_pfn_mapping
domain_mapping
__mapping_notify_one
if (cap_caching_mode(iommu->cap)) // FALSE
iommu_flush_iotlb_psi
But the problem disappears if we FORCE a flush here. So we suspect the iommu
hardware.
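
To be concrete, the "FORCE flush" experiment is along these lines (a debug-only
sketch modeled on __mapping_notify_one(); the exact code on 4.18 may differ
slightly):

static void __mapping_notify_one(struct intel_iommu *iommu,
				 struct dmar_domain *domain,
				 unsigned long pfn, unsigned int pages)
{
	/*
	 * The driver normally only flushes here in caching mode; for the
	 * experiment we issue the page-selective flush unconditionally.
	 */
	iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1);
}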

Do you have any suggestion ?







/*
 * VFIO API definition
 *
 * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
 * Author: Alex Williamson 
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
#ifndef _UAPIVFIO_H
#define _UAPIVFIO_H

#include 
#include 

#define VFIO_API_VERSION0


/* Kernel & User level defines for VFIO IOCTLs. */

/* Extensions */

#define VFIO_TYPE1_IOMMU1

/*
 * The IOCTL interface is designed for extensibility by embedding the
 * structure length (argsz) and flags into structures passed between
 * kernel and userspace.  We therefore use the _IO() macro for these
 * defines to avoid implicitly embedding a size into the ioctl request.
 * As structure fields are added, argsz will increase to match and flag
 * bits will be defined to indicate additional fields with valid data.
 * It's *always* the caller's responsibility to indicate the size of
 * the structure passed by setting argsz appropriately.
 */

#define VFIO_TYPE   (';')
#define VFIO_BASE   100

/*  IOCTLs for VFIO file descriptor (/dev/vfio/vfio)  */

/**
 * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
 *
 * Report the version of the VFIO API.  This allows us to bump the entire
 * API version should we later need to add or change features in incompatible
 * ways.
 * Return: VFIO_API_VERSION
 * Availability: Always
 */
#define VFIO_GET_API_VERSION_IO(VFIO_TYPE, VFIO_BASE + 0)

/**
 * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32)
 *
 * Check whether an extension is supported.
 * Return: 0 if not supported, 1 (or some other positive integer) if supported.
 * Availability: Always
 */
#define VFIO_CHECK_EXTENSION_IO(VFIO_TYPE, VFIO_BASE + 1)

/**
 * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
 *
 * Set the iommu to the given type.  The type must be supported by an
 * iommu driver as verified by calling CHECK_EXTENSION using the same
 * type.  A group must be set to this file descriptor before this
 * ioctl is available.  The IOMMU interfaces enabled by this call are
 * specific to the value set.
 * Return: 0 on success, -errno on failure
 * Availability: When VFIO group attached
 */
#define VFIO_SET_IOMMU  _IO(VFIO_TYPE, VFIO_BASE + 2)

/*  IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP)  */

/**
 * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
 *  struct vfio_group_status)
 *
 * Retrieve information about the group.  Fills in provided
 * struct vfio_group_info.  Caller sets argsz.
 

Re: [PATCH] nitro_enclaves: set master in the procedure of NE probe

2021-01-24 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



On 2021/1/20 18:27, Paraschiv, Andra-Irina wrote:
> 
> 
> On 19/01/2021 05:30, Longpeng(Mike) wrote:
>> According the PCI spec:
>>    Bus Master Enable – Controls the ability of a PCI Express
>>    Endpoint to issue Memory and I/O Read/Write Requests, and
>>    the ability of a Root or Switch Port to forward Memory and
>>    I/O Read/Write Requests in the Upstream direction
>>
>> Set BusMaster to make the driver to be PCI conformant.
> 
> Could update the commit title and message body to reflect more the why and 
> what for the change. One option can be:
> 
> nitro_enclaves: Set Bus Master for the NE PCI device
> 
> Enable Bus Master for the NE PCI device, according to the PCI spec
> for submitting memory or I/O requests:
>   Bus Master Enable ...
> 
> 
> 
> Please include the changelog in the commit message for the next revision(s).
> 
> + Greg in CC, as the patches for the Nitro Enclaves kernel driver are first 
> included in the char misc tree, then in the linux next and finally in the 
> mainline.
> 
Will update the commit message in V2.

>>
>> Signed-off-by: Longpeng(Mike) 
>> ---
>>   drivers/virt/nitro_enclaves/ne_pci_dev.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.c 
>> b/drivers/virt/nitro_enclaves/ne_pci_dev.c
>> index b9c1de4..143207e 100644
>> --- a/drivers/virt/nitro_enclaves/ne_pci_dev.c
>> +++ b/drivers/virt/nitro_enclaves/ne_pci_dev.c
>> @@ -480,6 +480,8 @@ static int ne_pci_probe(struct pci_dev *pdev, const 
>> struct pci_device_id *id)
>>  goto free_ne_pci_dev;
>>  }
>>
>> +   pci_set_master(pdev);
> 
> I was looking if we need the reverse for this - pci_clear_master() [1] - on 
> the error and remove / shutdown codebase paths, but pci_disable_device() 
> seems to include the bus master disable logic [2][3].
> 
No need to call pci_clear_master.
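
Right, the resulting probe error path then looks roughly like the sketch below
(simplified, not the actual ne_pci_probe() code), where pci_disable_device()
undoes the bus mastering:

#include <linux/pci.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int rc;

	rc = pci_enable_device(pdev);
	if (rc < 0)
		return rc;

	pci_set_master(pdev);

	rc = pci_request_regions_exclusive(pdev, "nitro_enclaves");
	if (rc < 0)
		goto disable_dev;

	return 0;

disable_dev:
	/* pci_disable_device() also disables the bus mastering set above. */
	pci_disable_device(pdev);
	return rc;
}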

> Thanks,
> Andra
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c?h=v5.11-rc4#n4312
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c?h=v5.11-rc4#n2104
> [3] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c?h=v5.11-rc4#n4242
> 
>> +
>>  rc = pci_request_regions_exclusive(pdev, "nitro_enclaves");
>>  if (rc < 0) {
>>  dev_err(>dev, "Error in pci request regions 
>> [rc=%d]\n", rc);
>> -- 
>> 1.8.3.1
>>
> 
> 
> 
> 


[PATCH] nitro_enclaves: set master in the procedure of NE probe

2021-01-18 Thread Longpeng(Mike)
According to the PCI spec:
  Bus Master Enable – Controls the ability of a PCI Express
  Endpoint to issue Memory and I/O Read/Write Requests, and
  the ability of a Root or Switch Port to forward Memory and
  I/O Read/Write Requests in the Upstream direction

Set BusMaster to make the driver PCI conformant.

Signed-off-by: Longpeng(Mike) 
---
 drivers/virt/nitro_enclaves/ne_pci_dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.c b/drivers/virt/nitro_enclaves/ne_pci_dev.c
index b9c1de4..143207e 100644
--- a/drivers/virt/nitro_enclaves/ne_pci_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_pci_dev.c
@@ -480,6 +480,8 @@ static int ne_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto free_ne_pci_dev;
}
 
+   pci_set_master(pdev);
+
rc = pci_request_regions_exclusive(pdev, "nitro_enclaves");
if (rc < 0) {
dev_err(&pdev->dev, "Error in pci request regions [rc=%d]\n", rc);
-- 
1.8.3.1



Re: [PATCH v3 1/3] crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()

2020-06-10 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



On 2020/6/5 22:10, Sasha Levin wrote:
> 
> Hi
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag
> fixing commit: dbaf0624ffa5 ("crypto: add virtio-crypto driver").
> 
> The bot has tested the following trees: v5.6.15, v5.4.43, v4.19.125, 
> v4.14.182.
> 
> v5.6.15: Build OK!
> v5.4.43: Failed to apply! Possible dependencies:
> eee1d6fca0a0 ("crypto: virtio - switch to skcipher API")
> 
> v4.19.125: Failed to apply! Possible dependencies:
> eee1d6fca0a0 ("crypto: virtio - switch to skcipher API")
> 
> v4.14.182: Failed to apply! Possible dependencies:
> 500e6807ce93 ("crypto: virtio - implement missing support for output IVs")
> 67189375bb3a ("crypto: virtio - convert to new crypto engine API")
> d0d859bb87ac ("crypto: virtio - Register an algo only if it's supported")
> e02b8b43f55a ("crypto: virtio - pr_err() strings should end with 
> newlines")
> eee1d6fca0a0 ("crypto: virtio - switch to skcipher API")
> 
> 
> NOTE: The patch will not be queued to stable trees until it is upstream.
> 
> How should we proceed with this patch?
> 
I've tried to adapt my patch to these stable trees, but it seems there are some
other bugs, so I think the best way to resolve these conflicts is to apply the
dependent patches detected.

If we apply these dependent patches, then the conflicts of the other two patches:
 crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req
 crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req
will also be gone.

---
Regards,
Longpeng(Mike)


[PATCH v3 1/3] crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()

2020-06-02 Thread Longpeng(Mike)
The system will crash when users insmod crypto/tcrypt.ko with mode=38
(testing "cts(cbc(aes))").

Usually the next entry of one sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, the next entry could jump to the start of a
new scatterlist array. Fix it by using sg_next() when building the src/dst
scatterlists.
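
To illustrate the difference (a standalone sketch, not part of the patch):
indexing assumes all entries live in one contiguous array, while sg_next()
follows the chain:

#include <linux/printk.h>
#include <linux/scatterlist.h>

static void walk_chained_sgl(struct scatterlist *first, int nents)
{
	struct scatterlist *sg;
	int i;

	/* Wrong for chained lists: assumes the entries form one contiguous array. */
	for (i = 0; i < nents; i++)
		pr_debug("entry %d at %px\n", i, &first[i]);

	/* Correct: sg_next() skips chain entries and crosses array boundaries. */
	for (sg = first, i = 0; sg && i < nents; sg = sg_next(sg), i++)
		pr_debug("entry %d at %px\n", i, sg);
}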

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
Message-Id: <20200123101000.GB24255@Red>
Signed-off-by: Gonglei 
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index fd045e64..5f82435 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -350,13 +350,18 @@ static int virtio_crypto_skcipher_setkey(struct crypto_skcipher *tfm,
int err;
unsigned long flags;
struct scatterlist outhdr, iv_sg, status_sg, **sgs;
-   int i;
u64 dst_len;
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+   struct scatterlist *sg;
 
src_nents = sg_nents_for_len(req->src, req->cryptlen);
+   if (src_nents < 0) {
+   pr_err("Invalid number of src SG.\n");
+   return src_nents;
+   }
+
dst_nents = sg_nents(req->dst);
 
pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: 
%d)\n",
@@ -442,12 +447,12 @@ static int virtio_crypto_skcipher_setkey(struct crypto_skcipher *tfm,
vc_sym_req->iv = iv;
 
/* Source data */
-   for (i = 0; i < src_nents; i++)
-   sgs[num_out++] = &req->src[i];
+   for (sg = req->src; src_nents; sg = sg_next(sg), src_nents--)
+   sgs[num_out++] = sg;
 
/* Destination data */
-   for (i = 0; i < dst_nents; i++)
-   sgs[num_out + num_in++] = &req->dst[i];
+   for (sg = req->dst; sg; sg = sg_next(sg))
+   sgs[num_out + num_in++] = sg;
 
/* Status */
sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
-- 
1.8.3.1



[PATCH v3 2/3] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-06-02 Thread Longpeng(Mike)
The system will crash when users insmod crypto/tcrypt.ko with mode=155
(testing "authenc(hmac(sha1),cbc(aes))"). It is caused by reusing the memory
of the request structure.

In crypto_authenc_init_tfm(), the reqsize is set to:
  [PART 1] sizeof(authenc_request_ctx) +
  [PART 2] ictx->reqoff +
  [PART 3] MAX(ahash part, skcipher part)
and the 'PART 3' is used by both ahash and skcipher in turn.

When the virtio_crypto driver finishes a skcipher request, it calls the
->complete callback (in crypto_finalize_skcipher_request) and then frees its
resources, whose pointers are recorded in the 'skcipher part'.

However, the ->complete callback is 'crypto_authenc_encrypt_done' in this case;
it uses the 'ahash part' of the request and changes its content, so the
virtio_crypto driver will read wrong pointers after ->complete finishes and
mistakenly free someone else's memory. The system will then crash when that
memory is used again.

The resources which need to be cleaned up are not used any more. But the
pointers of these resources may be changed in the function
"crypto_finalize_skcipher_request". Thus release specific resources before
calling this function.
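
A minimal sketch of the resulting ordering (illustrative only; 'my_driver_req'
and my_driver_clear_request() are hypothetical stand-ins for the driver's
private request state):

	/*
	 * Once crypto_finalize_skcipher_request() runs, ->complete may reuse
	 * the request memory (the shared 'PART 3' above), so driver-owned
	 * resources must be released before handing the request back.
	 */
	static void finalize_req_safely(struct crypto_engine *engine,
					struct skcipher_request *req,
					struct my_driver_req *drv_req, int err)
	{
		kzfree(drv_req->iv);			/* driver-owned IV copy */
		my_driver_clear_request(drv_req);	/* hypothetical cleanup */

		/* only now let the engine call ->complete on the request */
		crypto_finalize_skcipher_request(engine, req, err);
	}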

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin 
Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
Message-Id: <20200123101000.GB24255@Red>
Acked-by: Gonglei 
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index 5f82435..52261b6 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -582,10 +582,11 @@ static void virtio_crypto_skcipher_finalize_req(
scatterwalk_map_and_copy(req->iv, req->dst,
 req->cryptlen - AES_BLOCK_SIZE,
 AES_BLOCK_SIZE, 0);
-   crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
-  req, err);
kzfree(vc_sym_req->iv);
virtcrypto_clear_request(&vc_sym_req->base);
+
+   crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+  req, err);
 }
 
 static struct virtio_crypto_algo virtio_crypto_algs[] = { {
-- 
1.8.3.1



[PATCH v3 0/3] crypto: virtio: Fix three issues

2020-06-02 Thread Longpeng(Mike)
Patch 1 & 2: fix two crash issues, Link: https://lkml.org/lkml/2020/1/23/205
Patch 3: fix another functional issue

Changes since v2:
 - put another bugfix together

Changes since v1:
 - remove some redundant checks [Jason]
 - normalize the commit message [Markus]

Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org

Longpeng(Mike) (3):
  crypto: virtio: Fix src/dst scatterlist calculation in
__virtio_crypto_skcipher_do_req()
  crypto: virtio: Fix use-after-free in
virtio_crypto_skcipher_finalize_req()
  crypto: virtio: Fix dest length calculation in
__virtio_crypto_skcipher_do_req()

 drivers/crypto/virtio/virtio_crypto_algs.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

-- 
1.8.3.1



[PATCH v3 3/3] crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req()

2020-06-02 Thread Longpeng(Mike)
The src/dst length is not aligned with AES_BLOCK_SIZE (which is 16) in some
testcases in tcrypt.ko.

For example, the src/dst length of one of the cts(cbc(aes)) testcases is 17;
the virtio_crypto driver will set @src_data_len=16 but @dst_data_len=17 in this
case and get a wrong result at the end.

  SRC: pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp (17 bytes)
  EXP: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc pp (17 bytes)
  DST: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00 (the last byte is polluted)
  (pp: plaintext  cc: ciphertext)

Fix this issue by limiting the length of the dest buffer.

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index 52261b6..cb8a6ea 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -407,6 +407,7 @@ static int virtio_crypto_skcipher_setkey(struct 
crypto_skcipher *tfm,
goto free;
}
 
+   dst_len = min_t(unsigned int, req->cryptlen, dst_len);
pr_debug("virtio_crypto: src_len: %u, dst_len: %llu\n",
req->cryptlen, dst_len);
 
-- 
1.8.3.1



Re: [PATCH v2 2/2] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-06-01 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



On 2020/5/31 17:21, Michael S. Tsirkin wrote:
> On Tue, May 26, 2020 at 02:11:37PM +, Sasha Levin wrote:
>> <20200123101000.GB24255@Red>
>> References: <20200526031956.1897-3-longpe...@huawei.com>
>> <20200123101000.GB24255@Red>
>>
>> Hi
>>
>> [This is an automated email]
>>
>> This commit has been processed because it contains a "Fixes:" tag
>> fixing commit: dbaf0624ffa5 ("crypto: add virtio-crypto driver").
>>
>> The bot has tested the following trees: v5.6.14, v5.4.42, v4.19.124, 
>> v4.14.181.
>>
>> v5.6.14: Build OK!
>> v5.4.42: Failed to apply! Possible dependencies:
>> eee1d6fca0a0 ("crypto: virtio - switch to skcipher API")
>>
>> v4.19.124: Failed to apply! Possible dependencies:
>> eee1d6fca0a0 ("crypto: virtio - switch to skcipher API")
>>
>> v4.14.181: Failed to apply! Possible dependencies:
>> 500e6807ce93 ("crypto: virtio - implement missing support for output 
>> IVs")
>> 67189375bb3a ("crypto: virtio - convert to new crypto engine API")
>> d0d859bb87ac ("crypto: virtio - Register an algo only if it's supported")
>> e02b8b43f55a ("crypto: virtio - pr_err() strings should end with 
>> newlines")
>> eee1d6fca0a0 ("crypto: virtio - switch to skcipher API")
>>
>>
>> NOTE: The patch will not be queued to stable trees until it is upstream.
>>
>> How should we proceed with this patch?
> 
> Mike could you comment on backporting?
> 
Hi Michael,

I will send V3, so I will resolve these conflicts later. :)

>> -- 
>> Thanks
>> Sasha
> 
> .
> 
---
Regards,
Longpeng(Mike)


Re: [v2 2/2] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-05-26 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



On 2020/5/26 17:01, Markus Elfring wrote:
>>>> … Thus release specific resources before
>>>
>>> Is there a need to improve also this information another bit?
>>>
>> You mean the last two paragraph is redundant ?
> 
> No.
> 
> I became curious if you would like to choose a more helpful information
> according to the wording “specific resources”.
> 
> Regards,
> Markus
> 
Hi Markus,

I respect your work, but please let us focus on the code itself. I think
experts in this area will know what these patches aim to solve after looking at
the code.

I hope the experts in this thread can review the code when they are free, thanks :)

---
Regards,
Longpeng(Mike)


Re: [PATCH v2 2/2] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-05-26 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Markus,

On 2020/5/26 15:19, Markus Elfring wrote:
>> The system'll crash when the users insmod crypto/tcrypto.ko with mode=155
>> ( testing "authenc(hmac(sha1),cbc(aes))" ). It's caused by reuse the memory
>> of request structure.
> 
> Wording adjustments:
> * … system will crash …
> * … It is caused by reusing the …
> 
> 
>> when these memory will be used again.
> 
> when this memory …
> 
OK.

> 
>> … Thus release specific resources before
> 
> Is there a need to improve also this information another bit?
> 
You mean the last two paragraph is redundant ?
'''
When the virtio_crypto driver finish skcipher req, it'll call ->complete
callback(in crypto_finalize_skcipher_request) and then free its
resources whose pointers are recorded in 'skcipher parts'.

However, the ->complete is 'crypto_authenc_encrypt_done' in this case,
it will use the 'ahash part' of the request and change its content,
so virtio_crypto driver will get the wrong pointer after ->complete
finish and mistakenly free some other's memory. So the system will crash
when these memory will be used again.

The resources which need to be cleaned up are not used any more. But the
pointers of these resources may be changed in the function
"crypto_finalize_skcipher_request". Thus release specific resources before
calling this function.
'''

How about:
'''
When the virtio_crypto driver finishes the skcipher request, it will call the
function "crypto_finalize_skcipher_request()" and then free the resources whose
pointers are stored in the 'skcipher parts', but the pointers of these resources
may be changed in that function. Thus fix it by releasing these resources
before calling the function "crypto_finalize_skcipher_request()".
'''


> Regards,
> Markus
> 
---
Regards,
Longpeng(Mike)


Re: [PATCH v2 1/2] crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()

2020-05-26 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Markus,

On 2020/5/26 15:03, Markus Elfring wrote:
>> Fix it by sg_next() on calculation of src/dst scatterlist.
> 
> Wording adjustment:
> … by calling the function “sg_next” …
> 
OK, thanks.

> 
> …
>> +++ b/drivers/crypto/virtio/virtio_crypto_algs.c
>> @@ -350,13 +350,18 @@ __virtio_crypto_skcipher_do_req(struct 
>> virtio_crypto_sym_request *vc_sym_req,
> …
>>  src_nents = sg_nents_for_len(req->src, req->cryptlen);
>> +if (src_nents < 0) {
>> +pr_err("Invalid number of src SG.\n");
>> +return src_nents;
>> +}
>> +
>>  dst_nents = sg_nents(req->dst);
> …
> 
> I suggest to move the addition of such input parameter validation
> to a separate update step.
> 
Um... The 'src_nents' will be used as a loop condition, so validating it here
is needed, isn't it?

'''
/* Source data */
-   for (i = 0; i < src_nents; i++)
-   sgs[num_out++] = &req->src[i];
+   for (sg = req->src; src_nents; sg = sg_next(sg), src_nents--)
+   sgs[num_out++] = sg;
'''

> Regards,
> Markus
> 

-- 
---
Regards,
Longpeng(Mike)


[PATCH v2 2/2] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-05-25 Thread Longpeng(Mike)
The system'll crash when the users insmod crypto/tcrypto.ko with mode=155
( testing "authenc(hmac(sha1),cbc(aes))" ). It's caused by reuse the memory
of request structure.

In crypto_authenc_init_tfm(), the reqsize is set to:
  [PART 1] sizeof(authenc_request_ctx) +
  [PART 2] ictx->reqoff +
  [PART 3] MAX(ahash part, skcipher part)
and the 'PART 3' is used by both ahash and skcipher in turn.

When the virtio_crypto driver finish skcipher req, it'll call ->complete
callback(in crypto_finalize_skcipher_request) and then free its
resources whose pointers are recorded in 'skcipher parts'.

However, the ->complete is 'crypto_authenc_encrypt_done' in this case,
it will use the 'ahash part' of the request and change its content,
so virtio_crypto driver will get the wrong pointer after ->complete
finish and mistakenly free some other's memory. So the system will crash
when these memory will be used again.

The resources which need to be cleaned up are not used any more. But the
pointers of these resources may be changed in the function
"crypto_finalize_skcipher_request". Thus release specific resources before
calling this function.

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin 
Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: Markus Elfring 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
Message-Id: <20200123101000.GB24255@Red>
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index 5f8243563009..52261b6c247e 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -582,10 +582,11 @@ static void virtio_crypto_skcipher_finalize_req(
scatterwalk_map_and_copy(req->iv, req->dst,
 req->cryptlen - AES_BLOCK_SIZE,
 AES_BLOCK_SIZE, 0);
-   crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
-  req, err);
kzfree(vc_sym_req->iv);
virtcrypto_clear_request(&vc_sym_req->base);
+
+   crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+  req, err);
 }
 
 static struct virtio_crypto_algo virtio_crypto_algs[] = { {
-- 
2.23.0



[PATCH v2 1/2] crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()

2020-05-25 Thread Longpeng(Mike)
The system will crash when the users insmod crypto/tcrypt.ko with mode=38
( testing "cts(cbc(aes))" ).

Usually the next entry of one sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, it could jump to the start of a new
scatterlist array. Fix it by sg_next() on calculation of src/dst
scatterlist.

Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver")
Reported-by: LABBE Corentin 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: Markus Elfring 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
Message-Id: <20200123101000.GB24255@Red>
Signed-off-by: Gonglei 
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index fd045e64972a..5f8243563009 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -350,13 +350,18 @@ __virtio_crypto_skcipher_do_req(struct 
virtio_crypto_sym_request *vc_sym_req,
int err;
unsigned long flags;
struct scatterlist outhdr, iv_sg, status_sg, **sgs;
-   int i;
u64 dst_len;
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+   struct scatterlist *sg;
 
src_nents = sg_nents_for_len(req->src, req->cryptlen);
+   if (src_nents < 0) {
+   pr_err("Invalid number of src SG.\n");
+   return src_nents;
+   }
+
dst_nents = sg_nents(req->dst);
 
pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: 
%d)\n",
@@ -442,12 +447,12 @@ __virtio_crypto_skcipher_do_req(struct 
virtio_crypto_sym_request *vc_sym_req,
vc_sym_req->iv = iv;
 
/* Source data */
-   for (i = 0; i < src_nents; i++)
-   sgs[num_out++] = &req->src[i];
+   for (sg = req->src; src_nents; sg = sg_next(sg), src_nents--)
+   sgs[num_out++] = sg;
 
/* Destination data */
-   for (i = 0; i < dst_nents; i++)
-   sgs[num_out + num_in++] = &req->dst[i];
+   for (sg = req->dst; sg; sg = sg_next(sg))
+   sgs[num_out + num_in++] = sg;
 
/* Status */
sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
-- 
2.23.0



[PATCH v2 0/2] crypto: virtio: Fix two crash issue

2020-05-25 Thread Longpeng(Mike)
Link: https://lkml.org/lkml/2020/1/23/205

Changes since v1:
 - remove some redundant checks [Jason]
 - normalize the commit message [Markus]

Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: Markus Elfring 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org

Longpeng(Mike) (2):
  crypto: virtio: Fix src/dst scatterlist calculation in
__virtio_crypto_skcipher_do_req()
  crypto: virtio: Fix use-after-free in
virtio_crypto_skcipher_finalize_req()

 drivers/crypto/virtio/virtio_crypto_algs.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

-- 
2.23.0



Re: [2/2] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-05-25 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



On 2020/5/25 15:36, Markus Elfring wrote:
>> Could you help me to make the sentence better?
> 
> How do you think about a wording variant like the following?
> 
>   So the system will crash when this memory will be used again.
> 
Uh, it's much better, thanks.

> 
>>> * You proposed to move a call of the function 
>>> “crypto_finalize_skcipher_request”.
>>>   How does this change fit to the mentioned position?
>>>
>> The resources which need to be freed is not used anymore, but the pointers
>> of these resources may be changed in the function
>> "crypto_finalize_skcipher_request", so free these resources before call the
>> function is suitable.
> 
> Another alternative:
>   The resources which need to be cleaned up are not used any more.
>   But the pointers of these resources may be changed in the
>   function “crypto_finalize_skcipher_request”.
>   Thus release specific resources before calling this function.
> 
Oh great! Thanks.

> Regards,
> Markus
> 

-- 
---
Regards,
Longpeng(Mike)


Re: [PATCH 2/2] crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()

2020-05-25 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Markus,

On 2020/5/25 14:30, Markus Elfring wrote:
>> … So the system will crash
>> at last when this memory be used again.
> 
> I would prefer a wording with less typos here.
> 
Could you help me to make the sentence better?

> 
>> We can free the resources before calling ->complete to fix this issue.
> 
> * An imperative wording can be nicer.
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=9cb1fd0efd195590b828b9b865421ad345a4a145#n151
> 
I'll try.

> * You proposed to move a call of the function 
> “crypto_finalize_skcipher_request”.
>   How does this change fit to the mentioned position?
> 
The resources which need to be freed is not used anymore, but the pointers
of these resources may be changed in the function
"crypto_finalize_skcipher_request", so free these resources before call the
function is suitable.

> * Would you like to add the tag “Fixes” to the commit message?
>
OK.

> Regards,
> Markus
> 

-- 
---
Regards,
Longpeng(Mike)


Re: [PATCH 1/2] crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()

2020-05-25 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Markus,

On 2020/5/25 14:05, Markus Elfring wrote:
>> The system will crash when we insmod crypto/tcrypt.ko whit mode=38.
> 
> * I suggest to use the word “with” in this sentence.
> 
OK, it's a typo.

> * Will it be helpful to explain the passed mode number?
> 
> 
>> BTW I add a check for sg_nents_for_len() its return value since
>> sg_nents_for_len() function could fail.
> 
> Please reconsider also development consequences for this suggestion.
> Will a separate update step be more appropriate for the addition of
> an input parameter validation?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=9cb1fd0efd195590b828b9b865421ad345a4a145#n138
> 
> Would you like to add the tag “Fixes” to the commit message?
>
Will take all of your suggestions in v2, thanks.

> Regards,
> Markus
> 

-- 
---
Regards,
Longpeng(Mike)


Re: [PATCH 1/2] crypto: virtio: fix src/dst scatterlist calculation

2020-05-24 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Jason,

On 2020/5/25 11:12, Jason Wang wrote:
> 
> On 2020/5/25 上午8:56, Longpeng(Mike) wrote:
>> The system will crash when we insmod crypto/tcrypt.ko whit mode=38.
>>
>> Usually the next entry of one sg will be @sg@ + 1, but if this sg element
>> is part of a chained scatterlist, it could jump to the start of a new
>> scatterlist array. Let's fix it by sg_next() on calculation of src/dst
>> scatterlist.
>>
>> BTW I add a check for sg_nents_for_len() its return value since
>> sg_nents_for_len() function could fail.
>>
>> Cc: Gonglei 
>> Cc: Herbert Xu 
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jason Wang 
>> Cc: "David S. Miller" 
>> Cc: virtualizat...@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
>>
>> Reported-by: LABBE Corentin 
>> Signed-off-by: Gonglei 
>> Signed-off-by: Longpeng(Mike) 
>> ---
>>   drivers/crypto/virtio/virtio_crypto_algs.c | 14 ++
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c
>> b/drivers/crypto/virtio/virtio_crypto_algs.c
>> index 372babb44112..2fa1129f96d6 100644
>> --- a/drivers/crypto/virtio/virtio_crypto_algs.c
>> +++ b/drivers/crypto/virtio/virtio_crypto_algs.c
>> @@ -359,8 +359,14 @@ __virtio_crypto_skcipher_do_req(struct
>> virtio_crypto_sym_request *vc_sym_req,
>>   unsigned int num_out = 0, num_in = 0;
>>   int sg_total;
>>   uint8_t *iv;
>> +    struct scatterlist *sg;
>>     src_nents = sg_nents_for_len(req->src, req->cryptlen);
>> +    if (src_nents < 0) {
>> +    pr_err("Invalid number of src SG.\n");
>> +    return src_nents;
>> +    }
>> +
>>   dst_nents = sg_nents(req->dst);
>>     pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: 
>> %d)\n",
>> @@ -446,12 +452,12 @@ __virtio_crypto_skcipher_do_req(struct
>> virtio_crypto_sym_request *vc_sym_req,
>>   vc_sym_req->iv = iv;
>>     /* Source data */
>> -    for (i = 0; i < src_nents; i++)
>> -    sgs[num_out++] = &req->src[i];
>> +    for (sg = req->src, i = 0; sg && i < src_nents; sg = sg_next(sg), i++)
> 
> 
> Any reason sg is checked here?
> 
> I believe it should be checked in sg_nents_for_len().
> 
Do you mean:
for (sg = req->src, i = 0; i < src_nents; sg = sg_next(sg), i++) ?

> 
>> +    sgs[num_out++] = sg;
>>     /* Destination data */
>> -    for (i = 0; i < dst_nents; i++)
>> -    sgs[num_out + num_in++] = &req->dst[i];
>> +    for (sg = req->dst, i = 0; sg && i < dst_nents; sg = sg_next(sg), i++)
>> +    sgs[num_out + num_in++] = sg;
> 
> 
> I believe sg should be checked in sg_nents().
>
How about
for (sg = req->dst; sg; sg = sg_next(sg)) ?

> Thanks
> 
> 
>>     /* Status */
>>   sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
> 
> .
> 

-- 
---
Regards,
Longpeng(Mike)


[PATCH 2/2] crypto: virtio: fix an memory use-after-free bug

2020-05-24 Thread Longpeng(Mike)
The system'll crash when we insmod crypto/tcrypto.ko with mode=155.

After dig into this case, I find it's caused by reuse the request
memory.

In crypto_authenc_init_tfm, we'll set the reqsize to:
  [PART 1]sizeof(authenc_request_ctx) +
  [PART 2]ictx->reqoff +
  [PART 3]MAX(ahash part, skcipher part)
and the 'PART 3' will be used by both ahash and skcipher.

When virtio_crypto driver finish skcipher req, it'll call ->complete
callback(in crypto_finalize_skcipher_request) and then free its
resources which pointers are recorded in 'skcipher parts'.

However, the ->complete is 'crypto_authenc_encrypt_done' in this case,
it will use the 'ahash part' of the request and change its content,
so virtio_crypto driver will get the wrong pointer after ->complete
finish and mistakenly free some other memory. So the system will crash
at last when this memory be used again.

We can free the resources before calling ->complete to fix this issue.

Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org

Reported-by: LABBE Corentin 
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index 2fa1129f96d6..3800356fb764 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -587,10 +587,11 @@ static void virtio_crypto_skcipher_finalize_req(
scatterwalk_map_and_copy(req->iv, req->dst,
 req->cryptlen - AES_BLOCK_SIZE,
 AES_BLOCK_SIZE, 0);
-   crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
-  req, err);
kzfree(vc_sym_req->iv);
virtcrypto_clear_request(&vc_sym_req->base);
+
+   crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+  req, err);
 }
 
 static struct virtio_crypto_algo virtio_crypto_algs[] = { {
-- 
2.17.1



[PATCH 0/2] crypto: virtio: fix two crash issue

2020-05-24 Thread Longpeng(Mike)
Link: https://lkml.org/lkml/2020/1/23/205

Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org

Longpeng(Mike) (2):
  crypto: virtio: fix src/dst scatterlist calculation
  crypto: virtio: fix an memory use-after-free bug

 drivers/crypto/virtio/virtio_crypto_algs.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH 1/2] crypto: virtio: fix src/dst scatterlist calculation

2020-05-24 Thread Longpeng(Mike)
The system will crash when we insmod crypto/tcrypt.ko whit mode=38.

Usually the next entry of one sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, it could jump to the start of a new
scatterlist array. Let's fix it by sg_next() on calculation of src/dst
scatterlist.

BTW I add a check for sg_nents_for_len() its return value since
sg_nents_for_len() function could fail.

Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org

Reported-by: LABBE Corentin 
Signed-off-by: Gonglei 
Signed-off-by: Longpeng(Mike) 
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index 372babb44112..2fa1129f96d6 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -359,8 +359,14 @@ __virtio_crypto_skcipher_do_req(struct 
virtio_crypto_sym_request *vc_sym_req,
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+   struct scatterlist *sg;
 
src_nents = sg_nents_for_len(req->src, req->cryptlen);
+   if (src_nents < 0) {
+   pr_err("Invalid number of src SG.\n");
+   return src_nents;
+   }
+
dst_nents = sg_nents(req->dst);
 
pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: 
%d)\n",
@@ -446,12 +452,12 @@ __virtio_crypto_skcipher_do_req(struct 
virtio_crypto_sym_request *vc_sym_req,
vc_sym_req->iv = iv;
 
/* Source data */
-   for (i = 0; i < src_nents; i++)
-   sgs[num_out++] = &req->src[i];
+   for (sg = req->src, i = 0; sg && i < src_nents; sg = sg_next(sg), i++)
+   sgs[num_out++] = sg;
 
/* Destination data */
-   for (i = 0; i < dst_nents; i++)
-   sgs[num_out + num_in++] = &req->dst[i];
+   for (sg = req->dst, i = 0; sg && i < dst_nents; sg = sg_next(sg), i++)
+   sgs[num_out + num_in++] = sg;
 
/* Status */
sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
-- 
2.17.1



[PATCH] virtio_pci: fix a NULL pointer reference in vp_del_vqs

2019-03-08 Thread Longpeng(Mike)
From: Longpeng 

If the allocation of msix_affinity_masks fails, then we'll
try to free some resources in vp_free_vectors() that may
access it directly and dereference a NULL pointer.
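
A simplified sketch of the failing path implied by the stack below (condensed
and illustrative, not the exact code):

	static int vp_request_msix_vectors(struct virtio_pci_device *vp_dev,
					   int nvectors, bool per_vq_vectors)
	{
		vp_dev->msix_affinity_masks =
			kcalloc(nvectors, sizeof(*vp_dev->msix_affinity_masks),
				GFP_KERNEL);
		if (!vp_dev->msix_affinity_masks)
			goto error;	/* the array stays NULL ...            */

		/* ... later failures also jump to the common error label ... */
		return 0;

	error:
		vp_free_vectors(vp_dev); /* ... but the cleanup walks the array */
		return -ENOMEM;
	}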

We met the following stack in our production:
[   29.296767] BUG: unable to handle kernel NULL pointer dereference at  (null)
[   29.311151] IP: [] vp_free_vectors+0x6a/0x150 [virtio_pci]
[   29.324787] PGD 0
[   29.333224] Oops:  [#1] SMP
[...]
[   29.425175] RIP: 0010:[]  [] 
vp_free_vectors+0x6a/0x150 [virtio_pci]
[   29.441405] RSP: 0018:9a55c2dcfa10  EFLAGS: 00010206
[   29.453491] RAX:  RBX: 9a55c322c400 RCX: 
[   29.467488] RDX:  RSI:  RDI: 9a55c322c400
[   29.481461] RBP: 9a55c2dcfa20 R08:  R09: c1b6806ff020
[   29.495427] R10: 0e95 R11: 00aa R12: 
[   29.509414] R13: 0001 R14: 9a55bd2d9e98 R15: 9a55c322c400
[   29.523407] FS:  7fdcba69f8c0() GS:9a55c284() 
knlGS:
[   29.538472] CS:  0010 DS:  ES:  CR0: 80050033
[   29.551621] CR2:  CR3: 3ce52000 CR4: 003607a0
[   29.565886] DR0:  DR1:  DR2: 
[   29.580055] DR3:  DR6: fffe0ff0 DR7: 0400
[   29.594122] Call Trace:
[   29.603446]  [] vp_request_msix_vectors+0xe2/0x260 
[virtio_pci]
[   29.618017]  [] vp_try_to_find_vqs+0x95/0x3b0 [virtio_pci]
[   29.632152]  [] vp_find_vqs+0x37/0xb0 [virtio_pci]
[   29.645582]  [] init_vq+0x153/0x260 [virtio_blk]
[   29.658831]  [] virtblk_probe+0xe8/0x87f [virtio_blk]
[...]

Cc: Gonglei 
Signed-off-by: Longpeng 
---
 drivers/virtio/virtio_pci_common.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index d0584c0..7a0398b 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -255,9 +255,11 @@ void vp_del_vqs(struct virtio_device *vdev)
for (i = 0; i < vp_dev->msix_used_vectors; ++i)
free_irq(pci_irq_vector(vp_dev->pci_dev, i), vp_dev);
 
-   for (i = 0; i < vp_dev->msix_vectors; i++)
-   if (vp_dev->msix_affinity_masks[i])
-   free_cpumask_var(vp_dev->msix_affinity_masks[i]);
+   if (vp_dev->msix_affinity_masks) {
+   for (i = 0; i < vp_dev->msix_vectors; i++)
+   if (vp_dev->msix_affinity_masks[i])
+   free_cpumask_var(vp_dev->msix_affinity_masks[i]);
+   }
 
if (vp_dev->msix_enabled) {
/* Disable the vector used for configuration */
-- 
1.8.3.1




[ RFC ] Set quota on VM cause large schedule latency of vcpu

2018-07-17 Thread Longpeng (Mike)
A virtual machine has a cgroup hierarchy as follows:

                 root
                  |
                vm_tg
               (cfs_rq)
               /      \
            (se)      (se)
            tg_A      tg_B
          (cfs_rq)  (cfs_rq)
            /            \
          (se)          (se)
           a              b

'a' and 'b' are two vcpus of the VM.

We set a cfs quota on vm_tg, and the scheduling latency of the vcpus (a/b) may
become very large, up to more than 2 seconds.
We use perf sched to capture the latency ( perf sched record -a sleep 10;
perf sched lat -p --sort=max ) and the result is as follow:

Task | Runtime ms | Switches | Average delay ms | Maximum delay ms |

CPU 0/KVM| 260.261 ms |   50 | avg:   82.017 ms | max: 2510.990 ms |
...

We tested the latest kernel and the result is the same.
We added some tracepoints and found that the following sequence causes the issue:

1) 'a' is the only task of tg_A; when 'a' goes to sleep (e.g. vcpu halt), tg_A is
dequeued, and tg_A->se->load.weight = MIN_SHARES.

2) 'b' continues running and then triggers throttling. tg_A->cfs_rq->throttle_count=1

3) Something wakes up 'a' (e.g. the vcpu receives a virq). When tg_A is enqueued,
tg_A->se->load.weight can't be updated because tg_A->cfs_rq->throttle_count=1

4) After one cfs quota period, vm_tg is unthrottled

5) 'a' is running

6) After one tick, when tg_A->se's vruntime is updated, tg_A->se->load.weight is
still MIN_SHARES, so tg_A->se's vruntime grows by a large value (see the sketch
below).

7) That causes 'a' to have a large scheduling latency.
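
Roughly speaking (an approximation of the weighting in calc_delta_fair(), not
the exact fixed-point math the kernel does in __calc_delta()), the vruntime
charged for a slice of runtime is inversely proportional to the entity's
weight, so a stale MIN_SHARES weight inflates it enormously:

	/* Approximation only: the kernel uses precomputed inverse weights,
	 * but the proportionality is the same.
	 */
	static u64 approx_vruntime_delta(u64 delta_exec_ns, unsigned long weight)
	{
		return div64_ul(delta_exec_ns * NICE_0_LOAD, weight);
	}

With a sane group weight, 1ms of runtime costs roughly 1ms of vruntime; with
weight == MIN_SHARES it costs about NICE_0_LOAD / MIN_SHARES times more, so
after unthrottle 'a' looks far "ahead" of its siblings and is not picked again
for a long time, which matches the multi-second latency above.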


We *rudely* removed the check which prevents tg_A->se->load.weight from being
reweighted in step 3, as follows, and the problem disappears:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be..348ccd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3016,9 +3016,6 @@ static void update_cfs_group(struct sched_entity *se)
if (!gcfs_rq)
return;

-   if (throttled_hierarchy(gcfs_rq))
-   return;
-
#ifndef CONFIG_SMP
runnable = shares = READ_ONCE(gcfs_rq->tg->shares);


So do you guys have any suggestions on this problem? Is there a better way to
fix this problem?

-- 
Regards,
Longpeng(Mike)



[PATCH] kvm: x86: remove efer_reload entry in kvm_vcpu_stat

2018-01-26 Thread Longpeng(Mike)
The efer_reload is never used since
commit 26bb0981b3ff ("KVM: VMX: Use shared msr infrastructure"),
so remove it.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/x86.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5167984..b24b34d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -883,7 +883,6 @@ struct kvm_vcpu_stat {
u64 request_irq_exits;
u64 irq_exits;
u64 host_state_reload;
-   u64 efer_reload;
u64 fpu_reload;
u64 insn_emulation;
u64 insn_emulation_fail;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c53298d..6573526 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -177,7 +177,6 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "request_irq", VCPU_STAT(request_irq_exits) },
{ "irq_exits", VCPU_STAT(irq_exits) },
{ "host_state_reload", VCPU_STAT(host_state_reload) },
-   { "efer_reload", VCPU_STAT(efer_reload) },
{ "fpu_reload", VCPU_STAT(fpu_reload) },
{ "insn_emulation", VCPU_STAT(insn_emulation) },
{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
-- 
1.8.3.1




Re: [PATCH 3/8] kvm: vmx: pass MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD down to the guest

2018-01-13 Thread Longpeng (Mike)


On 2018/1/9 20:03, Paolo Bonzini wrote:

> Direct access to MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD is important
> for performance.  Allow load/store of MSR_IA32_SPEC_CTRL, restore guest
> IBRS on VM entry and set it to 0 on VM exit (because Linux does not use
> it yet).
> 
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  arch/x86/kvm/vmx.c | 42 ++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 669f5f74857d..ef603692aa98 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -120,6 +120,8 @@
>  module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
>  #endif
>  
> +static bool __read_mostly have_spec_ctrl;
> +
>  #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
>  #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
>  #define KVM_VM_CR0_ALWAYS_ON \
> @@ -609,6 +611,8 @@ struct vcpu_vmx {
>   u64   msr_host_kernel_gs_base;
>   u64   msr_guest_kernel_gs_base;
>  #endif
> + u64   spec_ctrl;
> +
>   u32 vm_entry_controls_shadow;
>   u32 vm_exit_controls_shadow;
>   u32 secondary_exec_control;
> @@ -3361,6 +3365,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   case MSR_IA32_TSC:
>   msr_info->data = guest_read_tsc(vcpu);
>   break;
> + case MSR_IA32_SPEC_CTRL:
> + msr_info->data = to_vmx(vcpu)->spec_ctrl;
> + break;
>   case MSR_IA32_SYSENTER_CS:
>   msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
>   break;
> @@ -3500,6 +3507,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   case MSR_IA32_TSC:
>   kvm_write_tsc(vcpu, msr_info);
>   break;
> + case MSR_IA32_SPEC_CTRL:
> + to_vmx(vcpu)->spec_ctrl = data;
> + break;
>   case MSR_IA32_CR_PAT:
>   if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>   if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -7062,6 +7072,17 @@ static __init int hardware_setup(void)
>   goto out;
>   }
>  
> + /*
> +  * FIXME: this is only needed until SPEC_CTRL is supported
> +  * by upstream Linux in cpufeatures, then it can be replaced
> +  * with static_cpu_has.
> +  */
> + have_spec_ctrl = cpu_has_spec_ctrl();
> + if (have_spec_ctrl)
> + pr_info("kvm: SPEC_CTRL available\n");
> + else
> + pr_info("kvm: SPEC_CTRL not available\n");
> +

With this approach, must we reload these modules if we update the microcode
later?

>   if (boot_cpu_has(X86_FEATURE_NX))
>   kvm_enable_efer_bits(EFER_NX);
>  
> @@ -7131,6 +7152,8 @@ static __init int hardware_setup(void)
>   vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
>   vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
>   vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
> + vmx_disable_intercept_for_msr(MSR_IA32_SPEC_CTRL, false);
> + vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false);
>  
>   memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
>   vmx_msr_bitmap_legacy, PAGE_SIZE);
> @@ -9601,6 +9624,13 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>  
>   vmx_arm_hv_timer(vcpu);
>  
> + /*
> +  * MSR_IA32_SPEC_CTRL is restored after the last indirect branch
> +  * before vmentry.
> +  */
> + if (have_spec_ctrl && vmx->spec_ctrl != 0)
> + wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +
>   vmx->__launched = vmx->loaded_vmcs->launched;
>   asm(
>   /* Store host registers */
> @@ -9707,6 +9737,18 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>  #endif
> );
>  
> + if (have_spec_ctrl) {
> + rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> + if (vmx->spec_ctrl != 0)
> + wrmsrl(MSR_IA32_SPEC_CTRL, 0);
> + }
> + /*
> +  * Speculative execution past the above wrmsrl might encounter
> +  * an indirect branch and use guest-controlled contents of the
> +  * indirect branch predictor; block it.
> +  */
> + asm("lfence");
> +
>   /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
>   if (vmx->host_debugctlmsr)
>   update_debugctlmsr(vmx->host_debugctlmsr);


-- 
Regards,
Longpeng(Mike)



Re: [PATCH CFT 0/4] VT-d PI fixes

2017-09-21 Thread Longpeng (Mike)
Hi Paolo,

We have backported the first three patches and have tested them for about 20
days; they work fine.

So could you consider merging this series?

-- 
Regards,
Longpeng(Mike)

On 2017/7/11 17:16, Gonglei (Arei) wrote:

> 
> 
>> -Original Message-
>> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
>> Behalf Of Paolo Bonzini
>> Sent: Tuesday, July 11, 2017 4:56 PM
>> To: Gonglei (Arei)
>> Cc: linux-kernel@vger.kernel.org; k...@vger.kernel.org; longpeng;
>> Huangweidong (C); wangxin (U); Radim Krčmář
>> Subject: Re: [PATCH CFT 0/4] VT-d PI fixes
>>
>> On 07/06/2017 11:33, Gonglei (Arei) wrote:
>>> We are testing your patch, but maybe need some time to report
>>> the results because it's not an inevitable problem.
>>>
>>> Meanwhile we also try to find a possible scenario of non-hotplugging to
>>> explain the double-add warnings.
>>
>> Hi Lei,
>>
>> do you have any updates?  
> 
> Dear Paolo,
> 
> Thanks for kicking me :) 
> 
> TBH, thinking about the reliability of productive project (we use kvm-4.4),
> we applied the patch you used in fedora pastebin, and
> the bug seems gone after one month's testing.
> 
> diff --git a/source/x86/vmx.c b/source/x86/vmx.c
> index 79012cf..efc611f 100644
> --- a/source/x86/vmx.c
> +++ b/source/x86/vmx.c
> @@ -11036,8 +11036,9 @@ static void pi_post_block(struct kvm_vcpu *vcpu)
> unsigned int dest;
> unsigned long flags;
>  
> -   if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> -   !irq_remapping_cap(IRQ_POSTING_CAP))
> +   if ((vcpu->pre_pcpu == -1) &&
> +   (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> +   !irq_remapping_cap(IRQ_POSTING_CAP)))
> return;
> 
>> I would like to get at least the first three
>> patches in 4.13.
>>
> I think they are okay to me for upstream.
> 
> Thanks,
> -Gonglei
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH] KVM: VMX: add encapsulation kvm_vcpu_pi_need_handle

2017-09-21 Thread Longpeng (Mike)
Hi Peng,

There are two bugs in the current code and Paolo has already fixed them;
please see:

-- 
Regards,
Longpeng(Mike)

On 2017/9/21 23:14, Peng Hao wrote:

> use kvm_vcpu_pi_need_handle encapsulation simply coede
> 
> Signed-off-by: Peng Hao <peng.h...@zte.com.cn>
> ---
>  arch/x86/kvm/vmx.c | 27 ---
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4253ade..26b99f4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -559,6 +559,13 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
>   (unsigned long *)&pi_desc->control);
>  }
>  
> +static inline bool kvm_vcpu_pi_need_handle(struct kvm_vcpu *vcpu)
> +{
> + return kvm_arch_has_assigned_device(vcpu->kvm) &&
> + irq_remapping_cap(IRQ_POSTING_CAP)  &&
> + kvm_vcpu_apicv_active(vcpu);
> +}
> +
>  struct vcpu_vmx {
>   struct kvm_vcpu   vcpu;
>   unsigned long host_rsp;
> @@ -2202,9 +2209,7 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int 
> cpu)
>   struct pi_desc old, new;
>   unsigned int dest;
>  
> - if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP)  ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!kvm_vcpu_pi_need_handle(vcpu))
>   return;
>  
>   do {
> @@ -2323,9 +2328,7 @@ static void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>  {
>   struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> - if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP)  ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!kvm_vcpu_pi_need_handle(vcpu))
>   return;
>  
>   /* Set SN when the vCPU is preempted */
> @@ -11691,9 +11694,7 @@ static int pi_pre_block(struct kvm_vcpu *vcpu)
>   struct pi_desc old, new;
>   struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> - if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP)  ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!kvm_vcpu_pi_need_handle(vcpu))
>   return 0;
>  
>   vcpu->pre_pcpu = vcpu->cpu;
> @@ -11769,9 +11770,7 @@ static void pi_post_block(struct kvm_vcpu *vcpu)
>   unsigned int dest;
>   unsigned long flags;
>  
> - if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP)  ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!kvm_vcpu_pi_need_handle(vcpu))
>   return;
>  
>   do {
> @@ -11831,9 +11830,7 @@ static int vmx_update_pi_irte(struct kvm *kvm, 
> unsigned int host_irq,
>   struct vcpu_data vcpu_info;
>   int idx, ret = -EINVAL;
>  
> - if (!kvm_arch_has_assigned_device(kvm) ||
> - !irq_remapping_cap(IRQ_POSTING_CAP) ||
> - !kvm_vcpu_apicv_active(kvm->vcpus[0]))
> + if (!kvm_vcpu_pi_need_handle(kvm->vcpus[0]))
>   return 0;
>  
>   idx = srcu_read_lock(&kvm->irq_srcu);


-- 
Regards,
Longpeng(Mike)



[RESEND] Question about the userfaultfd write-protect support

2017-09-11 Thread Longpeng (Mike)
(Add Zhanghailiang and Gonglei)


Hi Andrea,

We've implemented a demo of KVM live memory snapshot based on the userfaultfd
write-protect series in your private
tree (https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/). We
did a little hack on that series to make the demo work.

Zhang discussed this with you one or two years ago ([1][2]), so do you have any
plan to send this series upstream? If you do, what is the time schedule? We
really hope the userfaultfd write-protect series will be merged.

References:
[1] https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00664.html
[2] https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02891.html

-- 
Regards,
Longpeng(Mike)



Question about the userfaultfd write-protect support

2017-09-11 Thread Longpeng (Mike)
Hi Andrea,

We've implemented a demo of KVM live memory snapshot based on the userfaultfd
write-protect series in your private
tree (https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/). We
did a little hack on that series to make the demo work.

Zhang discussed this with you one or two years ago ([1][2]), so do you have any
plan to send this series upstream? If you do, what is the time schedule? We
really hope the userfaultfd write-protect series will be merged. :)

References:
[1] https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00664.html
[2] https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02891.html

-- 
Regards,
Longpeng(Mike)



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-10 Thread Longpeng (Mike)


On 2017/8/10 21:18, Eric Farman wrote:

> 
> 
> On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:
>>
>>
>> On 2017/8/8 15:41, Cornelia Huck wrote:
>>
>>> On Tue, 8 Aug 2017 12:05:31 +0800
>>> "Longpeng(Mike)" <longpe...@huawei.com> wrote:
>>>
>>>> This is a simple optimization for kvm_vcpu_on_spin, the
>>>> main idea is described in patch-1's commit msg.
>>>
>>> I think this generally looks good now.
>>>
>>>>
>>>> I did some tests base on the RFC version, the result shows
>>>> that it can improves the performance slightly.
>>>
>>> Did you re-run tests on this version?
>>
>>
>> Hi Cornelia,
>>
>> I didn't re-run tests on V2. But the major difference between RFC and V2
>> is that V2 only cache result for X86 (s390/arm needn't) and V2 saves a
>> expensive operation ( 440-1400 cycles on my test machine ) for X86/VMX.
>>
>> So I think V2's performance is at least the same as RFC or even slightly
>> better. :)
>>
>>>
>>> I would also like to see some s390 numbers; unfortunately I only have a
>>> z/VM environment and any performance numbers would be nearly useless
>>> there. Maybe somebody within IBM with a better setup can run a quick
>>> test?
> 
> Won't swear I didn't screw something up, but here's some quick numbers. Host 
> was
> 4.12.0 with and without this series, running QEMU 2.10.0-rc0. Created 4 
> guests,
> each with 4 CPU (unpinned) and 4GB RAM.  VM1 did full kernel compiles with
> kernbench, which took averages of 5 runs of different job sizes (I threw away
> the "-j 1" numbers). VM2-VM4 ran cpu burners on 2 of their 4 cpus.
> 
> Numbers from VM1 kernbench output, and the delta between runs:
> 
> load -j 3           before     after     delta
> Elapsed Time        183.178    182.58    -0.598
> User Time           534.19     531.52    -2.67
> System Time         32.538     33.37     0.832
> Percent CPU         308.8      309       0.2
> Context Switches    98484.6    99001     516.4
> Sleeps              227347     228752    1405
> 
> load -j 16          before     after     delta
> Elapsed Time        153.352    147.59    -5.762
> User Time           545.829    533.41    -12.419
> System Time         34.289     34.85     0.561
> Percent CPU         347.6      348       0.4
> Context Switches    160518     159120    -1398
> Sleeps              240740     240536    -204
> 


Thanks Eric!

The `Elapsed Time` is smaller with this series; the result is consistent with
my numbers in the cover letter.

> 
>  - Eric
> 
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH] KVM: X86: expand ->arch.apic_arb_prio to u64

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 21:57, Paolo Bonzini wrote:

> On 08/08/2017 15:50, Longpeng (Mike) wrote:
>>
>>
>> On 2017/8/8 21:08, Paolo Bonzini wrote:
>>
>>> On 08/08/2017 13:37, Longpeng(Mike) wrote:
>>>> Currently 'apic_arb_prio' is int32_t, it's too short for long
>>>> time running. In our environment, it overflowed and then the
>>>> UBSAN was angry:
>>>>
>>>> signed integer overflow:
>>>> 2147483647 + 1 cannot be represented in type 'int'
>>>> CPU: 22 PID: 31237 Comm: qemu-kvm Tainted: ...
>>>> ...
>>>> Call Trace:
>>>>  [] dump_stack+0x1e/0x20
>>>>  [] ubsan_epilogue+0x12/0x55
>>>>  [] handle_overflow+0x1ba/0x215
>>>>  [] __ubsan_handle_add_overflow+0x2a/0x31
>>>>  [] __apic_accept_irq+0x57a/0x5d0 [kvm]
>>>>  [] kvm_apic_set_irq+0x9f/0xf0 [kvm]
>>>>  [] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
>>>>  [] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
>>>>  [] kvm_set_msi+0xa9/0x100 [kvm]
>>>>  [] kvm_send_userspace_msi+0x14d/0x1f0 [kvm]
>>>>  [] kvm_vm_ioctl+0x4ee/0xdd0 [kvm]
>>>> ...
>>>>
>>>> We expand it to u64, this is large enough. Suppose the vcpu receives
>>>> 1000 irqs per second, then it won't overflow in 584942417 years.
>>>> ( 18446744073709551615/1000/3600/24/365 = 584942417 )
>>>
>>> Since you only look at the difference, changing it to uint32_t should be
>>> enough.
>>
>>
>> Hi Paolo,
>>
>> I'm afraid uint32_t isn't enough. For 1000 irqs per second, it can only holds
>> 49 days ( although the overflow won't cause any corruption ).
> 
> What matters is only the difference across 2 vCPUs.
> 
> And in fact even 32 bits are probably too many, 16 or even 8 should be
> enough because overflowing arb_prio is a good thing.  If you have
> delivered millions IRQs to VCPU0 (let's say for a day), and then switch
> the interrupt to VCPU1, you don't want to the next day to have
> interrupts going to VCPU1 only.  A short warm-up time (a few seconds?)
> is acceptable, but then you should have interrupts distributed equally
> between VCPU0 and VCPU1.  This can only happen if arb_prio overflows.
> 


I understand now, thanks for your patience. :)
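
As a small stand-alone sketch (user-space C with toy helper names, not the KVM
code) of why a narrow wrapping counter still compares meaningfully between two
vCPUs:

#include <stdint.h>
#include <stdio.h>

/* Toy 8-bit per-vCPU "arbitration priority" counter. The absolute value wraps,
 * but the signed difference between two recently-active vCPUs stays meaningful,
 * which is all lowest-priority delivery needs. */
static int toy_compare_prio(uint8_t a, uint8_t b)
{
	return (int8_t)(a - b);	/* < 0: vCPU a handled fewer recent IRQs */
}

int main(void)
{
	uint8_t vcpu0 = 250, vcpu1 = 0;
	int i;

	for (i = 0; i < 10; i++)
		vcpu0++;	/* wraps from 255 back around to 4 */

	/* vcpu0 == 4 after wrapping, vcpu1 == 0: vcpu1 still compares as the
	 * less-loaded target, so a short warm-up is all the wrap costs. */
	printf("%d\n", toy_compare_prio(vcpu1, vcpu0));	/* prints -4 */
	return 0;
}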

-- 
Regards,
Longpeng(Mike)

> Paolo
> 
>> 4294967295/1000/3600/24 = 49
>>
>>>
>>> Paolo
>>>
>>
>>> .
>>>
>>
>>
> 
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH] KVM: X86: expand ->arch.apic_arb_prio to u64

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 21:08, Paolo Bonzini wrote:

> On 08/08/2017 13:37, Longpeng(Mike) wrote:
>> Currently 'apic_arb_prio' is int32_t, it's too short for long
>> time running. In our environment, it overflowed and then the
>> UBSAN was angry:
>>
>> signed integer overflow:
>> 2147483647 + 1 cannot be represented in type 'int'
>> CPU: 22 PID: 31237 Comm: qemu-kvm Tainted: ...
>> ...
>> Call Trace:
>>  [] dump_stack+0x1e/0x20
>>  [] ubsan_epilogue+0x12/0x55
>>  [] handle_overflow+0x1ba/0x215
>>  [] __ubsan_handle_add_overflow+0x2a/0x31
>>  [] __apic_accept_irq+0x57a/0x5d0 [kvm]
>>  [] kvm_apic_set_irq+0x9f/0xf0 [kvm]
>>  [] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
>>  [] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
>>  [] kvm_set_msi+0xa9/0x100 [kvm]
>>  [] kvm_send_userspace_msi+0x14d/0x1f0 [kvm]
>>  [] kvm_vm_ioctl+0x4ee/0xdd0 [kvm]
>> ...
>>
>> We expand it to u64, this is large enough. Suppose the vcpu receives
>> 1000 irqs per second, then it won't overflow in 584942417 years.
>> ( 18446744073709551615/1000/3600/24/365 = 584942417 )
> 
> Since you only look at the difference, changing it to uint32_t should be
> enough.


Hi Paolo,

I'm afraid uint32_t isn't enough. At 1000 irqs per second, it can only hold
about 49 days' worth (although the overflow won't cause any corruption).

4294967295/1000/3600/24 = 49

> 
> Paolo
> 

> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 19:25, David Hildenbrand wrote:

> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>> This is a simple optimization for kvm_vcpu_on_spin, the
>> main idea is described in patch-1's commit msg.
>>
>> I did some tests base on the RFC version, the result shows
>> that it can improves the performance slightly.
>>
>> == Geekbench-3.4.1 ==
>> VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>  running Geekbench-3.4.1 *10 runs*
>> VM2/VM3/VM4: configuration is the same as VM1
>>  stress each vcpu usage (seen by top in guest) to 40%
>>
>> The comparison of each testcase's score:
>> (higher is better)
>>  before  after   improve
>> Integer
>>  single  1176.7  1179.0  0.2%
>>  multi   3459.5  3426.5  -0.9%
>> Float
>>  single  1150.5  1150.9  0.0%
>>  multi   3364.5  3391.9  0.8%
>> Memory(stream)
>>  single  1768.7  1773.1  0.2%
>>  multi   2511.6  2557.2  1.8%
>> Overall
>>  single  1284.2  1286.2  0.2%
>>  multi   3231.4  3238.4  0.2%
>>
>>
>> == kernbench-0.42 ==
>> VM1: 8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>> running "kernbench -n 10"
>> VM2/VM3/VM4: configuration is the same as VM1
>> stress each vcpu usage (seen by top in guest) to 40%
>>
>> The comparison of 'Elapsed Time':
>> (lower is better)
>>           before  after   improve
>> load -j4  12.762  12.751  0.1%
>> load -j32  9.743   8.955  8.1%
>> load -j    9.688   9.229  4.7%
>>
>>
>> Physical Machine:
>>   Architecture:  x86_64
>>   CPU op-mode(s):32-bit, 64-bit
>>   Byte Order:Little Endian
>>   CPU(s):24
>>   On-line CPU(s) list:   0-23
>>   Thread(s) per core:2
>>   Core(s) per socket:6
>>   Socket(s): 2
>>   NUMA node(s):  2
>>   Vendor ID: GenuineIntel
>>   CPU family:6
>>   Model: 45
>>   Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>   Stepping:  7
>>   CPU MHz:   2799.902
>>   BogoMIPS:  5004.67
>>   Virtualization:VT-x
>>   L1d cache: 32K
>>   L1i cache: 32K
>>   L2 cache:  256K
>>   L3 cache:  15360K
>>   NUMA node0 CPU(s):     0-5,12-17
>>   NUMA node1 CPU(s): 6-11,18-23
>>
>> ---
>> Changes since V1:
>>  - split the implementation of s390 & arm. [David]
>>  - refactor the impls according to the suggestion. [Paolo]
>>
>> Changes since RFC:
>>  - only cache result for X86. [David & Cornlia & Paolo]
>>  - add performance numbers. [David]
>>  - impls arm/s390. [Christoffer & David]
>>  - refactor the impls. [me]
>>
>> ---
>> Longpeng(Mike) (4):
>>   KVM: add spinlock optimization framework
>>   KVM: X86: implement the logic for spinlock optimization
>>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>>
>>  arch/arm/kvm/handle_exit.c  |  2 +-
>>  arch/arm64/kvm/handle_exit.c|  2 +-
>>  arch/mips/kvm/mips.c|  6 ++
>>  arch/powerpc/kvm/powerpc.c  |  6 ++
>>  arch/s390/kvm/diag.c|  2 +-
>>  arch/s390/kvm/kvm-s390.c|  6 ++
>>  arch/x86/include/asm/kvm_host.h |  5 +
>>  arch/x86/kvm/hyperv.c   |  2 +-
>>  arch/x86/kvm/svm.c  | 10 +-
>>  arch/x86/kvm/vmx.c  | 16 +++-
>>  arch/x86/kvm/x86.c  | 11 +++
>>  include/linux/kvm_host.h|  3 ++-
>>  virt/kvm/arm/arm.c  |  5 +
>>  virt/kvm/kvm_main.c |  4 +++-
>>  14 files changed, 72 insertions(+), 8 deletions(-)
>>
> 
> I am curious, is there any architecture that allows to trigger
> kvm_vcpu_on_spin(vcpu); while _not_ in kernel mode?


IIUC, x86/SVM will trap to the host on the PAUSE instruction regardless of
whether the vcpu is in kernel mode or user mode.

> 
> I would have guessed that user space should never be allowed to make cpu
> wide decisions (giving up the CPU to the hypervisor).
> 
> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
> only valid from kernel space.


x86/VMX has both "PAUSE exiting" and "PAUSE-loop exiting" (PLE). KVM only uses
PLE, which is, as you said, "only valid from kernel space".

However, "PAUSE exiting" can cause a user-mode vcpu exit too.

> 
> I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu); at all, or is
> "me_in_kernel" basically always true?
> 


-- 
Regards,
Longpeng(Mike)



[PATCH] KVM: X86: expand ->arch.apic_arb_prio to u64

2017-08-08 Thread Longpeng(Mike)
Currently 'apic_arb_prio' is int32_t, it's too short for long
time running. In our environment, it overflowed and then the
UBSAN was angry:

signed integer overflow:
2147483647 + 1 cannot be represented in type 'int'
CPU: 22 PID: 31237 Comm: qemu-kvm Tainted: ...
...
Call Trace:
 [] dump_stack+0x1e/0x20
 [] ubsan_epilogue+0x12/0x55
 [] handle_overflow+0x1ba/0x215
 [] __ubsan_handle_add_overflow+0x2a/0x31
 [] __apic_accept_irq+0x57a/0x5d0 [kvm]
 [] kvm_apic_set_irq+0x9f/0xf0 [kvm]
 [] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
 [] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
 [] kvm_set_msi+0xa9/0x100 [kvm]
 [] kvm_send_userspace_msi+0x14d/0x1f0 [kvm]
 [] kvm_vm_ioctl+0x4ee/0xdd0 [kvm]
...

We expand it to u64, this is large enough. Suppose the vcpu receives
1000 irqs per second, then it won't overflow in 584942417 years.
( 18446744073709551615/1000/3600/24/365 = 584942417 )

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/ioapic.h   | 3 ++-
 arch/x86/kvm/irq_comm.c | 2 +-
 arch/x86/kvm/lapic.c| 6 +++---
 4 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87ac4fb..ce9a5f5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -499,7 +499,7 @@ struct kvm_vcpu_arch {
bool apicv_active;
DECLARE_BITMAP(ioapic_handled_vectors, 256);
unsigned long apic_attention;
-   int32_t apic_arb_prio;
+   u64 apic_arb_prio;
int mp_state;
u64 ia32_misc_enable_msr;
u64 smbase;
diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
index 29ce197..a26deed 100644
--- a/arch/x86/kvm/ioapic.h
+++ b/arch/x86/kvm/ioapic.h
@@ -117,7 +117,8 @@ static inline int ioapic_in_kernel(struct kvm *kvm)
 void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu);
 bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
int short_hand, unsigned int dest, int dest_mode);
-int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2);
+/* Return true if vcpu1's priority is lower */
+bool kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2);
 void kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, int vector,
int trigger_mode);
 int kvm_ioapic_init(struct kvm *kvm);
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 3cc3b2d..03b1487 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -90,7 +90,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
kvm_lapic *src,
if (!kvm_vector_hashing_enabled()) {
if (!lowest)
lowest = vcpu;
-   else if (kvm_apic_compare_prio(vcpu, lowest) < 
0)
+   else if (kvm_apic_compare_prio(vcpu, lowest))
lowest = vcpu;
} else {
__set_bit(i, dest_vcpu_bitmap);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 589dcc1..1e2b1f2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -840,7 +840,7 @@ static inline bool kvm_apic_map_get_dest_lapic(struct kvm 
*kvm,
if (lowest < 0)
lowest = i;
else if (kvm_apic_compare_prio((*dst)[i]->vcpu,
-   (*dst)[lowest]->vcpu) < 0)
+   (*dst)[lowest]->vcpu))
lowest = i;
}
} else {
@@ -1048,9 +1048,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int 
delivery_mode,
return result;
 }
 
-int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
+bool kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 {
-   return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+   return vcpu1->arch.apic_arb_prio < vcpu2->arch.apic_arb_prio;
 }
 
 static bool kvm_ioapic_handles_vector(struct kvm_lapic *apic, int vector)
-- 
1.8.3.1




Re: [PATCH v2 1/4] KVM: add spinlock optimization framework

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 17:00, Paolo Bonzini wrote:

> On 08/08/2017 10:42, David Hildenbrand wrote:
>>
>>> +bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>>> +{
>>> +   return false;
>>> +}
>>
>> why don't we need an EXPORT_SYMBOL here?
> 
> Is it used outside the KVM module?  I think no architecture actually needs
> to export it.
> 


Hi Paolo & David,

In my original approach, I called kvm_arch_vcpu_in_kernel() from handle_pause();
without EXPORT_SYMBOL the build reports:
 ERROR: "kvm_arch_vcpu_in_kernel" [arch/x86/kvm/kvm-intel.ko] undefined!
 ERROR: "kvm_arch_vcpu_in_kernel" [arch/x86/kvm/kvm-amd.ko] undefined!

But Paolo's approach is significantly better; it's a work of art, thanks a lot.

-- 
Regards,
Longpeng(Mike)

>>> -void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>> +void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool me_in_kern)
>>>  {
>>> struct kvm *kvm = me->kvm;
>>> struct kvm_vcpu *vcpu;
>>> @@ -2348,6 +2348,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>> continue;
>>>   if (swait_active(&vcpu->wq) &&
>>> !kvm_arch_vcpu_runnable(vcpu))
>>> continue;
>>> +   if (me_in_kern && !kvm_arch_vcpu_in_kernel(vcpu))
>>> +   continue;
>>
>>
>> hm, does this patch compile? (me_in_kern)
> 
> Why not? :)  This is what I have:
> 
>>From d62a40d49f44ff7e789a15416316ef1cba93fa85 Mon Sep 17 00:00:00 2001
> From: "Longpeng(Mike)" <longpe...@huawei.com>
> Date: Tue, 8 Aug 2017 12:05:32 +0800
> Subject: [PATCH 1/4] KVM: add spinlock optimization framework
> 
> If a vcpu exits due to request a user mode spinlock, then
> the spinlock-holder may be preempted in user mode or kernel mode.
> (Note that not all architectures trap spin loops in user mode,
> only AMD x86 and ARM/ARM64 currently do).
> 
> But if a vcpu exits in kernel mode, then the holder must be
> preempted in kernel mode, so we should choose a vcpu in kernel mode
> as a more likely candidate for the lock holder.
> 
> This introduces kvm_arch_vcpu_in_kernel() to decide whether the
> vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
> new argument says the same of the spinning VCPU.
> 
> Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  arch/arm/kvm/handle_exit.c   | 2 +-
>  arch/arm64/kvm/handle_exit.c | 2 +-
>  arch/mips/kvm/mips.c | 5 +
>  arch/powerpc/kvm/powerpc.c   | 5 +
>  arch/s390/kvm/diag.c | 2 +-
>  arch/s390/kvm/kvm-s390.c | 5 +
>  arch/x86/kvm/hyperv.c| 2 +-
>  arch/x86/kvm/svm.c   | 2 +-
>  arch/x86/kvm/vmx.c   | 2 +-
>  arch/x86/kvm/x86.c   | 5 +
>  include/linux/kvm_host.h | 3 ++-
>  virt/kvm/arm/arm.c   | 5 +
>  virt/kvm/kvm_main.c  | 4 +++-
>  13 files changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
> index 54442e375354..196122bb6968 100644
> --- a/arch/arm/kvm/handle_exit.c
> +++ b/arch/arm/kvm/handle_exit.c
> @@ -67,7 +67,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>   if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE) {
>   trace_kvm_wfx(*vcpu_pc(vcpu), true);
>   vcpu->stat.wfe_exit_stat++;
> - kvm_vcpu_on_spin(vcpu);
> + kvm_vcpu_on_spin(vcpu, false);
>   } else {
>   trace_kvm_wfx(*vcpu_pc(vcpu), false);
>   vcpu->stat.wfi_exit_stat++;
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 17d8a1677a0b..da57622cacca 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -84,7 +84,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>   if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
>   trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
>   vcpu->stat.wfe_exit_stat++;
> - kvm_vcpu_on_spin(vcpu);
> + kvm_vcpu_on_spin(vcpu, false);
>   } else {
>   trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
>   vcpu->stat.wfi_exit_stat++;
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index d4b2ad18eef2..70208bed5a15 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -98,6 +98,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>   return !!(vcpu->arch.pending_exceptions);
>  }

Re: [PATCH v2 2/4] KVM: X86: implement the logic for spinlock optimization

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 15:30, Paolo Bonzini wrote:

> On 08/08/2017 06:05, Longpeng(Mike) wrote:
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index cd0e6e6..dec5e8a 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -1268,7 +1268,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>  
>>  switch (code) {
>>  case HVCALL_NOTIFY_LONG_SPIN_WAIT:
>> -kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
>> +kvm_vcpu_on_spin(vcpu, kvm_x86_ops->spin_in_kernel(vcpu));
>>  break;
>>  case HVCALL_POST_MESSAGE:
>>  case HVCALL_SIGNAL_EVENT:
> 
> This can be true as well.  I can change this on commit.
> 


Thanks,
I hope you could help me fix the same problem in patch 1 (s390) too.

> Paolo
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 15:41, Cornelia Huck wrote:

> On Tue, 8 Aug 2017 12:05:31 +0800
> "Longpeng(Mike)" <longpe...@huawei.com> wrote:
> 
>> This is a simple optimization for kvm_vcpu_on_spin, the
>> main idea is described in patch-1's commit msg.
> 
> I think this generally looks good now.
> 
>>
>> I did some tests base on the RFC version, the result shows
>> that it can improves the performance slightly.
> 
> Did you re-run tests on this version?


Hi Cornelia,

I didn't re-run the tests on V2. But the major difference between the RFC and
V2 is that V2 only caches the result for x86 (s390/arm don't need it), and V2
saves an expensive operation (440-1400 cycles on my test machine) for x86/VMX.

So I think V2's performance is at least the same as RFC or even slightly
better. :)

> 
> I would also like to see some s390 numbers; unfortunately I only have a
> z/VM environment and any performance numbers would be nearly useless
> there. Maybe somebody within IBM with a better setup can run a quick
> test?
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



[PATCH v2 3/4] KVM: s390: implements the kvm_arch_vcpu_in_kernel()

2017-08-07 Thread Longpeng(Mike)
This implements the kvm_arch_vcpu_in_kernel() for s390.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/s390/kvm/kvm-s390.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 0b0c689..e46177b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2449,7 +2449,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
 
-- 
1.8.3.1




[PATCH v2 2/4] KVM: X86: implement the logic for spinlock optimization

2017-08-07 Thread Longpeng(Mike)
1. Implement kvm_arch_vcpu_in_kernel(). Because get_cpl() requires
vcpu_load(), we must cache the result (whether the vcpu was preempted
while its CPL was 0) in kvm_vcpu_arch.

2. Add a ->spin_in_kernel hook, because VMX can benefit from it.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/hyperv.c   |  2 +-
 arch/x86/kvm/svm.c  |  8 +++-
 arch/x86/kvm/vmx.c  | 16 +++-
 arch/x86/kvm/x86.c  |  7 ++-
 5 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87ac4fb..d2b2d57 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -688,6 +688,9 @@ struct kvm_vcpu_arch {
 
/* GPA available (AMD only) */
bool gpa_available;
+
+   /* be preempted when it's in kernel-mode(cpl=0) */
+   bool preempted_in_kernel;
 };
 
 struct kvm_lpage_info {
@@ -1057,6 +1060,8 @@ struct kvm_x86_ops {
void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
 
void (*setup_mce)(struct kvm_vcpu *vcpu);
+
+   bool (*spin_in_kernel)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index cd0e6e6..dec5e8a 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1268,7 +1268,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 
switch (code) {
case HVCALL_NOTIFY_LONG_SPIN_WAIT:
-   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
+   kvm_vcpu_on_spin(vcpu, kvm_x86_ops->spin_in_kernel(vcpu));
break;
case HVCALL_POST_MESSAGE:
case HVCALL_SIGNAL_EVENT:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e6ed24e..ccb6df7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3751,7 +3751,7 @@ static int pause_interception(struct vcpu_svm *svm)
 {
struct kvm_vcpu *vcpu = &(svm->vcpu);
 
-   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
+   kvm_vcpu_on_spin(vcpu, kvm_x86_ops->spin_in_kernel(vcpu));
return 1;
 }
 
@@ -5364,6 +5364,11 @@ static void svm_setup_mce(struct kvm_vcpu *vcpu)
vcpu->arch.mcg_cap &= 0x1ff;
 }
 
+static bool svm_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return svm_get_cpl(vcpu) == 0;
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.cpu_has_kvm_support = has_svm,
.disabled_by_bios = is_disabled,
@@ -5476,6 +5481,7 @@ static void svm_setup_mce(struct kvm_vcpu *vcpu)
.deliver_posted_interrupt = svm_deliver_avic_intr,
.update_pi_irte = svm_update_pi_irte,
.setup_mce = svm_setup_mce,
+   .spin_in_kernel = svm_spin_in_kernel,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9d6223a..297a158 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6761,7 +6761,8 @@ static int handle_pause(struct kvm_vcpu *vcpu)
if (ple_gap)
grow_ple_window(vcpu);
 
-   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
+   /* See comments in vmx_spin_in_kernel() */
+   kvm_vcpu_on_spin(vcpu, true);
return kvm_skip_emulated_instruction(vcpu);
 }
 
@@ -11636,6 +11637,17 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
~FEATURE_CONTROL_LMCE;
 }
 
+static bool vmx_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   /*
+* Intel sdm vol3 ch-25.1.3 says: The “PAUSE-loop exiting”
+* VM-execution control is ignored if CPL > 0. OTOH, KVM
+* never set PAUSE_EXITING and just set PLE if supported,
+* so the vcpu must be CPL=0 if it gets a PAUSE exit.
+*/
+   return true;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -11763,6 +11775,8 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 #endif
 
.setup_mce = vmx_setup_mce,
+
+   .spin_in_kernel = vmx_spin_in_kernel,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4430be6..28299b9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2881,6 +2881,10 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu 
*vcpu)
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
int idx;
+
+   if (vcpu->preempted)
+   vcpu->arch.preempted_in_kernel = !kvm_x86_ops->get_cpl(vcpu);
+
/*
 * Disable page faults because we're in atomic context here.
 * kvm_write_guest_offset_cached() would call might_fault()
@@ -7992,6 +7996,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
kvm_pmu_init(vcpu);
 
vcpu->arch.pending_external_vector = -1;
+   vcpu->arch.preempted_in_kernel = false;
 
kvm_hv_vcpu_init(vcpu);
 
@

[PATCH v2 1/4] KVM: add spinlock optimization framework

2017-08-07 Thread Longpeng(Mike)
If the vcpu (me) exits due to requesting a user-mode spinlock, then
the spinlock holder may have been preempted in user mode or kernel mode.

But if the vcpu (me) is in kernel mode, then the holder must have been
preempted in kernel mode, so we should choose a vcpu in kernel mode
as the most eligible candidate.

This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel mode when it is preempted or takes a spinlock exit.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/arm/kvm/handle_exit.c   | 2 +-
 arch/arm64/kvm/handle_exit.c | 2 +-
 arch/mips/kvm/mips.c | 6 ++
 arch/powerpc/kvm/powerpc.c   | 6 ++
 arch/s390/kvm/diag.c | 2 +-
 arch/s390/kvm/kvm-s390.c | 6 ++
 arch/x86/kvm/hyperv.c| 2 +-
 arch/x86/kvm/svm.c   | 4 +++-
 arch/x86/kvm/vmx.c   | 2 +-
 arch/x86/kvm/x86.c   | 6 ++
 include/linux/kvm_host.h | 3 ++-
 virt/kvm/arm/arm.c   | 5 +
 virt/kvm/kvm_main.c  | 4 +++-
 13 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index 54442e3..a7ea5db 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -67,7 +67,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE) {
trace_kvm_wfx(*vcpu_pc(vcpu), true);
vcpu->stat.wfe_exit_stat++;
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
} else {
trace_kvm_wfx(*vcpu_pc(vcpu), false);
vcpu->stat.wfi_exit_stat++;
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 17d8a16..d6c8cb6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -84,7 +84,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
vcpu->stat.wfe_exit_stat++;
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
} else {
trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
vcpu->stat.wfi_exit_stat++;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index d4b2ad1..70208be 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -98,6 +98,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return !!(vcpu->arch.pending_exceptions);
 }
 
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return 1;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1a75c0b..6184c45 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -58,6 +58,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
 }
 
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return 1;
diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c
index ce865bd..4ea8c38 100644
--- a/arch/s390/kvm/diag.c
+++ b/arch/s390/kvm/diag.c
@@ -150,7 +150,7 @@ static int __diag_time_slice_end(struct kvm_vcpu *vcpu)
 {
VCPU_EVENT(vcpu, 5, "%s", "diag time slice end");
vcpu->stat.diagnose_44++;
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
return 0;
 }
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index af09d34..0b0c689 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2447,6 +2447,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_s390_vcpu_has_irq(vcpu, 0);
 }
 
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
+
 void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
 {
atomic_or(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 337b6d2..cd0e6e6 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1268,7 +1268,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 
switch (code) {
case HVCALL_NOTIFY_LONG_SPIN_WAIT:
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
break;
case HVCALL_POST_MESSAGE:
case HVCALL_SIGNAL_EVENT:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1107626..e6ed24e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3749,7 +3749,9 @@ static int interrupt_window_interception(struct vcpu_svm

[PATCH v2 1/4] KVM: add spinlock optimization framework

2017-08-07 Thread Longpeng(Mike)
If the vcpu(me) exit due to request a usermode spinlock, then
the spinlock-holder may be preempted in usermode or kernmode.

But if the vcpu(me) is in kernmode, then the holder must be
preempted in kernmode, so we should choose a vcpu in kernmode
as the most eligible candidate.

This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted or spinlock exit.

Signed-off-by: Longpeng(Mike) 
---
 arch/arm/kvm/handle_exit.c   | 2 +-
 arch/arm64/kvm/handle_exit.c | 2 +-
 arch/mips/kvm/mips.c | 6 ++
 arch/powerpc/kvm/powerpc.c   | 6 ++
 arch/s390/kvm/diag.c | 2 +-
 arch/s390/kvm/kvm-s390.c | 6 ++
 arch/x86/kvm/hyperv.c| 2 +-
 arch/x86/kvm/svm.c   | 4 +++-
 arch/x86/kvm/vmx.c   | 2 +-
 arch/x86/kvm/x86.c   | 6 ++
 include/linux/kvm_host.h | 3 ++-
 virt/kvm/arm/arm.c   | 5 +
 virt/kvm/kvm_main.c  | 4 +++-
 13 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index 54442e3..a7ea5db 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -67,7 +67,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE) {
trace_kvm_wfx(*vcpu_pc(vcpu), true);
vcpu->stat.wfe_exit_stat++;
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
} else {
trace_kvm_wfx(*vcpu_pc(vcpu), false);
vcpu->stat.wfi_exit_stat++;
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 17d8a16..d6c8cb6 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -84,7 +84,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
vcpu->stat.wfe_exit_stat++;
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
} else {
trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
vcpu->stat.wfi_exit_stat++;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index d4b2ad1..70208be 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -98,6 +98,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return !!(vcpu->arch.pending_exceptions);
 }
 
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return 1;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1a75c0b..6184c45 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -58,6 +58,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
 }
 
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return 1;
diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c
index ce865bd..4ea8c38 100644
--- a/arch/s390/kvm/diag.c
+++ b/arch/s390/kvm/diag.c
@@ -150,7 +150,7 @@ static int __diag_time_slice_end(struct kvm_vcpu *vcpu)
 {
VCPU_EVENT(vcpu, 5, "%s", "diag time slice end");
vcpu->stat.diagnose_44++;
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
return 0;
 }
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index af09d34..0b0c689 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2447,6 +2447,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_s390_vcpu_has_irq(vcpu, 0);
 }
 
+bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_in_kernel);
+
 void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
 {
atomic_or(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 337b6d2..cd0e6e6 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1268,7 +1268,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 
switch (code) {
case HVCALL_NOTIFY_LONG_SPIN_WAIT:
-   kvm_vcpu_on_spin(vcpu);
+   kvm_vcpu_on_spin(vcpu, kvm_arch_vcpu_in_kernel(vcpu));
break;
case HVCALL_POST_MESSAGE:
case HVCALL_SIGNAL_EVENT:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1107626..e6ed24e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3749,7 +3749,9 @@ static int interrupt_window_interception(struct vcpu_svm 
*svm

[PATCH v2 4/4] KVM: arm: implements the kvm_arch_vcpu_in_kernel()

2017-08-07 Thread Longpeng(Mike)
This implements kvm_arch_vcpu_in_kernel() for ARM.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 virt/kvm/arm/arm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 862f820..b9f68e4 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -418,7 +418,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return vcpu_mode_priv(vcpu);
 }
 
 /* Just ensure a guest exit from a particular CPU */
-- 
1.8.3.1




[PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-07 Thread Longpeng(Mike)
This is a simple optimization for kvm_vcpu_on_spin; the
main idea is described in patch 1's commit message.

I did some tests based on the RFC version, and the results show
that it improves the performance slightly.

== Geekbench-3.4.1 ==
VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19),
running Geekbench-3.4.1 *10 runs*
VM2/VM3/VM4: configuration is the same as VM1,
each vcpu's usage (seen by top in the guest) is stressed to 40%

The comparison of each testcase's score:
(higher is better)
before  after   improve
Integer
 single 1176.7  1179.0  0.2%
 multi  3459.5  3426.5  -0.9%
Float
 single 1150.5  1150.9  0.0%
 multi  3364.5  3391.9  0.8%
Memory(stream)
 single 1768.7  1773.1  0.2%
 multi  2511.6  2557.2  1.8%
Overall
 single 1284.2  1286.2  0.2%
 multi  3231.4  3238.4  0.2%


== kernbench-0.42 ==
VM1: 8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19),
running "kernbench -n 10"
VM2/VM3/VM4: configuration is the same as VM1,
each vcpu's usage (seen by top in the guest) is stressed to 40%

The comparison of 'Elapsed Time':
(lower is better)
before  after   improve
load -j4    12.762  12.751  0.1%
load -j32   9.743   8.955   8.1%
load -j 9.688   9.229   4.7%


Physical Machine:
  Architecture:  x86_64
  CPU op-mode(s):32-bit, 64-bit
  Byte Order:Little Endian
  CPU(s):24
  On-line CPU(s) list:   0-23
  Thread(s) per core:2
  Core(s) per socket:6
  Socket(s): 2
  NUMA node(s):  2
  Vendor ID: GenuineIntel
  CPU family:6
  Model: 45
  Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  Stepping:  7
  CPU MHz:   2799.902
  BogoMIPS:  5004.67
  Virtualization:VT-x
  L1d cache: 32K
  L1i cache: 32K
  L2 cache:  256K
  L3 cache:  15360K
  NUMA node0 CPU(s): 0-5,12-17
  NUMA node1 CPU(s): 6-11,18-23

---
Changes since V1:
 - split the implementation of s390 & arm. [David]
 - refactor the impls according to the suggestion. [Paolo]

Changes since RFC:
 - only cache result for X86. [David & Cornelia & Paolo]
 - add performance numbers. [David]
 - impls arm/s390. [Christoffer & David]
 - refactor the impls. [me]

---
Longpeng(Mike) (4):
  KVM: add spinlock optimization framework
  KVM: X86: implement the logic for spinlock optimization
  KVM: s390: implements the kvm_arch_vcpu_in_kernel()
  KVM: arm: implements the kvm_arch_vcpu_in_kernel()

 arch/arm/kvm/handle_exit.c  |  2 +-
 arch/arm64/kvm/handle_exit.c|  2 +-
 arch/mips/kvm/mips.c|  6 ++
 arch/powerpc/kvm/powerpc.c  |  6 ++
 arch/s390/kvm/diag.c|  2 +-
 arch/s390/kvm/kvm-s390.c|  6 ++
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/hyperv.c   |  2 +-
 arch/x86/kvm/svm.c  | 10 +-
 arch/x86/kvm/vmx.c  | 16 +++-
 arch/x86/kvm/x86.c  | 11 +++
 include/linux/kvm_host.h|  3 ++-
 virt/kvm/arm/arm.c  |  5 +
 virt/kvm/kvm_main.c |  4 +++-
 14 files changed, 72 insertions(+), 8 deletions(-)

-- 
1.8.3.1




Re: [PATCH 2/3] KVM: X86: implement the logic for spinlock optimization

2017-08-07 Thread Longpeng(Mike)



On 08/07/2017 06:45 PM, Paolo Bonzini wrote:

On 07/08/2017 10:44, Longpeng(Mike) wrote:

+
+   /*
+* Intel sdm vol3 ch-25.1.3 says: The “PAUSE-loop exiting”
+* VM-execution control is ignored if CPL > 0. So the vcpu
+* is always exiting with CPL=0 if it uses PLE.


This is not true (how can it be?).  What 25.1.3 says is, the VCPU is
always at CPL=0 if you get a PAUSE exit (reason 40) and PAUSE exiting is
0 (it always is for KVM).  But here you're looking for a VCPU that
didn't get a PAUSE exit, so the CPL can certainly be 3.



Hi Paolo,

My comment above is somewhat wrong (please forgive my poor English); my
original meaning is:

The “PAUSE-loop exiting” VM-execution control is ignored if
CPL > 0. So the vcpu's CPL must be 0 if it exits due to PLE.

* kvm_arch_spin_in_kernel() returns whether the vcpu (which exits due to
a spinlock) is at CPL=0. It is only called by kvm_vcpu_on_spin(), and the
input vcpu is 'me', which has just taken a PAUSE exit. *


I split kvm_arch_vcpu_in_kernel() (from the RFC) into two functions:
kvm_arch_spin_in_kernel() and kvm_arch_preempt_in_kernel().



Because KVM/VMX L1 never sets CPU_BASED_PAUSE_EXITING and only sets
SECONDARY_EXEC_PAUSE_LOOP_EXITING if it is supported, for L1 we:
1. get a PAUSE exit with CPL=0 if PLE is supported
2. never get a PAUSE exit if PLE is not supported

So I think it can directly return true (CPL=0) if PLE is supported.

But for nested KVM/VMX (I'm not familiar with nested), L1 could set
CPU_BASED_PAUSE_EXITING, so I think get_cpl() is also needed.



If the above is correct, what about this way (we can save a vmcs_read
operation for L1):


kvm_arch_vcpu_spin_in_kernel(vcpu)
{
if (!is_guest_mode(vcpu))
return true;

return vmx_get_cpl(vcpu) == 0;
}

kvm_vcpu_on_spin()
{
/* @me get a PAUSE exit */
me_in_kernel = kvm_arch_vcpu_spin_in_kernel(me);
...
for each vcpu {
...
if (me_in_kernel && !...preempt_in_kernel(vcpu))
continue;
...
}
...
}

---
Regards,
Longpeng(Mike)


However, I understand that vmx_get_cpl can be a bit slow here.  You can
actually read SS's access rights directly in this function and get the
DPL from there, that's going to be just a single VMREAD.

The only difference is when vmx->rmode.vm86_active=1.  However,
pause-loop exiting is not working properly anyway if
vmx->rmode.vm86_active=1, because CPL=3 according to the processor.

Paolo
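
As a concrete sketch of that single-VMREAD suggestion (assuming the
GUEST_SS_AR_BYTES field and the VMX_AR_DPL() macro from vmx.c; the
function name here is made up):

static bool vmx_vcpu_in_kernel_fast(struct kvm_vcpu *vcpu)
{
	/* SS's access-rights field carries the DPL: a single VMREAD. */
	u32 ss_ar = vmcs_read32(GUEST_SS_AR_BYTES);

	return VMX_AR_DPL(ss_ar) == 0;
}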


+* The following block needs less cycles than vmx_get_cpl().
+*/
+   if (cpu_has_secondary_exec_ctrls())
+   secondary_exec_ctrl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+   if (secondary_exec_ctrl & SECONDARY_EXEC_PAUSE_LOOP_EXITING)
+   return true;
+


Paolo



Re: [PATCH 1/3] KVM: add spinlock-exiting optimize framework

2017-08-07 Thread Longpeng (Mike)


On 2017/8/7 16:55, David Hildenbrand wrote:

> On 07.08.2017 10:44, Longpeng(Mike) wrote:
>> If the vcpu(me) exit due to request a usermode spinlock, then
>> the spinlock-holder may be preempted in usermode or kernmode.
>>
>> But if the vcpu(me) is in kernmode, then the holder must be
>> preempted in kernmode, so we should choose a vcpu in kernmode
>> as the most eligible candidate.
>>
>> For some architecture(e.g. arm/s390), spin/preempt_in_kernel()
>> are the same, but they are different for X86.
>>
>> Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
>> ---
>>  arch/mips/kvm/mips.c   | 10 ++
>>  arch/powerpc/kvm/powerpc.c | 10 ++
>>  arch/s390/kvm/kvm-s390.c   | 10 ++
>>  arch/x86/kvm/x86.c | 10 ++
>>  include/linux/kvm_host.h   |  2 ++
>>  virt/kvm/arm/arm.c | 10 ++
>>  virt/kvm/kvm_main.c|  4 
>>  7 files changed, 56 insertions(+)
>>
>> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
>> index d4b2ad1..e04e6b3 100644
>> --- a/arch/mips/kvm/mips.c
>> +++ b/arch/mips/kvm/mips.c
>> @@ -98,6 +98,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  return !!(vcpu->arch.pending_exceptions);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>> +bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {
>>  return 1;
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 1a75c0b..c573ddd 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -58,6 +58,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>  return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>> +bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {
>>  return 1;
>> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
>> index af09d34..f78cdc2 100644
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -2447,6 +2447,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  return kvm_s390_vcpu_has_irq(vcpu, 0);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>> +bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
>>  {
>>  atomic_or(PROG_BLOCK_SIE, >arch.sie_block->prog20);
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6c97c82..04c6a1f 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -8435,6 +8435,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>> +bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {
>>  return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 890b706..9613620 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -798,6 +798,8 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
>> *vcpu,
>>  void kvm_arch_hardware_unsetup(void);
>>  void kvm_arch_check_processor_compat(void *rtn);
>>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
>> +bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu);
>> +bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu);
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
>>  
>>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index a39a1e1..e45f780 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -416,6 +416,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
&

Re: [PATCH 3/3] KVM: implement spinlock optimization logic for arm/s390

2017-08-07 Thread Longpeng (Mike)

On 2017/8/7 16:52, David Hildenbrand wrote:

> On 07.08.2017 10:44, Longpeng(Mike) wrote:
>> Implements the kvm_arch_vcpu_spin/preempt_in_kernel() for arm/s390,
>> they needn't cache the result.
>>
>> Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
>> ---
>>  arch/s390/kvm/kvm-s390.c | 4 ++--
>>  virt/kvm/arm/arm.c   | 4 ++--
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
>> index f78cdc2..49b9178 100644
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -2449,12 +2449,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  
>>  bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
>>  {
>> -return false;
>> +return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE);
>>  }
>>  
>>  bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
>>  {
>> -return false;
>> +return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE);
>>  }
>>  
>>  void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index e45f780..956f025 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -418,12 +418,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>  
>>  bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
>>  {
>> -return false;
>> +return vcpu_mode_priv(vcpu);
>>  }
>>  
>>  bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
>>  {
>> -return false;
>> +return vcpu_mode_priv(vcpu);
>>  }
>>  
>>  /* Just ensure a guest exit from a particular CPU */
>>
> 
> Can you split that into two parts? (arm and s390x?)


OK, I'll split in V2. :)

> 


-- 
Regards,
Longpeng(Mike)



[PATCH 1/3] KVM: add spinlock-exiting optimize framework

2017-08-07 Thread Longpeng(Mike)
If the vcpu (me) exits because it is spinning on a usermode spinlock,
then the spinlock holder may have been preempted in either usermode or
kernmode.

But if the vcpu (me) is in kernmode, then the holder must have been
preempted in kernmode, so we should choose a vcpu in kernmode as the
most eligible candidate.

For some architectures (e.g. arm/s390), spin_in_kernel() and
preempt_in_kernel() are the same, but they are different for X86.
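
For illustration, an abbreviated sketch of how the two hooks are meant
to be used (the kvm_main.c hunk below is truncated; the real loop also
checks preemption, runnability and directed-yield eligibility):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu;
	bool in_kern = kvm_arch_vcpu_spin_in_kernel(me);
	int i;

	kvm_for_each_vcpu(i, vcpu, me->kvm) {
		if (vcpu == me)
			continue;
		/* An in-kernel spinner skips holders preempted in usermode. */
		if (in_kern && !kvm_arch_vcpu_preempt_in_kernel(vcpu))
			continue;
		if (kvm_vcpu_yield_to(vcpu) > 0)
			break;
	}
}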

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/mips/kvm/mips.c   | 10 ++
 arch/powerpc/kvm/powerpc.c | 10 ++
 arch/s390/kvm/kvm-s390.c   | 10 ++
 arch/x86/kvm/x86.c | 10 ++
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/arm/arm.c | 10 ++
 virt/kvm/kvm_main.c|  4 
 7 files changed, 56 insertions(+)

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index d4b2ad1..e04e6b3 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -98,6 +98,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return !!(vcpu->arch.pending_exceptions);
 }
 
+bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
+bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return 1;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1a75c0b..c573ddd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -58,6 +58,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
 }
 
+bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
+bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return 1;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index af09d34..f78cdc2 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2447,6 +2447,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_s390_vcpu_has_irq(vcpu, 0);
 }
 
+bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
+bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
 void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
 {
atomic_or(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c97c82..04c6a1f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8435,6 +8435,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
 }
 
+bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
+bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 890b706..9613620 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -798,6 +798,8 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
*vcpu,
 void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
+bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu);
+bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index a39a1e1..e45f780 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -416,6 +416,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
&& !v->arch.power_off && !v->arch.pause);
 }
 
+bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
+bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return false;
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f3f7427..0d0527b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2324,12 +2324,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
+   bool in_kern;
int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
int yielded = 0;
int try = 3;
int pass;
int i;
 
+   in_kern = kvm_arch_vcpu_spin_in_kernel(me);
kvm_vcpu_set_in_spin_loop(me, true);
/*
 * We boost the priority of a VCPU that is runnable but not
@@ -2351,6 +2353,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (swait_active(&vcpu->wq) && 
!kvm_arch_v

[PATCH 3/3] KVM: implement spinlock optimization logic for arm/s390

2017-08-07 Thread Longpeng(Mike)
Implements kvm_arch_vcpu_spin/preempt_in_kernel() for arm/s390;
they needn't cache the result.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/s390/kvm/kvm-s390.c | 4 ++--
 virt/kvm/arm/arm.c   | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index f78cdc2..49b9178 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2449,12 +2449,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE);
 }
 
 bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE);
 }
 
 void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index e45f780..956f025 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -418,12 +418,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 
 bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return vcpu_mode_priv(vcpu);
 }
 
 bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return vcpu_mode_priv(vcpu);
 }
 
 /* Just ensure a guest exit from a particular CPU */
-- 
1.8.3.1




[PATCH 2/3] KVM: X86: implement the logic for spinlock optimization

2017-08-07 Thread Longpeng(Mike)
Implements kvm_arch_vcpu_spin/preempt_in_kernel() for x86. Because
get_cpl requires vcpu_load, we must cache the result (whether the vcpu
was preempted while its CPL was 0) in struct kvm_vcpu_arch.

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/svm.c  |  6 ++
 arch/x86/kvm/vmx.c  | 20 
 arch/x86/kvm/x86.c  |  9 +++--
 4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87ac4fb..d2b2d57 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -688,6 +688,9 @@ struct kvm_vcpu_arch {
 
/* GPA available (AMD only) */
bool gpa_available;
+
+   /* be preempted when it's in kernel-mode(cpl=0) */
+   bool preempted_in_kernel;
 };
 
 struct kvm_lpage_info {
@@ -1057,6 +1060,8 @@ struct kvm_x86_ops {
void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
 
void (*setup_mce)(struct kvm_vcpu *vcpu);
+
+   bool (*spin_in_kernel)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4d8141e..552ab4c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5352,6 +5352,11 @@ static void svm_setup_mce(struct kvm_vcpu *vcpu)
vcpu->arch.mcg_cap &= 0x1ff;
 }
 
+static bool svm_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   return svm_get_cpl(vcpu) == 0;
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.cpu_has_kvm_support = has_svm,
.disabled_by_bios = is_disabled,
@@ -5464,6 +5469,7 @@ static void svm_setup_mce(struct kvm_vcpu *vcpu)
.deliver_posted_interrupt = svm_deliver_avic_intr,
.update_pi_irte = svm_update_pi_irte,
.setup_mce = svm_setup_mce,
+   .spin_in_kernel = svm_spin_in_kernel,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 39a6222..d0dfe2e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11547,6 +11547,25 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
~FEATURE_CONTROL_LMCE;
 }
 
+static bool vmx_spin_in_kernel(struct kvm_vcpu *vcpu)
+{
+   u32 secondary_exec_ctrl = 0;
+
+   /*
+* Intel sdm vol3 ch-25.1.3 says: The “PAUSE-loop exiting”
+* VM-execution control is ignored if CPL > 0. So the vcpu
+* is always exiting with CPL=0 if it uses PLE.
+*
+* The following block needs less cycles than vmx_get_cpl().
+*/
+   if (cpu_has_secondary_exec_ctrls())
+   secondary_exec_ctrl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+   if (secondary_exec_ctrl & SECONDARY_EXEC_PAUSE_LOOP_EXITING)
+   return true;
+
+   return vmx_get_cpl(vcpu) == 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -11674,6 +11693,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 #endif
 
.setup_mce = vmx_setup_mce,
+   .spin_in_kernel = vmx_spin_in_kernel,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04c6a1f..fa79a60 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2881,6 +2881,10 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu 
*vcpu)
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
int idx;
+
+   if (vcpu->preempted)
+   vcpu->arch.preempted_in_kernel = !kvm_x86_ops->get_cpl(vcpu);
+
/*
 * Disable page faults because we're in atomic context here.
 * kvm_write_guest_offset_cached() would call might_fault()
@@ -7988,6 +7992,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
kvm_pmu_init(vcpu);
 
vcpu->arch.pending_external_vector = -1;
+   vcpu->arch.preempted_in_kernel = false;
 
kvm_hv_vcpu_init(vcpu);
 
@@ -8437,12 +8442,12 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_vcpu_spin_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return kvm_x86_ops->spin_in_kernel(vcpu);
 }
 
 bool kvm_arch_vcpu_preempt_in_kernel(struct kvm_vcpu *vcpu)
 {
-   return false;
+   return vcpu->arch.preempted_in_kernel;
 }
 
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
-- 
1.8.3.1




[PATCH 0/3] KVM: optimize the kvm_vcpu_on_spin

2017-08-07 Thread Longpeng(Mike)
This is a simple optimization for kvm_vcpu_on_spin; the
main idea is described in patch 1's commit message.

I did some tests based on the RFC version, and the results show
that it improves the performance slightly.

== Geekbench-3.4.1 ==
VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19),
running Geekbench-3.4.1 *10 runs*
VM2/VM3/VM4: configuration is the same as VM1,
each vcpu's usage (seen by top in the guest) is stressed to 40%

The comparison of each testcase's score:
(higher is better)
before  after   improve
Integer
 single 1176.7  1179.0  0.2%
 multi  3459.5  3426.5  -0.9%
Float
 single 1150.5  1150.9  0.0%
 multi  3364.5  3391.9  0.8%
Memory(stream)
 single 1768.7  1773.1  0.2%
 multi  2511.6  2557.2  1.8%
Overall
 single 1284.2  1286.2  0.2%
 multi  3231.4  3238.4  0.2%


== kernbench-0.42 ==
VM1: 8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19),
running "kernbench -n 10"
VM2/VM3/VM4: configuration is the same as VM1,
each vcpu's usage (seen by top in the guest) is stressed to 40%

The comparison of 'Elapsed Time':
(lower is better)
before  after   improve
load -j4    12.762  12.751  0.1%
load -j32   9.743   8.955   8.1%
load -j 9.688   9.229   4.7%


Physical Machine:
  Architecture:  x86_64
  CPU op-mode(s):32-bit, 64-bit
  Byte Order:Little Endian
  CPU(s):24
  On-line CPU(s) list:   0-23
  Thread(s) per core:2
  Core(s) per socket:6
  Socket(s): 2
  NUMA node(s):  2
  Vendor ID: GenuineIntel
  CPU family:6
  Model: 45
  Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  Stepping:  7
  CPU MHz:   2799.902
  BogoMIPS:  5004.67
  Virtualization:VT-x
  L1d cache: 32K
  L1i cache: 32K
  L2 cache:  256K
  L3 cache:  15360K
  NUMA node0 CPU(s): 0-5,12-17
  NUMA node1 CPU(s): 6-11,18-23

---
Changes since RFC:
 - only cache result for X86. [David & Cornelia & Paolo]
 - add performance numbers. [David]
 - impls arm/s390. [Christoffer & David]
 - refactor the impls. [me]

---
Longpeng(Mike) (3):
  KVM: add spinlock-exiting optimize framework
  KVM: X86: implement the logic for spinlock optimization
  KVM: implement spinlock optimization logic for arm/s390

 arch/mips/kvm/mips.c| 10 ++
 arch/powerpc/kvm/powerpc.c  | 10 ++
 arch/s390/kvm/kvm-s390.c| 10 ++
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/svm.c  |  6 ++
 arch/x86/kvm/vmx.c  | 20 
 arch/x86/kvm/x86.c  | 15 +++
 include/linux/kvm_host.h|  2 ++
 virt/kvm/arm/arm.c  | 10 ++
 virt/kvm/kvm_main.c |  4 
 10 files changed, 92 insertions(+)

-- 
1.8.3.1




[PATCH] KVM: X86: init irq->level in kvm_pv_kick_cpu_op

2017-08-01 Thread Longpeng(Mike)
'lapic_irq' is a local variable and its 'level' field isn't
initialized, so 'level' holds a random value. This doesn't matter
functionally, but it makes UBSAN unhappy:

UBSAN: Undefined behaviour in .../lapic.c:...
load of value 10 is not a valid value for type '_Bool'
...
Call Trace:
 [] dump_stack+0x1e/0x20
 [] ubsan_epilogue+0x12/0x55
 [] __ubsan_handle_load_invalid_value+0x118/0x162
 [] kvm_apic_set_irq+0xc3/0xf0 [kvm]
 [] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
 [] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
 [] kvm_emulate_hypercall+0x62e/0x760 [kvm]
 [] handle_vmcall+0x1a/0x30 [kvm_intel]
 [] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
...
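
As an aside (an illustration only, not the approach taken by this
patch): a designated initializer would zero every unnamed field of the
local, including 'level':

	struct kvm_lapic_irq lapic_irq = {
		.dest_id	= apicid,
		.msi_redir_hint	= false,
	};

The one-line fix below keeps the existing field-by-field style instead.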

Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c97c82..b411f92 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6215,6 +6215,7 @@ static void kvm_pv_kick_cpu_op(struct kvm *kvm, unsigned 
long flags, int apicid)
 
lapic_irq.shorthand = 0;
lapic_irq.dest_mode = 0;
+   lapic_irq.level = 0;
lapic_irq.dest_id = apicid;
lapic_irq.msi_redir_hint = false;
 
-- 
1.8.3.1




Re: [RFC] KVM: optimize the kvm_vcpu_on_spin

2017-07-31 Thread Longpeng (Mike)


On 2017/7/31 21:20, Paolo Bonzini wrote:

> On 31/07/2017 14:27, David Hildenbrand wrote:
>>> I'm not sure whether the operation of get the vcpu's priority-level is
>>> expensive on all architectures, so I record it in kvm_sched_out() for
>>> minimal the extra cycles cost in kvm_vcpu_on_spin().
>>>
>> as you only care for x86 right now either way, you can directly optimize
>> here for the good (here: x86) case (keeping changes and therefore
>> possible bugs minimal).
> 
> I agree with Cornelia that this is inconsistent, so you shouldn't update
> me->in_kernmode in kvm_vcpu_on_spin.  However, get_cpl requires
> vcpu_load on Intel x86, so Mike's patch is necessary (vmx_get_cpl ->
> vmx_read_guest_seg_ar -> vmcs_read32).
> 

Hi Paolo,

It seems that other architectures (e.g. arm/s390) needn't cache the result,
but x86 does, so I need to move 'in_kernmode' into kvm_vcpu_arch and only add
this field for x86, right?
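
For instance, just a sketch of that idea (the field name and where the
hook lives are assumptions at this point, not a final patch):

	/* in struct kvm_vcpu_arch (x86 only) */
	bool in_kernmode;

	/* recorded at sched-out / kvm_arch_vcpu_put() time */
	static void record_in_kernmode(struct kvm_vcpu *vcpu)
	{
		if (vcpu->preempted)
			vcpu->arch.in_kernmode = !kvm_x86_ops->get_cpl(vcpu);
	}

Then kvm_vcpu_on_spin() can check the cached value for candidate vcpus
without needing vcpu_load.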

> Alternatively, we can add a new callback kvm_x86_ops->sched_out to x86
> KVM, and call vmx_get_cpl from the Intel implementation (vmx_sched_out).


In this approach, vmx_sched_out would only call vmx_get_cpl; isn't that a
bit redundant, since we could just call kvm_x86_ops->get_cpl at the right
place instead?

>  This will cache the result until the next sched_in, so that


'until the next sched_in' --> Do we need to clear the result in sched_in?

> kvm_vcpu_on_spin can use it.
> 
> Paolo
> 
> .
> 


-- 
Regards,
Longpeng(Mike)
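
A rough userspace-only sketch of the callback shape Paolo suggests above; all
names here are hypothetical and heavily simplified (this is not the real
kvm_x86_ops or VMX code). The arch hook snapshots whether the vcpu was in
kernel mode at sched-out, and the spin loop later reads that snapshot;
whether sched-in should clear it again is exactly the open question in this
reply.

#include <stdbool.h>
#include <stdio.h>

struct toy_vcpu;

struct toy_arch_ops {
	int  (*get_cpl)(struct toy_vcpu *v);   /* expensive on real hardware */
	void (*sched_out)(struct toy_vcpu *v); /* the hook being proposed    */
};

struct toy_vcpu {
	int cpl;                 /* stand-in for the guest privilege level */
	bool in_kernel_cached;   /* snapshot taken by ->sched_out()        */
	const struct toy_arch_ops *ops;
};

static int toy_get_cpl(struct toy_vcpu *v)
{
	return v->cpl;           /* pretend this is a VMCS read */
}

static void toy_sched_out(struct toy_vcpu *v)
{
	v->in_kernel_cached = (v->ops->get_cpl(v) == 0);
}

static const struct toy_arch_ops toy_ops = {
	.get_cpl   = toy_get_cpl,
	.sched_out = toy_sched_out,
};

/* The spin loop only reads the snapshot; it never does the expensive read. */
static bool candidate_in_kernel(const struct toy_vcpu *v)
{
	return v->in_kernel_cached;
}

int main(void)
{
	struct toy_vcpu v = { .cpl = 0, .ops = &toy_ops };

	v.ops->sched_out(&v);
	printf("cached in_kernel = %d\n", candidate_in_kernel(&v));
	return 0;
}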



Re: [RFC] KVM: optimize the kvm_vcpu_on_spin

2017-07-31 Thread Longpeng (Mike)


On 2017/7/31 21:22, Christoffer Dall wrote:

> On Sat, Jul 29, 2017 at 02:22:57PM +0800, Longpeng(Mike) wrote:
>> We discussed the idea here:
>> https://www.spinics.net/lists/kvm/msg140593.html
> 
> This is not a very nice way to start a commit description.
> 
> Please provide the necessary background to understand your change
> directly in the commit message.
> 
>>
>> I think it's also suitable for other architectures.
>>
> 
> I think this sentence can go in the end of the commit message together
> with your explanation of only doing this for x86.
> 


OK :)

> By the way, the ARM solution should be pretty simple:
> 
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index a39a1e1..b9f68e4 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -416,6 +416,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>   && !v->arch.power_off && !v->arch.pause);
>  }
>  
> +bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
> +{
> + return vcpu_mode_priv(vcpu);
> +}
> +
>  /* Just ensure a guest exit from a particular CPU */
>  static void exit_vm_noop(void *info)
>  {
> 
> 
> I am also curious in the workload you use to measure this and how I can
> evaluate the benefit on ARM?
> 


We tested this with the SpecVirt testsuite and saw no improvement (but no
regression either), because spinlock contention isn't a major factor in that
testsuite.

Currently I don't have any performance numbers to show that the patch makes
sense, but I'll do some tests later.

> Thanks,
> -Christoffer
> 
>> If the vcpu (me) exits because it is spinning on a usermode spinlock, then
>> the spinlock holder may have been preempted in either usermode or kernmode.
>> But if the vcpu (me) is in kernmode, then the holder must have been
>> preempted in kernmode, so we should choose a vcpu in kernmode
>> as the most eligible candidate.
>>
>> PS: I only implement the x86 arch currently because I'm not familiar
>> with the other architectures.
>>
>> Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
>> ---
>>  arch/mips/kvm/mips.c   | 5 +
>>  arch/powerpc/kvm/powerpc.c | 5 +
>>  arch/s390/kvm/kvm-s390.c   | 5 +
>>  arch/x86/kvm/x86.c | 5 +
>>  include/linux/kvm_host.h   | 4 
>>  virt/kvm/arm/arm.c | 5 +
>>  virt/kvm/kvm_main.c| 9 -
>>  7 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
>> index d4b2ad1..2e2701d 100644
>> --- a/arch/mips/kvm/mips.c
>> +++ b/arch/mips/kvm/mips.c
>> @@ -98,6 +98,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  return !!(vcpu->arch.pending_exceptions);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_kernmode(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {
>>  return 1;
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 1a75c0b..2489f64 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -58,6 +58,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>  return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_kernmode(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {
>>  return 1;
>> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
>> index 3f2884e..9d7c42e 100644
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -2443,6 +2443,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  return kvm_s390_vcpu_has_irq(vcpu, 0);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_kernmode(struct kvm_vcpu *vcpu)
>> +{
>> +return false;
>> +}
>> +
>>  void kvm_s390_vcpu_block(struct kvm_vcpu *vcpu)
>>  {
>>  atomic_or(PROG_BLOCK_SIE, &vcpu->arch.sie_block->prog20);
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 82a63c5..b5a2e53 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -8435,6 +8435,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>  return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
>>  }
>>  
>> +bool kvm_arch_vcpu_spin_kernmode(struct kvm_vcpu *vcpu)
>> +{
>> +return kvm_x86_ops->get_cpl(vcpu) == 0;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {

Re: [RFC] KVM: optimize the kvm_vcpu_on_spin

2017-07-31 Thread Longpeng (Mike)


On 2017/7/31 20:31, Cornelia Huck wrote:

> On Mon, 31 Jul 2017 20:08:14 +0800
> "Longpeng (Mike)" <longpe...@huawei.com> wrote:
> 
>> Hi David,
>>
>> On 2017/7/31 19:31, David Hildenbrand wrote:
> 
>>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>>> index 648b34c..f8f0d74 100644
>>>> --- a/include/linux/kvm_host.h
>>>> +++ b/include/linux/kvm_host.h
>>>> @@ -272,6 +272,9 @@ struct kvm_vcpu {
>>>>} spin_loop;
>>>>  #endif
>>>>bool preempted;
>>>> +  /* If vcpu is in kernel-mode when preempted */
>>>> +  bool in_kernmode;
>>>> +  
>>>
>>> Why do you have to store that ...
>>>   
>>
>>> [...]> +me->in_kernmode = kvm_arch_vcpu_spin_kernmode(me);
>>>>kvm_vcpu_set_in_spin_loop(me, true);
>>>>/*
>>>> * We boost the priority of a VCPU that is runnable but not
>>>> @@ -2351,6 +2353,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>>continue;
>>>>if (swait_active(&vcpu->wq) && 
>>>> !kvm_arch_vcpu_runnable(vcpu))
>>>>continue;
>>>> +  if (me->in_kernmode && !vcpu->in_kernmode)  
>>>
>>> Wouldn't it be easier to simply have
>>>
>>> in_kernel = kvm_arch_vcpu_in_kernel(me);
>>> ...
>>> if (in_kernel && !kvm_arch_vcpu_in_kernel(vcpu))
>>> ...
>>>   
>>
>> I'm not sure whether getting the vcpu's privilege level is expensive on
>> all architectures, so I record it in kvm_sched_out() to minimize the
>> extra cycles spent in kvm_vcpu_on_spin().
> 
> As it is now, this handling looks a bit inconsistent. You only update
> the field on sched-out via preemption _or_ if kvm_vcpu_on_spin is
> called for the vcpu. In most contexts, this field will have stale
> content.
> 
> Also, would checking for kernel mode be more expensive than the various
> other checks already done in this function?
> 

> [I like David's suggestion.]
> 


Hi Cornelia & David,

I'll take your suggestion, thanks :)

>>
>>>> +  continue;
>>>>if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>>>>continue;
>>>>  
> 
> .
> 


-- 
Regards,
Longpeng(Mike)
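
Cornelia's staleness point can be seen with a tiny self-contained example
(hypothetical names, not kernel code): a flag snapshotted only at preempted
sched-out can disagree with the vcpu's actual mode by the time it is read,
which is why querying the mode directly, as David suggests, is more robust.

#include <stdbool.h>
#include <stdio.h>

struct toy_vcpu {
	int cpl;            /* current privilege level (0 = kernel)       */
	bool in_kernmode;   /* snapshot taken only on preempted sched-out */
};

int main(void)
{
	struct toy_vcpu v = { .cpl = 0 };

	v.in_kernmode = (v.cpl == 0);   /* sched-out via preemption */

	v.cpl = 3;                      /* vcpu later returns to user mode,
					 * but nothing refreshes the snapshot */

	printf("snapshot says in_kernel=%d, reality says in_kernel=%d\n",
	       v.in_kernmode, v.cpl == 0);
	return 0;
}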



Re: [RFC] KVM: optimize the kvm_vcpu_on_spin

2017-07-31 Thread Longpeng (Mike)
Hi David,

On 2017/7/31 19:31, David Hildenbrand wrote:

> [no idea if this change makes sense (and especially if it has any bad
> side effects), do you have performance numbers? I'll just have a look at
> the general structure of the patch in the meanwhile]
> 

I don't have any test results yet; could you give me some suggestions about
which benchmarks would be suitable?

>> +bool kvm_arch_vcpu_spin_kernmode(struct kvm_vcpu *vcpu)
> 
> kvm_arch_vcpu_in_kernel() ?
> 

Um...yes, I'll correct this.

>> +{
>> +return kvm_x86_ops->get_cpl(vcpu) == 0;
>> +}
>> +
>>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>>  {
>>  return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 648b34c..f8f0d74 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -272,6 +272,9 @@ struct kvm_vcpu {
>>  } spin_loop;
>>  #endif
>>  bool preempted;
>> +/* If vcpu is in kernel-mode when preempted */
>> +bool in_kernmode;
>> +
> 
> Why do you have to store that ...
> 

> [...]> +  me->in_kernmode = kvm_arch_vcpu_spin_kernmode(me);
>>  kvm_vcpu_set_in_spin_loop(me, true);
>>  /*
>>   * We boost the priority of a VCPU that is runnable but not
>> @@ -2351,6 +2353,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  continue;
>>  if (swait_active(&vcpu->wq) && 
>> !kvm_arch_vcpu_runnable(vcpu))
>>  continue;
>> +if (me->in_kernmode && !vcpu->in_kernmode)
> 
> Wouldn't it be easier to simply have
> 
> in_kernel = kvm_arch_vcpu_in_kernel(me);
> ...
> if (in_kernel && !kvm_arch_vcpu_in_kernel(vcpu))
> ...
> 

I'm not sure whether getting the vcpu's privilege level is expensive on all
architectures, so I record it in kvm_sched_out() to minimize the extra cycles
spent in kvm_vcpu_on_spin().

>> +continue;
>>  if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>>  continue;
>>  
>> @@ -4009,8 +4013,11 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>>  {
>>  struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>>  
>> -if (current->state == TASK_RUNNING)
>> +if (current->state == TASK_RUNNING) {
>>  vcpu->preempted = true;
>> +vcpu->in_kernmode = kvm_arch_vcpu_spin_kernmode(vcpu);
>> +}
>> +
> 
> so you don't have to do this change, too.
> 
>>  kvm_arch_vcpu_put(vcpu);
>>  }
>>  
>>
> 
> 


-- 
Regards,
Longpeng(Mike)


