RE: [PATCH 1/2] swiotlb: Remove alloc_size argument to swiotlb_tbl_map_single()

2024-05-06 Thread Michael Kelley
From: mhkelle...@gmail.com 
> 

Gentle ping ...

Anyone interested in reviewing this series of two patches?  It fixes
an edge case bug in the size of the swiotlb request coming from
dma-iommu, and plugs a hole that allows untrusted devices to see
kernel data unrelated to the intended DMA transfer.  I think these are
the last "known bugs" that came out of the extensive swiotlb discussion
and patches for 6.9.

Michael

> Currently swiotlb_tbl_map_single() takes alloc_align_mask and
> alloc_size arguments to specify an swiotlb allocation that is
> larger than mapping_size. This larger allocation is used solely
> by iommu_dma_map_page() to handle untrusted devices that should
> not have DMA visibility to memory pages that are partially used
> for unrelated kernel data.
> 
> Having two arguments to specify the allocation is redundant. While
> alloc_align_mask naturally specifies the alignment of the starting
> address of the allocation, it can also implicitly specify the size
> by rounding up the mapping_size to that alignment.
> 
> Additionally, the current approach has an edge case bug.
> iommu_dma_map_page() already does the rounding up to compute the
> alloc_size argument. But swiotlb_tbl_map_single() then calculates
> the alignment offset based on the DMA min_align_mask, and adds
> that offset to alloc_size. If the offset is non-zero, the addition
> may result in a value that is larger than the max the swiotlb can
> allocate. If the rounding up is done _after_ the alignment offset is
> added to the mapping_size (and the original mapping_size conforms to
> the value returned by swiotlb_max_mapping_size), then the max that the
> swiotlb can allocate will not be exceeded.
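
In other words, the fix is a change in the order of operations. A minimal
sketch with simplified names (not the actual patch code; the real logic in
swiotlb_tbl_map_single() is more involved):

    /* offset of the original buffer within the min_align_mask granule */
    unsigned int offset = orig_addr & dma_get_min_align_mask(dev);

    /* Old ordering: round up first, then add the offset; the sum can
     * exceed the swiotlb maximum allocation size.
     */
    alloc_size = ALIGN(mapping_size, alloc_align_mask + 1) + offset;

    /* New ordering: add the offset first, then round up; the result stays
     * within the maximum as long as mapping_size respects
     * swiotlb_max_mapping_size().
     */
    alloc_size = ALIGN(mapping_size + offset, alloc_align_mask + 1);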
> 
> In view of these issues, simplify the swiotlb_tbl_map_single() interface
> by removing the alloc_size argument. Most call sites pass the same
> value for mapping_size and alloc_size, and they pass alloc_align_mask
> as zero. Just remove the redundant argument from these callers, as they
> will see no functional change. For iommu_dma_map_page() also remove
> the alloc_size argument, and have swiotlb_tbl_map_single() compute
> the alloc_size by rounding up mapping_size after adding the offset
> based on min_align_mask. This has the side effect of fixing the
> edge case bug but with no other functional change.
> 
> Also add a sanity test on the alloc_align_mask. While IOMMU code
> currently ensures the granule is not larger than PAGE_SIZE, if
> that guarantee were to be removed in the future, the downstream
> effect on the swiotlb might go unnoticed until strange allocation
> failures occurred.
> 
> Tested on an ARM64 system with 16K page size and some kernel
> test-only hackery to allow modifying the DMA min_align_mask and
> the granule size that becomes the alloc_align_mask. Tested these
> combinations with a variety of original memory addresses and
> sizes, including those that reproduce the edge case bug:
> 
> * 4K granule and 0 min_align_mask
> * 4K granule and 0xFFF min_align_mask (4K - 1)
> * 16K granule and 0xFFF min_align_mask
> * 64K granule and 0xFFF min_align_mask
> * 64K granule and 0x3FFF min_align_mask (16K - 1)
> 
> With the changes, all combinations pass.
> 
> Signed-off-by: Michael Kelley 
> ---
> I haven't used any "Fixes:" tags. This patch really should be
> backported only if all the other recent swiotlb fixes get backported,
> and I'm unclear on whether that will happen.
> 
> I saw the brief discussion about removing the "dir" parameter from
> swiotlb_tbl_map_single(). That removal could easily be done as part
> of this patch, since it's already changing the swiotlb_tbl_map_single()
> parameters. But I think the conclusion of the discussion was to leave
> the "dir" parameter for symmetry with the swiotlb_sync_*() functions.
> Please correct me if that's wrong, and I'll respin this patch to do
> the removal.
> 
>  drivers/iommu/dma-iommu.c |  2 +-
>  drivers/xen/swiotlb-xen.c |  2 +-
>  include/linux/swiotlb.h   |  2 +-
>  kernel/dma/swiotlb.c  | 56 +--
>  4 files changed, 45 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 07d087eecc17..c21ef1388499 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1165,7 +1165,7 @@ static dma_addr_t iommu_dma_map_page(struct device 
> *dev, struct page *page,
>   trace_swiotlb_bounced(dev, phys, size);
> 
>   aligned_size = iova_align(iovad, size);
> - phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> + phys = swiotlb_tbl_map_single(dev, phys, size,
>   

RE: [PATCH 1/2] swiotlb: Remove alloc_size argument to swiotlb_tbl_map_single()

2024-04-15 Thread Michael Kelley
From: Petr Tesařík  Sent: Monday, April 15, 2024 5:50 AM
> 
> On Mon, 15 Apr 2024 12:23:22 +0000
> Michael Kelley  wrote:
> 
> > From: Petr Tesařík  Sent: Monday, April 15, 2024 4:46 AM
> > >
> > > Hi Michael,
> > >
> > > sorry for taking so long to answer. Yes, there was no agreement on the
> > > removal of the "dir" parameter, but I'm not sure it's because of
> > > symmetry with swiotlb_sync_*(), because the topic was not really
> > > discussed.
> > >
> > > The discussion was about the KUnit test suite and whether direction is
> > > a property of the bounce buffer or of each sync operation. Since the DMA API
> > > associates each DMA buffer with a direction, the direction
> > > parameter passed to swiotlb_sync_*() should match what was passed to
> > > swiotlb_tbl_map_single(), because that's how it is used by the generic
> > > DMA code. In other words, if the parameter is kept, it should be kept
> > > to match dma_map_*().
> > >
> > > However, there is also symmetry with swiotlb_tbl_unmap_single(). This
> > > function does use the parameter for the final sync. I believe there
> > > should be a matching initial sync in swiotlb_tbl_map_single(). In
> > > short, the buffer sync for DMA non-coherent devices should be moved from
> > > swiotlb_map() to swiotlb_tbl_map_single(). If this sync is not needed,
> > > then the caller can (and should) include DMA_ATTR_SKIP_CPU_SYNC in
> > > the flags parameter.
> > >
> > > To sum it up:
> > >
> > > * Do *NOT* remove the "dir" parameter.
> > > * Let me send a patch which moves the initial buffer sync.
> > >
> >
> > I'm not seeing the need to move the initial buffer sync.  All
> > callers of swiotlb_tbl_map_single() already have a subsequent
> > check for a non-coherent device, and a call to
> > arch_sync_dma_for_device().  And the Xen code has some
> > special handling that probably shouldn't go in
> > swiotlb_tbl_map_single().  Or am I missing something?
> 
> Oh, sure, there's nothing broken ATM. It's merely a cleanup. The API is
> asymmetric and thus confusing. You get a final sync by default if you
> call swiotlb_tbl_unmap_single(), 

I don't see that final sync in swiotlb_tbl_unmap_single().  It calls
swiotlb_bounce() to copy the data, but it doesn't deal with
non-coherent devices or call arch_sync_dma_for_cpu().

> but you don't get an initial sync by
> default if you call swiotlb_tbl_map_single(). This is difficult to
> remember, so potential new users of the API may incorrectly assume that
> an initial sync is done, or that a final sync is not done.
> 
> And yes, when moving the code, all current users of
> swiotlb_tbl_map_single() should specify DMA_ATTR_SKIP_CPU_SYNC.
> 
> Petr T


RE: [PATCH 1/2] swiotlb: Remove alloc_size argument to swiotlb_tbl_map_single()

2024-04-15 Thread Michael Kelley
From: Petr Tesařík  Sent: Monday, April 15, 2024 4:46 AM
> 
> Hi Michael,
> 
> sorry for taking so long to answer. Yes, there was no agreement on the
> removal of the "dir" parameter, but I'm not sure it's because of
> symmetry with swiotlb_sync_*(), because the topic was not really
> discussed.
> 
> The discussion was about the KUnit test suite and whether direction is
> a property of the bounce buffer or of each sync operation. Since the DMA API
> associates each DMA buffer with a direction, the direction
> parameter passed to swiotlb_sync_*() should match what was passed to
> swiotlb_tbl_map_single(), because that's how it is used by the generic
> DMA code. In other words, if the parameter is kept, it should be kept
> to match dma_map_*().
> 
> However, there is also symmetry with swiotlb_tbl_unmap_single(). This
> function does use the parameter for the final sync. I believe there
> should be a matching initial sync in swiotlb_tbl_map_single(). In
> short, the buffer sync for DMA non-coherent devices should be moved from
> swiotlb_map() to swiotlb_tbl_map_single(). If this sync is not needed,
> then the caller can (and should) include DMA_ATTR_SKIP_CPU_SYNC in
> the flags parameter.
> 
> To sum it up:
> 
> * Do *NOT* remove the "dir" parameter.
> * Let me send a patch which moves the initial buffer sync.
> 

I'm not seeing the need to move the initial buffer sync.  All
callers of swiotlb_tbl_map_single() already have a subsequent
check for a non-coherent device, and a call to 
arch_sync_dma_for_device().  And the Xen code has some 
special handling that probably shouldn't go in
swiotlb_tbl_map_single().  Or am I missing something?
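
For reference, the pattern the callers follow today looks roughly like this
(paraphrased from swiotlb_map() in kernel/dma/swiotlb.c, with the argument
list abbreviated):

    swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, ...);
    if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
            return DMA_MAPPING_ERROR;

    /* the initial sync under discussion, done by the caller today */
    if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
            arch_sync_dma_for_device(swiotlb_addr, size, dir);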

Michael



RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

2023-07-11 Thread Michael Kelley (LINUX)
From: Petr Tesařík  Sent: Monday, July 10, 2023 2:36 AM
> 
> On Sat, 8 Jul 2023 15:18:32 +0000
> "Michael Kelley (LINUX)"  wrote:
> 
> > From: Petr Tesařík  Sent: Friday, July 7, 2023 3:22 AM
> > >
> > > On Fri, 7 Jul 2023 10:29:00 +0100
> > > Greg Kroah-Hartman  wrote:
> > >
> > > > On Thu, Jul 06, 2023 at 02:22:50PM +, Michael Kelley (LINUX) wrote:
> > > > > From: Greg Kroah-Hartman  Sent: Thursday, July 6, 2023 1:07 AM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 03:50:55AM +, Michael Kelley (LINUX) 
> > > > > > wrote:
> > > > > > > From: Petr Tesarik  Sent: Tuesday, June 27, 2023 2:54 AM
> > > > > > > >
> > > > > > > > Try to allocate a transient memory pool if no suitable slots 
> > > > > > > > can be found,
> > > > > > > > except when allocating from a restricted pool. The transient 
> > > > > > > > pool is just
> > > > > > > > big enough for this one bounce buffer. It is inserted into a 
> > > > > > > > per-device
> > > > > > > > list of transient memory pools, and it is freed again when the 
> > > > > > > > bounce
> > > > > > > > buffer is unmapped.
> > > > > > > >
> > > > > > > > Transient memory pools are kept in an RCU list. A memory 
> > > > > > > > barrier is
> > > > > > > > required after adding a new entry, because any address within a 
> > > > > > > > transient
> > > > > > > > buffer must be immediately recognized as belonging to the 
> > > > > > > > SWIOTLB, even if
> > > > > > > > it is passed to another CPU.
> > > > > > > >
> > > > > > > > Deletion does not require any synchronization beyond RCU 
> > > > > > > > ordering
> > > > > > > > guarantees. After a buffer is unmapped, its physical addresses 
> > > > > > > > may no
> > > > > > > > longer be passed to the DMA API, so the memory range of the 
> > > > > > > > corresponding
> > > > > > > > stale entry in the RCU list never matches. If the memory range 
> > > > > > > > gets
> > > > > > > > allocated again, then it happens only after a RCU quiescent 
> > > > > > > > state.
> > > > > > > >
> > > > > > > > Since bounce buffers can now be allocated from different pools, 
> > > > > > > > add a
> > > > > > > > parameter to swiotlb_alloc_pool() to let the caller know which 
> > > > > > > > memory pool
> > > > > > > > is used. Add swiotlb_find_pool() to find the memory pool 
> > > > > > > > corresponding to
> > > > > > > > an address. This function is now also used by 
> > > > > > > > is_swiotlb_buffer(), because
> > > > > > > > a simple boundary check is no longer sufficient.
> > > > > > > >
> > > > > > > > The logic in swiotlb_alloc_tlb() is taken from 
> > > > > > > > __dma_direct_alloc_pages(),
> > > > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > > > >
> > > > > > > > Note that this is not the most efficient way to provide a 
> > > > > > > > bounce buffer,
> > > > > > > > but when a DMA buffer can't be mapped, something may (and will) 
> > > > > > > > actually
> > > > > > > > break. At that point it is better to make an allocation, even 
> > > > > > > > if it may be
> > > > > > > > an expensive operation.
> > > > > > >
> > > > > > > I continue to think about swiotlb memory management from the 
> > > > > > > standpoint
> > > > > > > of CoCo VMs that may be quite large with high network and storage 
> > > > > > > loads.
> > > > > > > These VMs are often running mission-critical workloads that

RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

2023-07-08 Thread Michael Kelley (LINUX)
From: Petr Tesařík  Sent: Friday, July 7, 2023 3:22 AM
> 
> On Fri, 7 Jul 2023 10:29:00 +0100
> Greg Kroah-Hartman  wrote:
> 
> > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > From: Greg Kroah-Hartman  Sent: Thursday, July 6, 2023 1:07 AM
> > > >
> > > > On Thu, Jul 06, 2023 at 03:50:55AM +, Michael Kelley (LINUX) wrote:
> > > > > From: Petr Tesarik  Sent: Tuesday, June 27, 2023 2:54 AM
> > > > > >
> > > > > > Try to allocate a transient memory pool if no suitable slots can be 
> > > > > > found,
> > > > > > except when allocating from a restricted pool. The transient pool 
> > > > > > is just
> > > > > > big enough for this one bounce buffer. It is inserted into a 
> > > > > > per-device
> > > > > > list of transient memory pools, and it is freed again when the 
> > > > > > bounce
> > > > > > buffer is unmapped.
> > > > > >
> > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > required after adding a new entry, because any address within a 
> > > > > > transient
> > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, 
> > > > > > even if
> > > > > > it is passed to another CPU.
> > > > > >
> > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > guarantees. After a buffer is unmapped, its physical addresses may 
> > > > > > no
> > > > > > longer be passed to the DMA API, so the memory range of the 
> > > > > > corresponding
> > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > >
> > > > > > Since bounce buffers can now be allocated from different pools, add 
> > > > > > a
> > > > > > parameter to swiotlb_alloc_pool() to let the caller know which 
> > > > > > memory pool
> > > > > > is used. Add swiotlb_find_pool() to find the memory pool 
> > > > > > corresponding to
> > > > > > an address. This function is now also used by is_swiotlb_buffer(), 
> > > > > > because
> > > > > > a simple boundary check is no longer sufficient.
> > > > > >
> > > > > > The logic in swiotlb_alloc_tlb() is taken from 
> > > > > > __dma_direct_alloc_pages(),
> > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > >
> > > > > > Note that this is not the most efficient way to provide a bounce 
> > > > > > buffer,
> > > > > > but when a DMA buffer can't be mapped, something may (and will) 
> > > > > > actually
> > > > > > break. At that point it is better to make an allocation, even if it 
> > > > > > may be
> > > > > > an expensive operation.
> > > > >
> > > > > I continue to think about swiotlb memory management from the 
> > > > > standpoint
> > > > > of CoCo VMs that may be quite large with high network and storage 
> > > > > loads.
> > > > > These VMs are often running mission-critical workloads that can't 
> > > > > tolerate
> > > > > a bounce buffer allocation failure.  To prevent such failures, the 
> > > > > swiotlb
> > > > > memory size must be overly large, which wastes memory.
> > > >
> > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > no control over other vms in that same system, then you have worse
> > > > problems, sorry.
> > > >
> > > > Just don't do that.
> > > >
> > >
> > > No, the cases I'm concerned about don't involve memory overcommit.
> > >
> > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
> > > code in the Linux guest allocates a configurable, but fixed, amount of 
> > > guest
> > > memory at boot time for this purpose.  But it's hard to know how much
> > > swiotlb bounce buffer memory will be needed 

RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

2023-07-06 Thread Michael Kelley (LINUX)
From: Greg Kroah-Hartman  Sent: Thursday, July 6, 2023 1:07 AM
> 
> On Thu, Jul 06, 2023 at 03:50:55AM +, Michael Kelley (LINUX) wrote:
> > From: Petr Tesarik  Sent: Tuesday, June 27, 2023 2:54 AM
> > >
> > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > except when allocating from a restricted pool. The transient pool is just
> > > big enough for this one bounce buffer. It is inserted into a per-device
> > > list of transient memory pools, and it is freed again when the bounce
> > > buffer is unmapped.
> > >
> > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > required after adding a new entry, because any address within a transient
> > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > it is passed to another CPU.
> > >
> > > Deletion does not require any synchronization beyond RCU ordering
> > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > longer be passed to the DMA API, so the memory range of the corresponding
> > > stale entry in the RCU list never matches. If the memory range gets
> > > allocated again, then it happens only after a RCU quiescent state.
> > >
> > > Since bounce buffers can now be allocated from different pools, add a
> > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > a simple boundary check is no longer sufficient.
> > >
> > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > simplified and enhanced to use coherent memory pools if needed.
> > >
> > > Note that this is not the most efficient way to provide a bounce buffer,
> > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > break. At that point it is better to make an allocation, even if it may be
> > > an expensive operation.
> >
> > I continue to think about swiotlb memory management from the standpoint
> > of CoCo VMs that may be quite large with high network and storage loads.
> > These VMs are often running mission-critical workloads that can't tolerate
> > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > memory size must be overly large, which wastes memory.
> 
> If "mission critical workloads" are in a vm that allowes overcommit and
> no control over other vms in that same system, then you have worse
> problems, sorry.
> 
> Just don't do that.
> 

No, the cases I'm concerned about don't involve memory overcommit.

CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
code in the Linux guest allocates a configurable, but fixed, amount of guest
memory at boot time for this purpose.  But it's hard to know how much
swiotlb bounce buffer memory will be needed to handle peak I/O loads.
This patch set does dynamic allocation of swiotlb bounce buffer memory,
which can help avoid needing to configure an overly large fixed size at boot.

Michael



RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

2023-07-05 Thread Michael Kelley (LINUX)
From: Petr Tesarik  Sent: Tuesday, June 27, 2023 2:54 AM
> 
> Try to allocate a transient memory pool if no suitable slots can be found,
> except when allocating from a restricted pool. The transient pool is just
> big enough for this one bounce buffer. It is inserted into a per-device
> list of transient memory pools, and it is freed again when the bounce
> buffer is unmapped.
> 
> Transient memory pools are kept in an RCU list. A memory barrier is
> required after adding a new entry, because any address within a transient
> buffer must be immediately recognized as belonging to the SWIOTLB, even if
> it is passed to another CPU.
> 
> Deletion does not require any synchronization beyond RCU ordering
> guarantees. After a buffer is unmapped, its physical addresses may no
> longer be passed to the DMA API, so the memory range of the corresponding
> stale entry in the RCU list never matches. If the memory range gets
> allocated again, then it happens only after a RCU quiescent state.
> 
> Since bounce buffers can now be allocated from different pools, add a
> parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> an address. This function is now also used by is_swiotlb_buffer(), because
> a simple boundary check is no longer sufficient.
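
The address-to-pool lookup this implies is essentially a walk of the
per-device RCU list. A simplified sketch (member names are approximate,
not the exact code from the patch):

    struct io_tlb_pool *pool, *found = NULL;

    rcu_read_lock();
    list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
            if (paddr >= pool->start && paddr < pool->end) {
                    found = pool;    /* owning transient pool */
                    break;
            }
    }
    rcu_read_unlock();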
> 
> The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> simplified and enhanced to use coherent memory pools if needed.
> 
> Note that this is not the most efficient way to provide a bounce buffer,
> but when a DMA buffer can't be mapped, something may (and will) actually
> break. At that point it is better to make an allocation, even if it may be
> an expensive operation.

I continue to think about swiotlb memory management from the standpoint
of CoCo VMs that may be quite large with high network and storage loads.
These VMs are often running mission-critical workloads that can't tolerate
a bounce buffer allocation failure.  To prevent such failures, the swiotlb
memory size must be overly large, which wastes memory.

Your new approach helps by using the coherent memory pools as an overflow
space.   But in a lot of ways, it only pushes the problem around.  As you
noted in your cover letter, reducing the initial size of the swiotlb might
require increasing the size of the coherent pools.

What might be really useful is to pend bounce buffer requests while the
new worker thread is adding more swiotlb pools.  Of course, requests made
from interrupt level can't be pended, but at least in my experience with large
CoCo VMs, storage I/O is the biggest consumer of bounce buffers.  A lot
(most?) storage requests make the swiotlb_map() call in a context where
it is OK to pend.   If the coherent pool overflow space could be used only
for swiotlb_map() calls that can't pend, it's more likely to be sufficient to
bridge the gap until new pools are added.

Could swiotlb code detect if it's OK to pend, and then pend a bounce
buffer request until the worker thread adds a new pool?  Even an overly
conservative check would help reduce pressure on the coherent pools
as overflow space.
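
As a rough illustration of what an "overly conservative" test could look
like (purely hypothetical, not from the patch series):

    /* Hypothetical helper: pend only when clearly in sleepable context. */
    static bool swiotlb_can_wait_for_pool(void)
    {
            return !in_interrupt() && !irqs_disabled() && !in_atomic();
    }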

Michael

> 
> Signed-off-by: Petr Tesarik 
> ---
>  include/linux/device.h  |   4 +
>  include/linux/dma-mapping.h |   2 +
>  include/linux/swiotlb.h |  13 +-
>  kernel/dma/direct.c |   2 +-
>  kernel/dma/swiotlb.c| 265 ++--
>  5 files changed, 272 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 83081aa99e6a..a1ee4c5924b8 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -510,6 +510,8 @@ struct device_physical_location {
>   * @dma_mem: Internal for coherent mem override.
>   * @cma_area:Contiguous memory area for dma allocations
>   * @dma_io_tlb_mem: Software IO TLB allocator.  Not for driver use.
> + * @dma_io_tlb_pools:List of transient swiotlb memory pools.
> + * @dma_io_tlb_lock: Protects changes to the list of active pools.
>   * @archdata:For arch-specific additions.
>   * @of_node: Associated device tree node.
>   * @fwnode:  Associated device node supplied by platform firmware.
> @@ -615,6 +617,8 @@ struct device {
>  #endif
>  #ifdef CONFIG_SWIOTLB
>   struct io_tlb_mem *dma_io_tlb_mem;
> + struct list_head dma_io_tlb_pools;
> + spinlock_t dma_io_tlb_lock;
>  #endif
>   /* arch specific additions */
>   struct dev_archdata archdata;
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 0ee20b764000..c36c5a546787 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -417,6 +417,8 @@ static inline void dma_sync_sgtable_for_device(struct 
> device
> *dev,
>  #define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 
> 0)
>  #define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0)
> 
> +bool 

RE: [patch V2 38/38] x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it

2023-05-06 Thread Michael Kelley (LINUX)
From: Thomas Gleixner  Sent: Saturday, May 6, 2023 9:23 AM
> 
> On Sat, May 06 2023 at 00:53, Michael Kelley wrote:
> > From: Thomas Gleixner  Sent: Thursday, May 4, 2023 12:03 PM
> > [snip]
> >
> >> @@ -934,10 +961,10 @@ static void announce_cpu(int cpu, int ap
> >>if (!node_width)
> >>node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
> >>
> >> -  if (cpu == 1)
> >> -  printk(KERN_INFO "x86: Booting SMP configuration:\n");
> >> -
> >>if (system_state < SYSTEM_RUNNING) {
> >> +  if (num_online_cpus() == 1)
> >
> > Unfortunately, this new check doesn't work.  Here's the output I get:
> >
> > [0.721384] smp: Bringing up secondary CPUs ...
> > [0.725359] smpboot: x86: Booting SMP configuration:
> > [0.729249]  node  #0, CPUs:#2
> > [0.729654] smpboot: x86: Booting SMP configuration:
> > [0.737247]   #4
> >
> > Evidently num_online_cpus() isn't updated until after all the primary
> > siblings get started.
> 
> Duh. Where is that brown paperbag?
> 
> > When booting with cpuhp.parallel=0, the output is good.
> 
> Exactly that was on the command line when I quickly booted that change :(
> 
> The below should fix it for real.
> 
> Thanks,
> 
> tglx
> ---
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -951,9 +951,9 @@ static int wakeup_secondary_cpu_via_init
>  /* reduce the number of lines printed when booting a large cpu count system 
> */
>  static void announce_cpu(int cpu, int apicid)
>  {
> + static int width, node_width, first = 1;
>   static int current_node = NUMA_NO_NODE;
>   int node = early_cpu_to_node(cpu);
> - static int width, node_width;
> 
>   if (!width)
>   width = num_digits(num_possible_cpus()) + 1; /* + '#' sign */
> @@ -962,7 +962,7 @@ static void announce_cpu(int cpu, int ap
>   node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
> 
>   if (system_state < SYSTEM_RUNNING) {
> - if (num_online_cpus() == 1)
> + if (first)
>   pr_info("x86: Booting SMP configuration:\n");
> 
>   if (node != current_node) {
> @@ -975,11 +975,11 @@ static void announce_cpu(int cpu, int ap
>   }
> 
>   /* Add padding for the BSP */
> - if (num_online_cpus() == 1)
> + if (first)
>   pr_cont("%*s", width + 1, " ");
> + first = 0;
> 
>   pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
> -
>   } else
>   pr_info("Booting Node %d Processor %d APIC 0x%x\n",
>   node, cpu, apicid);

This works.  dmesg output is clean for these guest VM combinations
on Hyper-V that I tested:

* Normal VM:  16 vCPUs in 1 NUMA node and 32 vCPUs in 2 NUMA nodes
* Same configs for a SEV-SNP Confidential VM with paravisor

Tested with and without cpuhp.parallel=0

For the entire series:
Tested-by: Michael Kelley 



RE: [patch V2 38/38] x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it

2023-05-05 Thread Michael Kelley (LINUX)
From: Thomas Gleixner  Sent: Thursday, May 4, 2023 12:03 PM
> 
> Implement the validation function which tells the core code whether
> parallel bringup is possible.
> 
> The only condition for now is that the kernel does not run in an encrypted
> guest as these will trap the RDMSR via #VC, which cannot be handled at that
> point in early startup.
> 
> There was an earlier variant for AMD-SEV which used the GHBC protocol for
> retrieving the APIC ID via CPUID, but there is no guarantee that the
> initial APIC ID in CPUID is the same as the real APIC ID. There is no
> enforcement from the secure firmware and the hypervisor can assign APIC IDs
> as it sees fit as long as the ACPI/MADT table is consistent with that
> assignment.
> 
> Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling
> AMD-SEV guests for parallel startup needs some more thought.
> 
> Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside
> the scope of this change.
> 
> Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of
> CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart.
> 
> [ mikelley: Reported the announce_cpu() fallout ]
> 
> Originally-by: David Woodhouse 
> Signed-off-by: Thomas Gleixner 
> ---
> V2: Fixup announce_cpu() - Michael Kelley
> ---
>  arch/x86/Kconfig |3 +
>  arch/x86/kernel/cpu/common.c |6 ---
>  arch/x86/kernel/smpboot.c|   83 
> ---

[snip]

> @@ -934,10 +961,10 @@ static void announce_cpu(int cpu, int ap
>   if (!node_width)
>   node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
> 
> - if (cpu == 1)
> - printk(KERN_INFO "x86: Booting SMP configuration:\n");
> -
>   if (system_state < SYSTEM_RUNNING) {
> + if (num_online_cpus() == 1)

Unfortunately, this new check doesn't work.  Here's the output I get:

[0.721384] smp: Bringing up secondary CPUs ...
[0.725359] smpboot: x86: Booting SMP configuration:
[0.729249]  node  #0, CPUs:#2
[0.729654] smpboot: x86: Booting SMP configuration:
[0.737247]   #4
[0.737511] smpboot: x86: Booting SMP configuration:
[0.741246]   #6
[0.741508] smpboot: x86: Booting SMP configuration:
[0.745248]   #8
[0.745507] smpboot: x86: Booting SMP configuration:
[0.749250]  #10
[0.749514] smpboot: x86: Booting SMP configuration:
[0.753248]  #12
[0.753492] smpboot: x86: Booting SMP configuration:
[0.757249]  #14  #1  #3  #5  #7  #9 #11 #13 #15
[0.769317] smp: Brought up 1 node, 16 CPUs
[0.773246] smpboot: Max logical packages: 1
[0.777257] smpboot: Total of 16 processors activated (78253.79 BogoMIPS)

Evidently num_online_cpus() isn't updated until after all the primary
siblings get started.

When booting with cpuhp.parallel=0, the output is good.

Michael

> + pr_info("x86: Booting SMP configuration:\n");
> +
>   if (node != current_node) {
>   if (current_node > (-1))
>   pr_cont("\n");
> @@ -948,7 +975,7 @@ static void announce_cpu(int cpu, int ap
>   }
> 
>   /* Add padding for the BSP */
> - if (cpu == 1)
> + if (num_online_cpus() == 1)
>   pr_cont("%*s", width + 1, " ");
> 
>   pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);


RE: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-02 Thread Michael Kelley (LINUX)
From: Juergen Gross  Sent: Tuesday, May 2, 2023 5:09 AM
> 
> This series tries to fix the rather special case of PAT being available
> without having MTRRs (either due to CONFIG_MTRR being not set, or
> because the feature has been disabled e.g. by a hypervisor).
> 
> The main use cases are Xen PV guests and SEV-SNP guests running under
> Hyper-V.
> 
> Instead of trying to work around all the issues by adding if statements
> here and there, just try to use the complete available infrastructure
> by setting up a read-only MTRR state when needed.
> 
> In the Xen PV case the current MTRR MSR values can be read from the
> hypervisor, while for the SEV-SNP case all that is needed is to set the
> default caching mode to "WB".
> 
> I have added more cleanup which has been discussed when looking into
> the most recent failures.
> 
> Note that I couldn't test the Hyper-V related change (patch 3).
> 
> Running on bare metal and with Xen didn't show any problems with the
> series applied.
> 
> It should be noted that patches 9+10 are replacing today's way to
> lookup the MTRR cache type for a memory region from looking at the
> MTRR register values to building a memory map with the cache types.
> This should make the lookup much faster and much easier to understand.
> 
> Changes in V2:
> - replaced former patches 1+2 with new patches 1-4, avoiding especially
>   the rather hacky approach of V1, while making all the MTRR type
>   conflict tests available for the Xen PV case
> - updated patch 6 (was patch 4 in V1)
> 
> Changes in V3:
> - dropped patch 5 of V2, as already applied
> - split patch 1 of V2 into 2 patches
> - new patches 6-10
> - addressed comments
> 
> Changes in V4:
> - addressed comments
> 
> Changes in V5
> - addressed comments
> - some other small fixes
> - new patches 3, 8 and 15
> 
> Changes in V6:
> - patch 1 replaces patches 1+2 of V5
> - new patches 8+12
> - addressed comments
> 
> Juergen Gross (16):
>   x86/mtrr: remove physical address size calculation
>   x86/mtrr: replace some constants with defines
>   x86/mtrr: support setting MTRR state for software defined MTRRs
>   x86/hyperv: set MTRR state when running as SEV-SNP Hyper-V guest
>   x86/xen: set MTRR state when running as Xen PV initial domain
>   x86/mtrr: replace vendor tests in MTRR code
>   x86/mtrr: have only one set_mtrr() variant
>   x86/mtrr: move 32-bit code from mtrr.c to legacy.c
>   x86/mtrr: allocate mtrr_value array dynamically
>   x86/mtrr: add get_effective_type() service function
>   x86/mtrr: construct a memory map with cache modes
>   x86/mtrr: add mtrr=debug command line option
>   x86/mtrr: use new cache_map in mtrr_type_lookup()
>   x86/mtrr: don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
>   x86/mm: only check uniform after calling mtrr_type_lookup()
>   x86/mtrr: remove unused code
> 
>  .../admin-guide/kernel-parameters.txt |   4 +
>  arch/x86/hyperv/ivm.c |   4 +
>  arch/x86/include/asm/mtrr.h   |  43 +-
>  arch/x86/include/uapi/asm/mtrr.h  |   6 +-
>  arch/x86/kernel/cpu/mtrr/Makefile |   2 +-
>  arch/x86/kernel/cpu/mtrr/amd.c|   2 +-
>  arch/x86/kernel/cpu/mtrr/centaur.c|  11 +-
>  arch/x86/kernel/cpu/mtrr/cleanup.c|  22 +-
>  arch/x86/kernel/cpu/mtrr/cyrix.c  |   2 +-
>  arch/x86/kernel/cpu/mtrr/generic.c| 677 --
>  arch/x86/kernel/cpu/mtrr/legacy.c |  90 +++
>  arch/x86/kernel/cpu/mtrr/mtrr.c   | 195 ++---
>  arch/x86/kernel/cpu/mtrr/mtrr.h   |  18 +-
>  arch/x86/kernel/setup.c   |   2 +
>  arch/x86/mm/pgtable.c |  24 +-
>  arch/x86/xen/enlighten_pv.c   |  52 ++
>  16 files changed, 721 insertions(+), 433 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/mtrr/legacy.c
> 
> --
> 2.35.3

I've tested the full v6 series in a normal Hyper-V guest and in an SEV-SNP 
guest.

In the SNP guest, the page attributes in /sys/kernel/debug/x86/pat_memtype_list
are "write-back" in the expected cases.  The "mtrr" x86 feature no longer appears
in the "flags" output of "lscpu" or /proc/cpuinfo.  /proc/mtrr does not exist,
again as expected.

In a normal VM, the "mtrr" x86 feature appears in the flags, and /proc/mtrr
shows expected values.  The boot option mtrr=debug works as expected.

Tested-by: Michael Kelley 



RE: [patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup

2023-04-27 Thread Michael Kelley (LINUX)
allel bring-up because most of the SNP-ness is
hidden in the paravisor.  I was glad to see this work properly.

There's not much difference in performance with and without parallel
bring-up on the 32 vCPU VM.   Without parallel, the time is about 26
milliseconds.  With parallel, it's about 24 ms.   So bring-up is already
fast in the virtual environment.

The cosmetic issue is in the dmesg log, and arises because Hyper-V
enumerates SMT CPUs differently from many other environments.  In
a Hyper-V guest, the SMT threads in a core are numbered as 
pairs.  Guest CPUs #0 & #1 are SMT threads in a core, as are #2 & #3, etc.  With
parallel bring-up, here's the dmesg output:

[0.444345] smp: Bringing up secondary CPUs ...
[0.445139]  node  #0, CPUs:#2  #4  #6  #8 #10 #12 #14 #16 #18 #20 #22 #24 #26 #28 #30
[0.454112] x86: Booting SMP configuration:
[0.456035]   #1  #3  #5  #7  #9 #11 #13 #15 #17 #19 #21 #23 #25 #27 #29 #31
[0.466120] smp: Brought up 1 node, 32 CPUs
[0.467036] smpboot: Max logical packages: 1
[0.468035] smpboot: Total of 32 processors activated (153240.06 BogoMIPS)

The function announce_cpu() is specifically testing for CPU #1 to output the
"Booting SMP configuration" message.  In a Hyper-V guest, CPU #1 is the second
SMT thread in a core, so it isn't started until all the even-numbered CPUs are
started.

I don't know if this cosmetic issue is worth fixing, but I thought I'd point it out.

In any case,

Tested-by: Michael Kelley 



RE: [PATCH v4 00/12] x86/mtrr: fix handling with PAT but without MTRR

2023-03-07 Thread Michael Kelley (LINUX)
From: Juergen Gross  Sent: Monday, March 6, 2023 8:34 AM
> 
> This series tries to fix the rather special case of PAT being available
> without having MTRRs (either due to CONFIG_MTRR being not set, or
> because the feature has been disabled e.g. by a hypervisor).
> 
> The main use cases are Xen PV guests and SEV-SNP guests running under
> Hyper-V.
> 
> Instead of trying to work around all the issues by adding if statements
> here and there, just try to use the complete available infrastructure
> by setting up a read-only MTRR state when needed.
> 
> In the Xen PV case the current MTRR MSR values can be read from the
> hypervisor, while for the SEV-SNP case all that is needed is to set the
> default caching mode to "WB".
> 
> I have added more cleanup which has been discussed when looking into
> the most recent failures.
> 
> Note that I couldn't test the Hyper-V related change (patch 3).
> 
> Running on bare metal and with Xen didn't show any problems with the
> series applied.
> 
> It should be noted that patches 9+10 are replacing today's way to
> lookup the MTRR cache type for a memory region from looking at the
> MTRR register values to building a memory map with the cache types.
> This should make the lookup much faster and much easier to understand.
> 
> Changes in V2:
> - replaced former patches 1+2 with new patches 1-4, avoiding especially
>   the rather hacky approach of V1, while making all the MTRR type
>   conflict tests available for the Xen PV case
> - updated patch 6 (was patch 4 in V1)
> 
> Changes in V3:
> - dropped patch 5 of V2, as already applied
> - split patch 1 of V2 into 2 patches
> - new patches 6-10
> - addressed comments
> 
> Changes in V4:
> - addressed comments
> 
> Juergen Gross (12):
>   x86/mtrr: split off physical address size calculation
>   x86/mtrr: optimize mtrr_calc_physbits()
>   x86/mtrr: support setting MTRR state for software defined MTRRs
>   x86/hyperv: set MTRR state when running as SEV-SNP Hyper-V guest
>   x86/xen: set MTRR state when running as Xen PV initial domain
>   x86/mtrr: replace vendor tests in MTRR code
>   x86/mtrr: allocate mtrr_value array dynamically
>   x86/mtrr: add get_effective_type() service function
>   x86/mtrr: construct a memory map with cache modes
>   x86/mtrr: use new cache_map in mtrr_type_lookup()
>   x86/mtrr: don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
>   x86/mm: only check uniform after calling mtrr_type_lookup()
> 
>  arch/x86/include/asm/mtrr.h|  15 +-
>  arch/x86/include/uapi/asm/mtrr.h   |   6 +-
>  arch/x86/kernel/cpu/mshyperv.c |   4 +
>  arch/x86/kernel/cpu/mtrr/amd.c |   2 +-
>  arch/x86/kernel/cpu/mtrr/centaur.c |   2 +-
>  arch/x86/kernel/cpu/mtrr/cleanup.c |   4 +-
>  arch/x86/kernel/cpu/mtrr/cyrix.c   |   2 +-
>  arch/x86/kernel/cpu/mtrr/generic.c | 492 ++---
>  arch/x86/kernel/cpu/mtrr/mtrr.c|  94 +++---
>  arch/x86/kernel/cpu/mtrr/mtrr.h|   7 +-
>  arch/x86/kernel/setup.c|   2 +
>  arch/x86/mm/pgtable.c  |  24 +-
>  arch/x86/xen/enlighten_pv.c|  52 +++
>  13 files changed, 454 insertions(+), 252 deletions(-)
> 
> --
> 2.35.3

I've tested a Linux 6.2 kernel plus this series in a normal Hyper-V
guest and in a Hyper-V guest using SEV-SNP with vTOM.  MMIO
memory is correctly mapped as WB or UC- depending on the
request, which fixes the original problem introduced for Hyper-V
by the Xen-specific change.

Tested-by: Michael Kelley 



RE: [PATCH 3/7] hv: simplify sysctl registration

2023-03-02 Thread Michael Kelley (LINUX)
From: Luis Chamberlain  On Behalf Of Luis Chamberlain  Sent: Thursday, March 2, 2023 12:46 PM
>
> register_sysctl_table() is a deprecated compatibility wrapper.
> register_sysctl() can do the directory creation for you so just use
> that.
> 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/hv/vmbus_drv.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index d24dd65b33d4..229353f1e9c2 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1460,15 +1460,6 @@ static struct ctl_table hv_ctl_table[] = {
>   {}
>  };
> 
> -static struct ctl_table hv_root_table[] = {
> - {
> - .procname   = "kernel",
> - .mode   = 0555,
> - .child  = hv_ctl_table
> - },
> - {}
> -};
> -
>  /*
>   * vmbus_bus_init -Main vmbus driver initialization routine.
>   *
> @@ -1547,7 +1538,7 @@ static int vmbus_bus_init(void)
>* message recording won't be available in isolated
>* guests should the following registration fail.
>*/
> - hv_ctl_table_hdr = register_sysctl_table(hv_root_table);
> + hv_ctl_table_hdr = register_sysctl("kernel", hv_ctl_table);
>   if (!hv_ctl_table_hdr)
>   pr_err("Hyper-V: sysctl table register error");
> 
> --
> 2.39.1

Reviewed-by: Michael Kelley 



Problem with pat_enabled() and commit 72cbc8f04fe2

2023-01-09 Thread Michael Kelley (LINUX)
I've come across a case with a VM running on Hyper-V that doesn't get
MTRRs, but the PAT is functional.  (This is a Confidential VM using
AMD's SEV-SNP encryption technology with the vTOM option.)  In this
case, the changes in commit 72cbc8f04fe2 ("x86/PAT: Have pat_enabled()
properly reflect state when running on Xen") apply.   pat_enabled() returns
"true", but the MTRRs are not enabled.

But with this commit, there's a problem.  Consider memremap() on a RAM
region, called with MEMREMAP_WB plus MEMREMAP_DEC as the 3rd
argument. Because of the request for a decrypted mapping,
arch_memremap_can_ram_remap() returns false, and a new mapping
must be created, which is appropriate.
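
Concretely, the mapping request is of this form (a minimal sketch; the
physical range and length are placeholders):

    /* decrypted write-back mapping of a RAM range */
    void *va = memremap(phys_addr, len, MEMREMAP_WB | MEMREMAP_DEC);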

The following call stack results:

  memremap()
  arch_memremap_wb()
  ioremap_cache()
  __ioremap_caller()
  memtype_reserve()  <--- pcm is _PAGE_CACHE_MODE_WB
  pat_x_mtrr_type()  <-- only called after commit 72cbc8f04fe2

pat_x_mtrr_type() returns _PAGE_CACHE_MODE_UC_MINUS because
mtrr_type_lookup() fails.  As a result, memremap() erroneously creates the
new mapping as uncached.   This uncached mapping is causing a significant
performance problem in certain Hyper-V Confidential VM configurations.

Any thoughts on resolving this?  Should memtype_reserve() be checking
both pat_enabled() *and* whether MTRRs are enabled before calling
pat_x_mtrr_type()?  Or does that defeat the purpose of commit
72cbc8f04fe2 in the Xen environment?
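
A sketch of the check being asked about (illustrative only; mtrrs_enabled()
is a hypothetical predicate standing in for whatever query the MTRR code
could export):

    if (pat_enabled() && mtrrs_enabled())
            actual_type = pat_x_mtrr_type(start, end, req_type);
    else
            actual_type = req_type;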

I'm also looking at how to avoid this combination in a Hyper-V Confidential
VM, but that doesn't address the underlying flaw.

Michael



RE: [PATCH v3 7/8] genirq: Return a const cpumask from irq_data_get_affinity_mask

2022-07-03 Thread Michael Kelley (LINUX)
From: Samuel Holland  Sent: Friday, July 1, 2022 1:01 PM
> 
> Now that the irq_data_update_affinity helper exists, enforce its use
> by returning a a const cpumask from irq_data_get_affinity_mask.

Nit: duplicate word "a"

> 
> Since the previous commit already updated places that needed to call
> irq_data_update_affinity, this commit updates the remaining code that
> either did not modify the cpumask or immediately passed the modified
> mask to irq_set_affinity.
> 
> Signed-off-by: Samuel Holland 
> ---
> 
> Changes in v3:
>  - New patch to make the returned cpumasks const
> 
>  arch/mips/cavium-octeon/octeon-irq.c |  4 ++--
>  arch/sh/kernel/irq.c |  7 ---
>  arch/x86/hyperv/irqdomain.c  |  2 +-
>  arch/xtensa/kernel/irq.c |  7 ---
>  drivers/iommu/hyperv-iommu.c |  2 +-
>  drivers/pci/controller/pci-hyperv.c  | 10 +-
>  include/linux/irq.h  | 12 +++-
>  kernel/irq/chip.c|  8 +---
>  kernel/irq/debugfs.c |  2 +-
>  kernel/irq/ipi.c | 16 +---
>  10 files changed, 39 insertions(+), 31 deletions(-)
> 

[snip]

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 7e0f6bedc248..42c70d28ef27 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -192,7 +192,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>   struct pci_dev *dev;
>   struct hv_interrupt_entry out_entry, *stored_entry;
>   struct irq_cfg *cfg = irqd_cfg(data);
> - cpumask_t *affinity;
> + const cpumask_t *affinity;
>   int cpu;
>   u64 status;
> 

[snip]

> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index e285a220c913..51bd66a45a11 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -194,7 +194,7 @@ hyperv_root_ir_compose_msi_msg(struct irq_data *irq_data,
> struct msi_msg *msg)
>   u32 vector;
>   struct irq_cfg *cfg;
>   int ioapic_id;
> - struct cpumask *affinity;
> + const struct cpumask *affinity;
>   int cpu;
>   struct hv_interrupt_entry entry;
>   struct hyperv_root_ir_data *data = irq_data->chip_data;
> diff --git a/drivers/pci/controller/pci-hyperv.c 
> b/drivers/pci/controller/pci-hyperv.c
> index db814f7b93ba..aebada45569b 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -642,7 +642,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
>   struct hv_retarget_device_interrupt *params;
>   struct tran_int_desc *int_desc;
>   struct hv_pcibus_device *hbus;
> - struct cpumask *dest;
> + const struct cpumask *dest;
>   cpumask_var_t tmp;
>   struct pci_bus *pbus;
>   struct pci_dev *pdev;
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct
> pci_response *resp,
>  }
> 
>  static u32 hv_compose_msi_req_v1(
> - struct pci_create_interrupt *int_pkt, struct cpumask *affinity,
> + struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
>   u32 slot, u8 vector, u8 vector_count)
>  {
>   int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1641,7 +1641,7 @@ static int hv_compose_msi_req_get_cpu(struct cpumask
> *affinity)
>  }
> 
>  static u32 hv_compose_msi_req_v2(
> - struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity,
> + struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity,
>   u32 slot, u8 vector, u8 vector_count)
>  {
>   int cpu;
> @@ -1660,7 +1660,7 @@ static u32 hv_compose_msi_req_v2(
>  }
> 
>  static u32 hv_compose_msi_req_v3(
> - struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity,
> + struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity,
>   u32 slot, u32 vector, u8 vector_count)
>  {
>   int cpu;
> @@ -1697,7 +1697,7 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>   struct hv_pci_dev *hpdev;
>   struct pci_bus *pbus;
>   struct pci_dev *pdev;
> -     struct cpumask *dest;
> + const struct cpumask *dest;
>   struct compose_comp_ctxt comp;
>   struct tran_int_desc *int_desc;
>   struct msi_desc *msi_desc;

For these files with Hyper-V related changes:
arch/x86/hyperv/irqdomain.c
drivers/iommu/hyperv-iommu.c
drivers/pci/controller/pci-hyperv.c

Reviewed-by: Michael Kelley 



RE: [PATCH 16/30] drivers/hv/vmbus, video/hyperv_fb: Untangle and refactor Hyper-V panic notifiers

2022-05-03 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Friday, April 29, 2022 3:35 PM
> 
> Hi Michael, first of all thanks for the great review, much appreciated.
> Some comments inline below:
> 
> On 29/04/2022 14:16, Michael Kelley (LINUX) wrote:
> > [...]
> >> hypervisor I/O completion), so we postpone that to run late. But more
> >> relevant: this *same* vmbus unloading happens in the crash_shutdown()
> >> handler, so if kdump is set, we can safely skip this panic notifier and
> >> defer such clean-up to the kexec crash handler.
> >
> > While the last sentence is true for Hyper-V on x86/x64, it's not true for
> > Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
> > with the ability to provide a custom crash_shutdown() function, which
> > Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
> > has no mechanism to provide such a custom function that will eventually
> > do the needed vmbus_initiate_unload() before running kdump.
> >
> > I'm not immediately sure what the best solution is for ARM64.  At this
> > point, I'm just pointing out the problem and will think about the tradeoffs
> > for various possible solutions.  Please do the same yourself. :-)
> >
> 
> Oh, you're totally right! I just assumed ARM64 would be the same, my
> bad. Just to propose some alternatives, so you/others can also discuss
> here and we can reach a consensus about the trade-offs:
> 
> (a) We could forget about this change, and always do the clean-up here,
> not relying on machine_crash_shutdown().
> Pro: really simple, behaves the same as it is doing currently.
> Con: less elegant/concise, doesn't allow arm64 customization.
> 
> (b) Add a way to allow ARM64 customization of shutdown crash handler.
> Pro: matches x86, more customizable, improves arm64 arch code.
> Con: A tad more complex.
> 
> Also, a question that came up: if ARM64 has no way of calling a special
> crash shutdown handler, how can you execute hv_stimer_cleanup() and
> hv_synic_disable_regs() there? Or are they not required in ARM64?
> 

My suggestion is to do (a) for now.  I suspect (b) could be a more
extended discussion and I wouldn't want your patch set to get held
up on that discussion.  I don't know what the sense of the ARM64
maintainers would be toward (b).  They have tried to avoid picking
up code warts like have accumulated on the x86/x64 side over the
years, and I agree with that effort.  But as more and varied
hypervisors become available for ARM64, it seems like a framework
for supporting a custom shutdown handler may become necessary.
But that could take a little time.

You are right about hv_stimer_cleanup() and hv_synic_disable_regs().
We are not running these when a panic occurs on ARM64, and we
should be, though the risk is small.   We will pursue (b) and add
these additional cleanups as part of that.  But again, I would suggest
doing (a) for now, and we will switch back to your solution once
(b) is in place.

> 
> >>
> >> (c) There is also a Hyper-V framebuffer panic notifier, which relies in
> >> doing a vmbus operation that demands a valid connection. So, we must
> >> order this notifier with the panic notifier from vmbus_drv.c, in order to
> >> guarantee that the framebuffer code executes before the vmbus connection
> >> is unloaded.
> >
> > Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
> > notifier list, which means it won't execute before the VMbus connection
> > unload in the case of kdump.   This notifier is making sure that Hyper-V
> > is notified about the last updates made to the frame buffer before the
> > panic, so maybe it needs to be put on the hypervisor notifier list.  It
> > sends a message to Hyper-V over its existing VMbus channel, but it
> > does not wait for a reply.  It does, however, obtain a spin lock on the
> > ring buffer used to communicate with Hyper-V.   Unless someone has
> > a better suggestion, I'm inclined to take the risk of blocking on that
> > spin lock.
> 
> The logic behind that was: when kdump is set, we'd skip the vmbus
> disconnect on notifiers, deferring that to crash_shutdown(), a logic that
> was refuted in the above discussion on ARM64 (one more Pro argument to
> the idea of refactoring aarch64 code to allow a custom crash shutdown
> handler heh). But you're right, for the default level 2, we skip the
> pre_reboot notifiers on kdump, effectively skipping this notifier.
> 
> Some ideas of what we can do here:
> 
> I) we could change the framebuffer notifier to rely on trylocks, instead
> of risking a lockup scenario, and with that, we can execute it before
> the vmbus disconnect in the hypervisor list;
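
Idea (I) would amount to something like this sketch in the framebuffer panic
notifier (illustrative only; the lock and field names are placeholders):

    if (!spin_trylock_irqsave(&ring_info->ring_lock, flags))
            return NOTIFY_DONE;   /* skip the final flush rather than hang */
    /* ... send the last framebuffer update to the host ... */
    spin_unlock_irqrestore(&ring_info->ring_lock, flags);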

I think we have to d

RE: [PATCH 19/30] panic: Add the panic hypervisor notifier list

2022-05-03 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Friday, April 29, 2022 11:04 AM
> 
> On 29/04/2022 14:30, Michael Kelley (LINUX) wrote:
> > From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 2022 3:49 PM
> >> [...]
> >>
> >> @@ -2843,7 +2843,7 @@ static void __exit vmbus_exit(void)
> >>if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
> >>kmsg_dump_unregister(_kmsg_dumper);
> >>unregister_die_notifier(_die_report_block);
> >> -  atomic_notifier_chain_unregister(_notifier_list,
> >> +  atomic_notifier_chain_unregister(_hypervisor_list,
> >>_panic_report_block);
> >>}
> >>
> >
> > Using the hypervisor_list here produces a bit of a mismatch.  In many cases
> > this notifier will do nothing, and will defer to the kmsg_dump() mechanism
> > to notify the hypervisor about the panic.   Running the kmsg_dump()
> > mechanism is linked to the info_list, so I'm thinking the Hyper-V panic 
> > report
> > notifier should be on the info_list as well.  That way the reporting 
> > behavior
> > is triggered at the same point in the panic path regardless of which
> > reporting mechanism is used.
> >
> 
> Hi Michael, thanks for your feedback! I agree that your idea could work,
> but...there is one downside: imagine the kmsg_dump() approach is not set
> in some Hyper-V guest, then we would rely in the regular notification
> mechanism [hv_die_panic_notify_crash()], right?
> But...you want then to run this notifier in the informational list,
> which...won't execute *by default* before kdump if no kmsg_dump() is
> set. So, this logic is convoluted when you mix it with the default level
> concept + kdump.

Yes, you are right.  But to me that speaks as much to the linkage
between the informational list and kmsg_dump() being the core
problem.  But as I described in my reply to Patch 24, I can live with
the linkage as-is.

FWIW, guests on newer versions of Hyper-V will always register a
kmsg dumper.  The flags that are tested to decide whether to
register provide compatibility with older versions of Hyper-V that 
don’t support the 4K bytes of notification info.

> 
> May I suggest something? If possible, take a run with this patch set +
> DEBUG_NOTIFIER=y, in *both* cases (with and without the kmsg_dump()
> set). I did that and they run almost at the same time...I've checked the
> notifiers called, it's like almost nothing runs in-between.
> 
> I feel the panic notification mechanism does really fit with a
> hypervisor list, it's a good match with the nature of the list, which
> aims at informing the panic notification to the hypervisor/FW.
> Of course we can modify it if you prefer...but please take into account
> the kdump case and how it complicates the logic.

I agree that the runtime effect of one list vs. the other is nil.  The
code works and can stay as you written it.

I was trying to align from a conceptual standpoint.  It was a bit
unexpected that one path would be on the hypervisor list, and the
other path effectively on the informational list.  When I see
conceptual mismatches like that, I tend to want to understand why,
and if there is something more fundamental that is out-of-whack.


> 
> Let me know your considerations, in case you can experiment with the
> patch set as-is.
> Cheers,
> 
> 
> Guilherme


RE: [PATCH 24/30] panic: Refactor the panic path

2022-05-03 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Friday, April 29, 2022 1:38 PM
> 
> On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> > From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 2022 3:49 PM
> >> [...]
> >> +  panic_notifiers_level=
> >> +  [KNL] Set the panic notifiers execution order.
> >> +  Format: 
> >> +  We currently have 4 lists of panic notifiers; based
> >> +  on the functionality and risk (for panic success) the
> >> +  callbacks are added in a given list. The lists are:
> >> +  - hypervisor/FW notification list (low risk);
> >> +  - informational list (low/medium risk);
> >> +  - pre_reboot list (higher risk);
> >> +  - post_reboot list (only run late in panic and after
> >> +  kdump, not configurable for now).
> >> +  This parameter defines the ordering of the first 3
> >> +  lists with regards to kdump; the levels determine
> >> +  which set of notifiers execute before kdump. The
> >> +  accepted levels are:
> >> +  0: kdump is the first thing to run, NO list is
> >> +  executed before kdump.
> >> +  1: only the hypervisor list is executed before kdump.
> >> +  2 (default level): the hypervisor list and (*if*
> >> +  there's any kmsg_dumper defined) the informational
> >> +  list are executed before kdump.
> >> +  3: both the hypervisor and the informational lists
> >> +  (always) execute before kdump.
> >
> > I'm not clear on why level 2 exists.  What is the scenario where
> > execution of the info list before kdump should be conditional on the
> > existence of a kmsg_dumper?   Maybe the scenario is described
> > somewhere in the patch set and I just missed it.
> >
> 
> Hi Michael, thanks for your review/consideration. So, this idea started
> some time ago. It all started with a need to expose more
> information in the kernel log *before* kdump and *before* pstore -
> specifically, we're talking about panic_print. But this caused some
> reactions; Baoquan was very concerned with that [0]. Soon after, I
> proposed a panic notifiers filter (orthogonal) approach, to which Petr
> suggested instead doing a major refactor [1] - it is finally alive in
> the form of this series.
> 
> The theory behind level 2 is to allow a kdump scenario with the
> minimum number of notifiers - what is the point in printing more
> information if the user doesn't care, since it's going to kdump? Now, if
> there is a kmsg dumper, it means that there is likely some interest in
> collecting information, and that might as well be required before the
> potential kdump (which is my case, hence the proposal on [0]).
> 
> Instead of forcing one of the two behaviors (level 1 or level 3), we
> have a middle ground/compromise: if there's interest in collecting such
> data (in the form of a kmsg dumper), we then execute the informational
> notifiers before kdump. If not, why increase (even slightly) the risk
> for kdump?
> 
> I'm OK with removing level 2 if people prefer, but I don't feel it's a
> burden - quite the opposite - it seems a good way to accommodate the
> somewhat antagonistic ideas (jump to kdump ASAP vs. collecting more
> info in the panicked kernel log).
> 
> [0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/
> 
> [1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 

To me, it's a weak correlation between having a kmsg dumper, and
wanting or not wanting the info level output to come before kdump.
Hyper-V is one of only a few places that register a kmsg dumper, so most
Linux instances outside of Hyper-V guests (and PowerPC systems?) will have
the info level output after kdump.  It seems like anyone who cared strongly
about the info level output would set panic_notifiers_level to 1 or to 3
so that the result is more deterministic.  But that's just my opinion, and
it's probably an opinion that is not as well informed on the topic as some
others in the discussion. So keeping things as in your patch set is not a
show-stopper for me.

However, I would request a clarification in the documentation.  The
panic_notifiers_level affects not only the hypervisor, informational,
and pre_reboot lists, but it also affects panic_print_sys_info() and
kmsg_dump().  Specifically, at level 1, panic_print_sys_info() and
kmsg_dump() will not be run before kdump.  At level 3, th
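
To make the level semantics concrete, here is a minimal sketch of the
decision being discussed (the helper names are illustrative, not taken
from the patch set):

static bool run_info_list_before_kdump(unsigned int level)
{
	switch (level) {
	case 0:			/* nothing runs before kdump */
	case 1:			/* only the hypervisor list runs before kdump */
		return false;
	case 2:			/* info list only if someone will consume the log */
		return have_kmsg_dumper();	/* hypothetical helper */
	default:		/* level 3: hypervisor and info lists always run */
		return true;
	}
}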

RE: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may badly affect kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
> 
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
> 
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
> 
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
> 
> - the hypervisor list aims at low-risk procedures to inform hypervisors
> or firmware about the panic event; it also includes LED-related functions;
> 
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
> 
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
> 
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
> 
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here the concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
> 
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
> 
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
> 
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
> 
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 
> Suggested-by: Petr Mladek 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> Special thanks to Petr and Baoquan for the suggestion and feedback in a
> previous email thread. There are some important design decisions worth
> mentioning and discussing:
> 
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but would it be worthwhile to have a Kconfig option to set
> the default level? It would help distros that want the old behavior
> (no notifiers before kdump) as the default.
> 
> * The implementation choice was to _avoid_ intricate if conditionals in the
> panic path, which would _definitely_ be present with the panic notifiers
> levels idea; so, instead of lots of if conditionals, the set/clear bits
> approach with functions called at 2 points (but executing in only one of
> them) is much easier to follow and was used here; the ordering helper
> function and the comments also help a lot to avoid confusion (hopefully).
> (See the sketch after this list.)
> 
> * Choice was to *always* use crash_smp_send_stop() instead of sometimes making
> use of the 
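
As a rough illustration of the "set/clear bits" approach mentioned in the
design notes above (all names and values here are illustrative, not taken
from the series): each list gets a bit, the chosen level determines which
bits are enabled at the call site before kdump, and a bit is cleared once
its list has run, so the same helper can be called again after kdump
without re-running anything.

#define PN_HYPERVISOR	BIT(0)
#define PN_INFO		BIT(1)
#define PN_PRE_REBOOT	BIT(2)

static unsigned int panic_notifiers_bits =
	PN_HYPERVISOR | PN_INFO | PN_PRE_REBOOT;

static void panic_notifiers_run(unsigned int allowed, const char *buf)
{
	if (panic_notifiers_bits & allowed & PN_HYPERVISOR) {
		panic_notifiers_bits &= ~PN_HYPERVISOR;
		atomic_notifier_call_chain(&panic_hypervisor_list, 0, (void *)buf);
	}
	/* ...the informational and pre_reboot lists follow the same pattern... */
}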

RE: [PATCH 19/30] panic: Add the panic hypervisor notifier list

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> The goal of this new panic notifier is to allow its users to register
> callbacks to run very early in the panic path. This aims at hypervisor/FW
> notification mechanisms as well as simple LED functions, and any other
> simple and safe mechanism that should run early in the panic path; more
> dangerous callbacks should execute later.
> 
> For now, the patch is almost a no-op (although it changes a bit the
> ordering in which some panic notifiers are executed). In a subsequent
> patch, the panic path will be refactored, then the panic hypervisor
> notifiers will effectively run very early in the panic path.
> 
> We also defer documenting it all properly in the subsequent refactor
> patch. While at it, we removed some useless header inclusions and
> fixed some notifiers' return values too (by using the standard NOTIFY_DONE).
> 
> Cc: Alexander Gordeev 
> Cc: Andrea Parri (Microsoft) 
> Cc: Ard Biesheuvel 
> Cc: Benjamin Herrenschmidt 
> Cc: Brian Norris 
> Cc: Christian Borntraeger 
> Cc: Christophe JAILLET 
> Cc: David Gow 
> Cc: "David S. Miller" 
> Cc: Dexuan Cui 
> Cc: Doug Berger 
> Cc: Evan Green 
> Cc: Florian Fainelli 
> Cc: Haiyang Zhang 
> Cc: Hari Bathini 
> Cc: Heiko Carstens 
> Cc: Julius Werner 
> Cc: Justin Chen 
> Cc: "K. Y. Srinivasan" 
> Cc: Lee Jones 
> Cc: Markus Mayer 
> Cc: Michael Ellerman 
> Cc: Michael Kelley 
> Cc: Mihai Carabas 
> Cc: Nicholas Piggin 
> Cc: Paul Mackerras 
> Cc: Pavel Machek 
> Cc: Scott Branden 
> Cc: Sebastian Reichel 
> Cc: Shile Zhang 
> Cc: Stephen Hemminger 
> Cc: Sven Schnelle 
> Cc: Thomas Bogendoerfer 
> Cc: Tianyu Lan 
> Cc: Vasily Gorbik 
> Cc: Wang ShaoBo 
> Cc: Wei Liu 
> Cc: zhenwei pi 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/mips/sgi-ip22/ip22-reset.c  | 2 +-
>  arch/mips/sgi-ip32/ip32-reset.c  | 3 +--
>  arch/powerpc/kernel/setup-common.c   | 2 +-
>  arch/sparc/kernel/sstate.c   | 3 +--
>  drivers/firmware/google/gsmi.c   | 4 ++--
>  drivers/hv/vmbus_drv.c   | 4 ++--
>  drivers/leds/trigger/ledtrig-activity.c  | 4 ++--
>  drivers/leds/trigger/ledtrig-heartbeat.c | 4 ++--
>  drivers/misc/bcm-vk/bcm_vk_dev.c | 6 +++---
>  drivers/misc/pvpanic/pvpanic.c   | 4 ++--
>  drivers/power/reset/ltc2952-poweroff.c   | 4 ++--
>  drivers/s390/char/zcore.c| 5 +++--
>  drivers/soc/bcm/brcmstb/pm/pm-arm.c  | 2 +-
>  include/linux/panic_notifier.h   | 1 +
>  kernel/panic.c   | 4 
>  15 files changed, 28 insertions(+), 24 deletions(-)

[snip]

> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index f37f12d48001..901b97034308 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1614,7 +1614,7 @@ static int vmbus_bus_init(void)
>   hv_kmsg_dump_register();
> 
>   register_die_notifier(&hyperv_die_report_block);
> - atomic_notifier_chain_register(&panic_notifier_list,
> + atomic_notifier_chain_register(&panic_hypervisor_list,
>   &hyperv_panic_report_block);
>   }
> 
> @@ -2843,7 +2843,7 @@ static void __exit vmbus_exit(void)
>   if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
>   kmsg_dump_unregister(&hv_kmsg_dumper);
>   unregister_die_notifier(&hyperv_die_report_block);
> - atomic_notifier_chain_unregister(&panic_notifier_list,
> + atomic_notifier_chain_unregister(&panic_hypervisor_list,
>   &hyperv_panic_report_block);
>   }
> 

Using the hypervisor_list here produces a bit of a mismatch.  In many cases
this notifier will do nothing, and will defer to the kmsg_dump() mechanism
to notify the hypervisor about the panic.   Running the kmsg_dump()
mechanism is linked to the info_list, so I'm thinking the Hyper-V panic report
notifier should be on the info_list as well.  That way the reporting behavior
is triggered at the same point in the panic path regardless of which
reporting mechanism is used.






RE: [PATCH 16/30] drivers/hv/vmbus, video/hyperv_fb: Untangle and refactor Hyper-V panic notifiers

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> Currently Hyper-V guests are among the most relevant users of the panic
> infrastructure, like panic notifiers, kmsg dumpers, etc. The reasons lie
> both in clean-up procedures (closing a hypervisor <-> guest connection,
> disabling a paravirtualized timer) and in data collection (sending
> panic information to the hypervisor) and framebuffer management.
> 
> The thing is: some notifiers are related to others, ordering matters, some
> functionalities are duplicated and there are lots of conditionals behind
> sending panic information to the hypervisor. This patch, as part of an
> effort to clean-up the panic notifiers mechanism and better document
> things, addresses some of the issues/complexities of Hyper-V panic handling
> through the following changes:
> 
> (a) We have die and panic notifiers on vmbus_drv.c and both have goals of
> sending panic information to the hypervisor, though the panic notifier is
> also responsible for a cleaning-up procedure.
> 
> This commit cleans up the code by splitting the panic notifier in two: one
> for closing the vmbus connection, whereas the other is only for sending
> panic info to the hypervisor. With that, it was possible to merge the die and
> panic notifiers into a single/well-documented function, and to clear up some
> conditional complexities around sending such information to the hypervisor.
> 
> (b) The new panic notifier created after (a) is only doing a single thing:
> cleaning the vmbus connection. This procedure might cause a delay (due to
> hypervisor I/O completion), so we postpone that to run late. But more
> relevant: this *same* vmbus unloading happens in the crash_shutdown()
> handler, so if kdump is set, we can safely skip this panic notifier and
> defer such clean-up to the kexec crash handler.

While the last sentence is true for Hyper-V on x86/x64, it's not true for
Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
with the ability to provide a custom crash_shutdown() function, which
Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
has no mechanism to provide such a custom function that will eventually
do the needed vmbus_initiate_unload() before running kdump.

I'm not immediately sure what the best solution is for ARM64.  At this
point, I'm just pointing out the problem and will think about the tradeoffs
for various possible solutions.  Please do the same yourself. :-)
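
For reference, the skip described in (b) amounts to something like the
following sketch (the function name and the use of kexec_crash_loaded()
are illustrative, not a statement of what the patch actually does); the
ARM64 gap is that nothing equivalent to hv_machine_crash_shutdown() does
the unload when the notifier bails out early:

static int hv_panic_vmbus_unload(struct notifier_block *nb,
				 unsigned long val, void *args)
{
	/*
	 * If a crash kernel is loaded, defer the unload to the kexec
	 * crash path (hv_machine_crash_shutdown() on x86/x64).
	 */
	if (kexec_crash_loaded())
		return NOTIFY_DONE;

	vmbus_initiate_unload(true);
	return NOTIFY_DONE;
}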

> 
> (c) There is also a Hyper-V framebuffer panic notifier, which relies on
> doing a vmbus operation that demands a valid connection. So, we must
> order this notifier with the panic notifier from vmbus_drv.c, in order to
> guarantee that the framebuffer code executes before the vmbus connection
> is unloaded.

Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
notifier list, which means it won't execute before the VMbus connection
unload in the case of kdump.   This notifier is making sure that Hyper-V
is notified about the last updates made to the frame buffer before the
panic, so maybe it needs to be put on the hypervisor notifier list.  It
sends a message to Hyper-V over its existing VMbus channel, but it
does not wait for a reply.  It does, however, obtain a spin lock on the
ring buffer used to communicate with Hyper-V.   Unless someone has
a better suggestion, I'm inclined to take the risk of blocking on that
spin lock.

> 
> Also, this commit removes a useless header.
> 
> Although there is code rework and re-ordering, we expect that this change
> has no functional regressions but instead optimize the path and increase
> panic reliability on Hyper-V. This was tested on Hyper-V with success.
> 
> Fixes: 792f232d57ff ("Drivers: hv: vmbus: Fix potential crash on module 
> unload")
> Fixes: 74347a99e73a ("x86/Hyper-V: Unload vmbus channel in hv panic callback")

The "Fixes:" tags imply that these changes should be backported to older
longterm kernel versions, which I don't think is the case.  There is a
dependency on Patch 14 of your series where PANIC_NOTIFIER is
introduced.

> Cc: Andrea Parri (Microsoft) 
> Cc: Dexuan Cui 
> Cc: Haiyang Zhang 
> Cc: "K. Y. Srinivasan" 
> Cc: Michael Kelley 
> Cc: Stephen Hemminger 
> Cc: Tianyu Lan 
> Cc: Wei Liu 
> Tested-by: Fabio A M Martins 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> Special thanks to Michael Kelley for the good information about the Hyper-V
> panic path in email threads some months ago, and to Fabio for the testing
> performed.
> 
> Michael and all Microsoft folks: a careful analysis to double-check our 
> changes
> and assumptions here is really appreciated, this code is complex and 
> intricate,
> it is possible some corner case might hav

RE: [PATCH 18/30] notifier: Show function names on notifier routines if DEBUG_NOTIFIERS is set

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> Currently we have a debug infrastructure in the notifiers file, but
> it's very simple/limited. This patch extends it by:
> 
> (a) Showing all registered/unregistered notifiers' callback names;
> 
> (b) Adding a dynamic debug tuning to allow showing called notifiers'
> function names. Notice that this should be guarded as a tunable since
> it can flood the kernel log buffer.
> 
> Cc: Arjan van de Ven 
> Cc: Cong Wang 
> Cc: Sebastian Andrzej Siewior 
> Cc: Valentin Schneider 
> Cc: Xiaoming Ni 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> We have some design decisions that worth discussing here:
> 
> (a) First of call, using C99 helps a lot to write clear and concise code, but

s/call/all/

> due to commit 4d94f910e79a ("Kbuild: use -Wdeclaration-after-statement") we
> have a warning if mixing variable declarations with code. For this patch
> though, doing that makes the code way clearer, so the decision was to add
> the debug code inside brackets whenever this warning pops up. We can change
> that, but that'll cause more ifdefs in the same function.
> 
> (b) In the symbol lookup helper function, we modify the parameter passed but,
> even more, we return it as well! This is unusual and seems unnecessary, but
> it was the strategy taken to allow embedding such a function in the
> pr_debug() call.
> 
> Not doing that would likely require 3 symbol_name variables to avoid
> concurrency (registering notifier A while calling notifier B) - we rely on
> local variables as a serialization mechanism.
> 
> We're open to suggestions in case this design is not appropriate;
> thanks in advance!
> 
>  kernel/notifier.c | 48 +--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/notifier.c b/kernel/notifier.c
> index ba005ebf4730..21032ebcde57 100644
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -7,6 +7,22 @@
>  #include 
>  #include 
> 
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +#include 
> +
> +/*
> + *   Helper to get symbol names in case DEBUG_NOTIFIERS is set.
> + *   Returning the modified parameter is a strategy used to achieve
> + *   the pr_debug() functionality - with this, the function is only
> + *   executed if the dynamic debug tuning is effectively set.
> + */
> +static inline char *notifier_name(struct notifier_block *nb, char *sym_name)
> +{
> + lookup_symbol_name((unsigned long)(nb->notifier_call), sym_name);
> + return sym_name;
> +}
> +#endif
> +
>  /*
>   *   Notifier list for kernel code which wants to be called
>   *   at shutdown. This is used to stop any idling DMA operations
> @@ -34,20 +50,41 @@ static int notifier_chain_register(struct notifier_block 
> **nl,
>   }
>   n->next = *nl;
>   rcu_assign_pointer(*nl, n);
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> + {
> + char sym_name[KSYM_NAME_LEN];
> +
> + pr_info("notifiers: registered %s()\n",
> + notifier_name(n, sym_name));
> + }
> +#endif
>   return 0;
>  }
> 
>  static int notifier_chain_unregister(struct notifier_block **nl,
>   struct notifier_block *n)
>  {
> + int ret = -ENOENT;
> +
>   while ((*nl) != NULL) {
>   if ((*nl) == n) {
>   rcu_assign_pointer(*nl, n->next);
> - return 0;
> + ret = 0;
> + break;
>   }
>   nl = &((*nl)->next);
>   }
> - return -ENOENT;
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> + if (!ret) {
> + char sym_name[KSYM_NAME_LEN];
> +
> + pr_info("notifiers: unregistered %s()\n",
> + notifier_name(n, sym_name));
> + }
> +#endif
> + return ret;
>  }
> 
>  /**
> @@ -80,6 +117,13 @@ static int notifier_call_chain(struct notifier_block **nl,
>   nb = next_nb;
>   continue;
>   }
> +
> + {
> + char sym_name[KSYM_NAME_LEN];
> +
> + pr_debug("notifiers: calling %s()\n",
> +  notifier_name(nb, sym_name));
> + }
>  #endif
>   ret = nb->notifier_call(nb, val, v);
> 
> --
> 2.36.0




RE: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> Currently the regular CPU shutdown path for ARM disables IRQs/FIQs
> in the secondary CPUs - smp_send_stop() calls ipi_cpu_stop(), which
> is responsible for that. This makes sense, since we're turning off
> such CPUs, putting them in an endless busy-wait loop.
> 
> Problem is that there is an alternative path for disabling CPUs,
> in the form of function crash_smp_send_stop(), used for kexec/panic
> paths. This functions relies in a SMP call that also triggers a

s/functions relies in/function relies on/

> busy-wait loop [at machine_crash_nonpanic_core()], but *without*
> disabling interrupts. This might lead to odd scenarios, like early
> interrupts in the boot of kexec'd kernel or even interrupts in
> other CPUs while the main one still works in the panic path and
> assumes all secondary CPUs are (really!) off.
> 
> This patch mimics the ipi_cpu_stop() interrupt disable mechanism
> in the crash CPU shutdown path, hence disabling IRQs/FIQs in all
> secondary CPUs in the kexec/panic path as well.
> 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/arm/kernel/machine_kexec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> index f567032a09c0..ef788ee00519 100644
> --- a/arch/arm/kernel/machine_kexec.c
> +++ b/arch/arm/kernel/machine_kexec.c
> @@ -86,6 +86,9 @@ void machine_crash_nonpanic_core(void *unused)
>   set_cpu_online(smp_processor_id(), false);
>   atomic_dec(&waiting_for_crash_ipi);
> 
> + local_fiq_disable();
> + local_irq_disable();
> +
>   while (1) {
>   cpu_relax();
>   wfe();
> --
> 2.36.0




RE: [PATCH 09/15] swiotlb: make the swiotlb_init interface more useful

2022-04-04 Thread Michael Kelley (LINUX)
From: Christoph Hellwig  Sent: Sunday, April 3, 2022 10:06 PM
> 
> Pass a bool to pass if swiotlb needs to be enabled based on the

Wording problems.  I'm not sure what you meant to say.

> addressing needs and replace the verbose argument with a set of
> flags, including one to force enable bounce buffering.
> 
> Note that this patch removes the possibility to force xen-swiotlb
> use using swiotlb=force on the command line on x86 (arm and arm64
> never supported that), but this interface will be restored shortly.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/arm/mm/init.c |  6 +
>  arch/arm64/mm/init.c   |  6 +
>  arch/ia64/mm/init.c|  4 +--
>  arch/mips/cavium-octeon/dma-octeon.c   |  2 +-
>  arch/mips/loongson64/dma.c |  2 +-
>  arch/mips/sibyte/common/dma.c  |  2 +-
>  arch/powerpc/mm/mem.c  |  3 ++-
>  arch/powerpc/platforms/pseries/setup.c |  3 ---
>  arch/riscv/mm/init.c   |  8 +-
>  arch/s390/mm/init.c|  3 +--
>  arch/x86/kernel/pci-dma.c  | 15 ++-
>  drivers/xen/swiotlb-xen.c  |  4 +--
>  include/linux/swiotlb.h| 15 ++-
>  include/trace/events/swiotlb.h | 29 -
>  kernel/dma/swiotlb.c   | 35 ++
>  15 files changed, 55 insertions(+), 82 deletions(-)
> 
> diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
> index fe249ea919083..ce64bdb55a16b 100644
> --- a/arch/arm/mm/init.c
> +++ b/arch/arm/mm/init.c
> @@ -271,11 +271,7 @@ static void __init free_highpages(void)
>  void __init mem_init(void)
>  {
>  #ifdef CONFIG_ARM_LPAE
> - if (swiotlb_force == SWIOTLB_FORCE ||
> - max_pfn > arm_dma_pfn_limit)
> - swiotlb_init(1);
> - else
> - swiotlb_force = SWIOTLB_NO_FORCE;
> + swiotlb_init(max_pfn > arm_dma_pfn_limit, SWIOTLB_VERBOSE);
>  #endif
> 
>   set_max_mapnr(pfn_to_page(max_pfn) - mem_map);
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 8ac25f19084e8..7b6ea4d6733d6 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -398,11 +398,7 @@ void __init bootmem_init(void)
>   */
>  void __init mem_init(void)
>  {
> - if (swiotlb_force == SWIOTLB_FORCE ||
> - max_pfn > PFN_DOWN(arm64_dma_phys_limit))
> - swiotlb_init(1);
> - else if (!xen_swiotlb_detect())
> - swiotlb_force = SWIOTLB_NO_FORCE;
> + swiotlb_init(max_pfn > PFN_DOWN(arm64_dma_phys_limit),
> SWIOTLB_VERBOSE);
> 
>   /* this will put all unused low memory onto the freelists */
>   memblock_free_all();
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 5d165607bf354..3c3e15b22608f 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -437,9 +437,7 @@ mem_init (void)
>   if (iommu_detected)
>   break;
>  #endif
> -#ifdef CONFIG_SWIOTLB
> - swiotlb_init(1);
> -#endif
> + swiotlb_init(true, SWIOTLB_VERBOSE);
>   } while (0);
> 
>  #ifdef CONFIG_FLATMEM
> diff --git a/arch/mips/cavium-octeon/dma-octeon.c 
> b/arch/mips/cavium-octeon/dma-
> octeon.c
> index fb7547e217263..9fbba6a8fa4c5 100644
> --- a/arch/mips/cavium-octeon/dma-octeon.c
> +++ b/arch/mips/cavium-octeon/dma-octeon.c
> @@ -235,5 +235,5 @@ void __init plat_swiotlb_setup(void)
>  #endif
> 
>   swiotlb_adjust_size(swiotlbsize);
> - swiotlb_init(1);
> + swiotlb_init(true, SWIOTLB_VERBOSE);
>  }
> diff --git a/arch/mips/loongson64/dma.c b/arch/mips/loongson64/dma.c
> index 364f2f27c8723..8220a1bc0db64 100644
> --- a/arch/mips/loongson64/dma.c
> +++ b/arch/mips/loongson64/dma.c
> @@ -24,5 +24,5 @@ phys_addr_t dma_to_phys(struct device *dev, dma_addr_t
> daddr)
> 
>  void __init plat_swiotlb_setup(void)
>  {
> - swiotlb_init(1);
> + swiotlb_init(true, SWIOTLB_VERBOSE);
>  }
> diff --git a/arch/mips/sibyte/common/dma.c b/arch/mips/sibyte/common/dma.c
> index eb47a94f3583e..c5c2c782aff68 100644
> --- a/arch/mips/sibyte/common/dma.c
> +++ b/arch/mips/sibyte/common/dma.c
> @@ -10,5 +10,5 @@
> 
>  void __init plat_swiotlb_setup(void)
>  {
> - swiotlb_init(1);
> + swiotlb_init(true, SWIOTLB_VERBOSE);
>  }
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 8e301cd8925b2..e1519e2edc656 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
> 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -251,7 +252,7 @@ void __init mem_init(void)
>   if (is_secure_guest())
>   svm_swiotlb_init();
>   else
> - swiotlb_init(0);
> + swiotlb_init(ppc_swiotlb_enable, 0);
>  #endif
> 
>   high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
> diff --git a/arch/powerpc/platforms/pseries/setup.c
> b/arch/powerpc/platforms/pseries/setup.c
> index 069d7b3bb142e..c6e06d91b6602 

RE: [PATCH 10/12] swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction

2022-03-06 Thread Michael Kelley (LINUX)
From: Dongli Zhang  Sent: Friday, March 4, 2022 10:28 
AM
> 
> Hi Michael,
> 
> On 3/4/22 10:12 AM, Michael Kelley (LINUX) wrote:
> > From: Christoph Hellwig  Sent: Tuesday, March 1, 2022 2:53 AM
> >>
> >> Power SVM wants to allocate a swiotlb buffer that is not restricted to low 
> >> memory for
> >> the trusted hypervisor scheme.  Consolidate the support for this into the 
> >> swiotlb_init
> >> interface by adding a new flag.
> >
> > Hyper-V Isolated VMs want to do the same thing of not restricting the 
> > swiotlb
> > buffer to low memory.  That's what Tianyu Lan's patch set[1] is proposing.
> > Hyper-V synthetic devices have no DMA addressing limitations, and the
> > likelihood of using a PCI pass-thru device with addressing limitations in an
> > Isolated VM seems vanishingly small.
> >
> > So could use of the SWIOTLB_ANY flag be generalized?  Let Hyper-V init
> > code set the flag before swiotlb_init() is called.  Or provide a CONFIG
> > variable that Hyper-V Isolated VMs could set.
> 
> I used to send 64-bit swiotlb, while at that time people thought it was the 
> same
> as Restricted DMA patchset.
> 
> https://lore.kernel.org/all/20210203233709.19819-1-dongli.zh...@oracle.com/
> 
> However, I do not think Restricted DMA patchset is going to supports 64-bit 
> (or
> high memory) DMA. Is this what you are looking for?

Yes, it looks like your patchset would do what we want for Hyper-V Isolated
VMs, but it is a more complex solution than is needed.  My assertion is that
in some environments, such as Hyper-V Isolated VMs, we're willing to assume
all devices are 64-bit DMA capable, and to stop carrying the legacy baggage.
Bounce buffering is used for a different scenario (memory encryption), and
the bounce buffers can be allocated in high memory.   There's no need for a
2nd swiotlb buffer.

Michael



RE: [PATCH 10/12] swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction

2022-03-04 Thread Michael Kelley (LINUX)
From: Christoph Hellwig  Sent: Tuesday, March 1, 2022 2:53 AM
> 
> Power SVM wants to allocate a swiotlb buffer that is not restricted to low 
> memory for
> the trusted hypervisor scheme.  Consolidate the support for this into the 
> swiotlb_init
> interface by adding a new flag.

Hyper-V Isolated VMs want to do the same thing of not restricting the swiotlb
buffer to low memory.  That's what Tianyu Lan's patch set[1] is proposing.
Hyper-V synthetic devices have no DMA addressing limitations, and the
likelihood of using a PCI pass-thru device with addressing limitations in an
Isolated VM seems vanishingly small.

So could use of the SWIOTLB_ANY flag be generalized?  Let Hyper-V init
code set the flag before swiotlb_init() is called.  Or provide a CONFIG
variable that Hyper-V Isolated VMs could set.
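
A minimal sketch of the "let Hyper-V init code set the flag" idea, assuming
an x86-side flags variable (called x86_swiotlb_flags here purely for
illustration) that pci-dma.c would pass through to swiotlb_init():

	/* In ms_hyperv_init_platform(), before swiotlb_init() runs */
	if (hv_is_isolation_supported()) {
		/*
		 * Synthetic devices have no addressing limits; bounce
		 * buffers are needed only for memory encryption, so they
		 * can be allocated anywhere.
		 */
		x86_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_FORCE;
	}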

Michael

[1] https://lore.kernel.org/lkml/20220209122302.213882-1-ltyker...@gmail.com/

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/include/asm/svm.h   |  4 
>  arch/powerpc/include/asm/swiotlb.h   |  1 +
>  arch/powerpc/kernel/dma-swiotlb.c|  1 +
>  arch/powerpc/mm/mem.c|  5 +
>  arch/powerpc/platforms/pseries/svm.c | 26 +-
>  include/linux/swiotlb.h  |  1 +
>  kernel/dma/swiotlb.c |  9 +++--
>  7 files changed, 12 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
> index 7546402d796af..85580b30aba48 100644
> --- a/arch/powerpc/include/asm/svm.h
> +++ b/arch/powerpc/include/asm/svm.h
> @@ -15,8 +15,6 @@ static inline bool is_secure_guest(void)
>   return mfmsr() & MSR_S;
>  }
> 
> -void __init svm_swiotlb_init(void);
> -
>  void dtl_cache_ctor(void *addr);
>  #define get_dtl_cache_ctor() (is_secure_guest() ? dtl_cache_ctor : NULL)
> 
> @@ -27,8 +25,6 @@ static inline bool is_secure_guest(void)
>   return false;
>  }
> 
> -static inline void svm_swiotlb_init(void) {}
> -
>  #define get_dtl_cache_ctor() NULL
> 
>  #endif /* CONFIG_PPC_SVM */
> diff --git a/arch/powerpc/include/asm/swiotlb.h
> b/arch/powerpc/include/asm/swiotlb.h
> index 3c1a1cd161286..4203b5e0a88ed 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -9,6 +9,7 @@
>  #include 
> 
>  extern unsigned int ppc_swiotlb_enable;
> +extern unsigned int ppc_swiotlb_flags;
> 
>  #ifdef CONFIG_SWIOTLB
>  void swiotlb_detect_4g(void);
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-
> swiotlb.c
> index fc7816126a401..ba256c37bcc0f 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -10,6 +10,7 @@
>  #include 
> 
>  unsigned int ppc_swiotlb_enable;
> +unsigned int ppc_swiotlb_flags;
> 
>  void __init swiotlb_detect_4g(void)
>  {
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index
> e1519e2edc656..a4d65418c30a9 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -249,10 +249,7 @@ void __init mem_init(void)
>* back to to-down.
>*/
>   memblock_set_bottom_up(true);
> - if (is_secure_guest())
> - svm_swiotlb_init();
> - else
> - swiotlb_init(ppc_swiotlb_enable, 0);
> + swiotlb_init(ppc_swiotlb_enable, ppc_swiotlb_flags);
>  #endif
> 
>   high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
> diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
> index c5228f4969eb2..3b4045d508ec8 100644
> --- a/arch/powerpc/platforms/pseries/svm.c
> +++ b/arch/powerpc/platforms/pseries/svm.c
> @@ -28,7 +28,7 @@ static int __init init_svm(void)
>* need to use the SWIOTLB buffer for DMA even if dma_capable() says
>* otherwise.
>*/
> - swiotlb_force = SWIOTLB_FORCE;
> + ppc_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_FORCE;
> 
>   /* Share the SWIOTLB buffer with the host. */
>   swiotlb_update_mem_attributes();
> @@ -37,30 +37,6 @@ static int __init init_svm(void)
>  }
>  machine_early_initcall(pseries, init_svm);
> 
> -/*
> - * Initialize SWIOTLB. Essentially the same as swiotlb_init(), except that it
> - * can allocate the buffer anywhere in memory. Since the hypervisor doesn't 
> have
> - * any addressing limitation, we don't need to allocate it in low addresses.
> - */
> -void __init svm_swiotlb_init(void)
> -{
> - unsigned char *vstart;
> - unsigned long bytes, io_tlb_nslabs;
> -
> - io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT);
> - io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> -
> - bytes = io_tlb_nslabs << IO_TLB_SHIFT;
> -
> - vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
> - if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
> - return;
> -
> -
> - memblock_free(vstart, PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
> - panic("SVM: Cannot allocate SWIOTLB buffer");
> -}
> -
>  int set_memory_encrypted(unsigned long addr, int 

RE: [PATCH 08/11] swiotlb: make the swiotlb_init interface more useful

2022-02-28 Thread Michael Kelley (LINUX)
From: Christoph Hellwig  Sent: Monday, February 28, 2022 3:31 AM
> 
> On Mon, Feb 28, 2022 at 02:53:39AM +, Michael Kelley (LINUX) wrote:
> > From: Christoph Hellwig  Sent: Sunday, February 27, 2022 6:31 
> > AM
> > >
> > > Pass a bool to pass if swiotlb needs to be enabled based on the
> > > addressing needs and replace the verbose argument with a set of
> > > flags, including one to force enable bounce buffering.
> > >
> > > Note that this patch removes the possibility to force xen-swiotlb
> > > use using swiotlb=force on the command line on x86 (arm and arm64
> > > never supported that), but this interface will be restored shortly.
> > >
> > > Signed-off-by: Christoph Hellwig 
> > > ---
> > >  arch/arm/mm/init.c |  6 +
> > >  arch/arm64/mm/init.c   |  6 +
> > >  arch/ia64/mm/init.c|  4 +--
> > >  arch/mips/cavium-octeon/dma-octeon.c   |  2 +-
> > >  arch/mips/loongson64/dma.c |  2 +-
> > >  arch/mips/sibyte/common/dma.c  |  2 +-
> > >  arch/powerpc/include/asm/swiotlb.h |  1 +
> > >  arch/powerpc/mm/mem.c  |  3 ++-
> > >  arch/powerpc/platforms/pseries/setup.c |  3 ---
> > >  arch/riscv/mm/init.c   |  8 +-
> > >  arch/s390/mm/init.c|  3 +--
> > >  arch/x86/kernel/cpu/mshyperv.c |  8 --
> > >  arch/x86/kernel/pci-dma.c  | 15 ++-
> > >  arch/x86/mm/mem_encrypt_amd.c  |  3 ---
> > >  drivers/xen/swiotlb-xen.c  |  4 +--
> > >  include/linux/swiotlb.h| 15 ++-
> > >  include/trace/events/swiotlb.h | 29 -
> > >  kernel/dma/swiotlb.c   | 35 ++
> > >  18 files changed, 56 insertions(+), 93 deletions(-)
> >
> > [snip]
> >
> > >
> > > diff --git a/arch/x86/kernel/cpu/mshyperv.c 
> > > b/arch/x86/kernel/cpu/mshyperv.c
> > > index 5a99f993e6392..568274917f1cd 100644
> > > --- a/arch/x86/kernel/cpu/mshyperv.c
> > > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > > @@ -336,14 +336,6 @@ static void __init ms_hyperv_init_platform(void)
> > >   swiotlb_unencrypted_base =
> ms_hyperv.shared_gpa_boundary;
> > >  #endif
> > >   }
> > > -
> > > -#ifdef CONFIG_SWIOTLB
> > > - /*
> > > -  * Enable swiotlb force mode in Isolation VM to
> > > -  * use swiotlb bounce buffer for dma transaction.
> > > -  */
> > > - swiotlb_force = SWIOTLB_FORCE;
> > > -#endif
> >
> > With this code removed, it's not clear to me what forces the use of the
> > swiotlb in a Hyper-V isolated VM.  The code in pci_swiotlb_detect_4g() 
> > doesn't
> > catch this case because cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
> > returns "false" in a Hyper-V guest.  In the Hyper-V guest, it's only
> > cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) that returns "true".  I'm
> > looking more closely at the meaning of the CC_ATTR_* values, and it may
> > be that Hyper-V should also return "true" for CC_ATTR_MEM_ENCRYPT,
> > but I don't think CC_ATTR_HOST_MEM_ENCRYPT should return "true".
> 
> Ok, I assumed that CC_ATTR_HOST_MEM_ENCRYPT returned true in this case.
> I guess we just need to check for CC_ATTR_GUEST_MEM_ENCRYPT as well
> there?

I'm unsure.

The comments for CC_ATTR_HOST_MEM_ENCRYPT indicate that it is for
SME.   The comments for both CC_ATTR_MEM_ENCRYPT and
CC_ATTR_GUEST_MEM_ENCRYPT mention SEV and SEV-ES (and presumably
SEV-SNP).   But I haven't looked at the details of the core SNP patches from
the AMD folks.   I'd say that they need to weigh in on the right approach
here that will work for both SME and the various SEV flavors, and then
hopefully the Hyper-V case will fit in.
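
For concreteness, the check Christoph is suggesting would look roughly
like this in the x86 swiotlb detection path (sketch only; whether this is
the right attribute for Hyper-V is exactly the open question above):

	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
	    cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		swiotlb = 1;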

Michael



RE: [PATCH 08/11] swiotlb: make the swiotlb_init interface more useful

2022-02-27 Thread Michael Kelley (LINUX)
From: Christoph Hellwig  Sent: Sunday, February 27, 2022 6:31 AM
> 
> Pass a bool to pass if swiotlb needs to be enabled based on the
> addressing needs and replace the verbose argument with a set of
> flags, including one to force enable bounce buffering.
> 
> Note that this patch removes the possibility to force xen-swiotlb
> use using swiotlb=force on the command line on x86 (arm and arm64
> never supported that), but this interface will be restored shortly.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/arm/mm/init.c |  6 +
>  arch/arm64/mm/init.c   |  6 +
>  arch/ia64/mm/init.c|  4 +--
>  arch/mips/cavium-octeon/dma-octeon.c   |  2 +-
>  arch/mips/loongson64/dma.c |  2 +-
>  arch/mips/sibyte/common/dma.c  |  2 +-
>  arch/powerpc/include/asm/swiotlb.h |  1 +
>  arch/powerpc/mm/mem.c  |  3 ++-
>  arch/powerpc/platforms/pseries/setup.c |  3 ---
>  arch/riscv/mm/init.c   |  8 +-
>  arch/s390/mm/init.c|  3 +--
>  arch/x86/kernel/cpu/mshyperv.c |  8 --
>  arch/x86/kernel/pci-dma.c  | 15 ++-
>  arch/x86/mm/mem_encrypt_amd.c  |  3 ---
>  drivers/xen/swiotlb-xen.c  |  4 +--
>  include/linux/swiotlb.h| 15 ++-
>  include/trace/events/swiotlb.h | 29 -
>  kernel/dma/swiotlb.c   | 35 ++
>  18 files changed, 56 insertions(+), 93 deletions(-)

[snip]

> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 5a99f993e6392..568274917f1cd 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -336,14 +336,6 @@ static void __init ms_hyperv_init_platform(void)
>   swiotlb_unencrypted_base = 
> ms_hyperv.shared_gpa_boundary;
>  #endif
>   }
> -
> -#ifdef CONFIG_SWIOTLB
> - /*
> -  * Enable swiotlb force mode in Isolation VM to
> -  * use swiotlb bounce buffer for dma transaction.
> -  */
> - swiotlb_force = SWIOTLB_FORCE;
> -#endif

With this code removed, it's not clear to me what forces the use of the
swiotlb in a Hyper-V isolated VM.  The code in pci_swiotlb_detect_4g() doesn't
catch this case because cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
returns "false" in a Hyper-V guest.  In the Hyper-V guest, it's only
cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) that returns "true".  I'm
looking more closely at the meaning of the CC_ATTR_* values, and it may
be that Hyper-V should also return "true" for CC_ATTR_MEM_ENCRYPT,
but I don't think CC_ATTR_HOST_MEM_ENCRYPT should return "true".

Michael






RE: [patch V3 00/35] genirq/msi, PCI/MSI: Spring cleaning - Part 2

2021-12-15 Thread Michael Kelley (LINUX)
From: Thomas Gleixner  Sent: Wednesday, December 15, 2021 
8:36 AM
> 
> On Wed, Dec 15 2021 at 17:18, Thomas Gleixner wrote:
> 
> > On Tue, Dec 14 2021 at 22:19, Thomas Gleixner wrote:
> >> On Tue, Dec 14 2021 at 14:56, Nishanth Menon wrote:
> >>
> >> thanks for trying. I'll have a look again with brain awake tomorrow
> >> morning.
> >
> > Morning was busy with other things, but I found what my sleepy brain
> > managed to do wrong yesterday evening.
> >
> > Let me reintegrate the pile and I'll send you an update.
> 
>git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git 
> msi-v4.1-part-2
>git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git 
> msi-v4.2-part-3
> 
> That should cure the problem.

Tested the msi-v4.2-part-3 tag in two different Azure/Hyper-V VMs.  One
is a Generation 1 VM that has legacy PCI devices and one is a Generation 2
VM with no legacy PCI devices.   Tested hot add and remove of Mellanox
CX-3 and CX-4 SR-IOV NIC virtual functions that are directly mapped into the
VM.  Also tested local NVMe devices directly mapped into one of the VMs.

No issues encountered.  So for Azure/Hyper-V specifically,

Tested-by: Michael Kelley 




RE: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-12-03 Thread Michael Kelley (LINUX)
From: Tianyu Lan  Sent: Wednesday, December 1, 2021 8:03 AM
> 
> hyperv Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory and so enable swiotlb force
> mode to use swiotlb bounce buffer for DMA transaction.
> 
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Hyper-V initializes the swiotlb bounce buffer and the default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions, which depend on pci_xen_swiotlb_init().
> Make pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect()
> to keep the order.
> 
> Swiotlb bounce buffer code calls set_memory_decrypted()
> to mark bounce buffer visible to host and map it in extra
> address space via memremap. Populate the shared_gpa_boundary
> (vTOM) via swiotlb_unencrypted_base variable.
> 
> The map function memremap() can't work in the early place
> hyperv_iommu_swiotlb_init() and so call swiotlb_update_mem_attributes()
> in the hyperv_iommu_swiotlb_later_init().
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/xen/pci-swiotlb-xen.c |  3 +-
>  drivers/hv/vmbus_drv.c |  3 ++
>  drivers/iommu/hyperv-iommu.c   | 56 ++
>  include/linux/hyperv.h |  8 +
>  4 files changed, 69 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 46df59aeaa06..30fd0600b008 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include 
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
> 
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 392c1ac4f819..0a64ccfafb8b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "hyperv_vmbus.h"
> 
> @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t 
> *type,
>   return child_device_obj;
>  }
> 
> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
>  /*
>   * vmbus_device_register - Register the child device
>   */
> @@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device 
> *child_device_obj)
>   }
>   hv_debug_add_dev_dir(child_device_obj);
> 
> + child_device_obj->device.dma_mask = &vmbus_dma_mask;
>   return 0;
> 
>  err_kset_unregister:
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index e285a220c913..dd729d49a1eb 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -13,14 +13,20 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "irq_remapping.h"
> 
> @@ -337,4 +343,54 @@ static const struct irq_domain_ops 
> hyperv_root_ir_domain_ops = {
>   .free = hyperv_root_irq_remapping_free,
>  };
> 
> +static void __init hyperv_iommu_swiotlb_init(void)
> +{
> + unsigned long hyperv_io_tlb_size;
> + void *hyperv_io_tlb_start;
> +
> + /*
> +  * Allocate Hyper-V swiotlb bounce buffer at early place
> +  * to reserve large contiguous memory.
> +  */
> + hyperv_io_tlb_size = swiotlb_size_or_default();
> + hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
> +
> + if (!hyperv_io_tlb_start)
> + pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");

In the error case, won't swiotlb_init_with_tbl() end up panic'ing when
it tries to zero out the memory?   The only real choice here is to
return immediately after printing the message, and not call
swiotlb_init_with_tbl().
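
In other words, the failure path would need to look something like this
(sketch of the suggestion, not the posted code):

	hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
	if (!hyperv_io_tlb_start) {
		pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
		return;
	}

	swiotlb_init_with_tbl(hyperv_io_tlb_start,
			      hyperv_io_tlb_size >> IO_TLB_SHIFT, true);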

> +
> + swiotlb_init_with_tbl(hyperv_io_tlb_start,
> +   hyperv_io_tlb_size >> IO_TLB_SHIFT, true);
> +}
> +
> +int __init hyperv_swiotlb_detect(void)
> +{
> + if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> + return 0;
> +
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + /*
> +  * Enable swiotlb force mode in Isolation VM to
> +  * use swiotlb bounce 

RE: [PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver

2021-12-03 Thread Michael Kelley (LINUX)
From: Tianyu Lan  Sent: Wednesday, December 1, 2021 8:03 AM
> 
> In Isolation VM, all memory shared with the host needs to be marked
> visible to the host via a hypercall. vmbus_establish_gpadl() has already
> done it for the netvsc rx/tx ring buffers. The page buffer used by
> vmbus_sendpacket_pagebuffer() still needs to be handled. Use the DMA API
> to map/unmap this memory during sending/receiving packets, and the
> Hyper-V swiotlb bounce buffer DMA address will be returned. The swiotlb
> bounce buffer has been masked to be visible to the host during boot up.
> 
> The rx/tx ring buffers are allocated via vzalloc() and they need to be
> mapped into unencrypted address space (above vTOM) before sharing
> with the host and accessing. Add hv_map/unmap_memory() to map/unmap the
> rx/tx ring buffers.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v2:
>* Add hv_map/unmap_memory() to map/umap rx/tx ring buffer.
> ---
>  arch/x86/hyperv/ivm.c |  28 ++
>  drivers/hv/hv_common.c|  11 +++
>  drivers/net/hyperv/hyperv_net.h   |   5 ++
>  drivers/net/hyperv/netvsc.c   | 136 +-
>  drivers/net/hyperv/netvsc_drv.c   |   1 +
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/asm-generic/mshyperv.h|   2 +
>  include/linux/hyperv.h|   5 ++
>  8 files changed, 187 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 69c7a57f3307..9f78d8f67ea3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -287,3 +287,31 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, 
> int pagecount, bool visibl
>   kfree(pfn_array);
>   return ret;
>  }
> +
> +/*
> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
> + */
> +void *hv_map_memory(void *addr, unsigned long size)
> +{
> + unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,

This should be just PAGE_SIZE, as this code is unrelated to communication
with Hyper-V.

> +   sizeof(unsigned long), GFP_KERNEL);
> + void *vaddr;
> + int i;
> +
> + if (!pfns)
> + return NULL;
> +
> + for (i = 0; i < size / PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(addr + i * PAGE_SIZE) +

Same here:  Use virt_to_pfn().

> + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
> + kfree(pfns);
> +
> + return vaddr;
> +}
> +
> +void hv_unmap_memory(void *addr)
> +{
> + vunmap(addr);
> +}
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 7be173a99f27..3c5cb1f70319 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -295,3 +295,14 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, 
> void *output, u32 input_s
>   return HV_STATUS_INVALID_PARAMETER;
>  }
>  EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
> +
> +void __weak *hv_map_memory(void *addr, unsigned long size)
> +{
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(hv_map_memory);
> +
> +void __weak hv_unmap_memory(void *addr)
> +{
> +}
> +EXPORT_SYMBOL_GPL(hv_unmap_memory);
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..cf69da0e296c 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
> 
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + void *recv_original_buf;
>   u32 recv_buf_size; /* allocated bytes */
>   struct vmbus_gpadl recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1084,7 @@ struct netvsc_device {
> 
>   /* Send buffer allocated by us */
>   void *send_buf;
> + void *send_original_buf;
>   u32 send_buf_size;
>   struct vmbus_gpadl send_buf_gpadl_handle;
>   u32 send_section_cnt;
> @@ -1731,4 +1734,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX2000/* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 396bc1c204e6..b7ade735a806 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
>   int i;
> 
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_original_buf) {
> + hv_unmap_memory(nvdev->recv_buf);
> + vfree(nvdev->recv_original_buf);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if 

RE: [PATCH V2 5/6] net: netvsc: Add Isolation VM support for netvsc driver

2021-11-24 Thread Michael Kelley (LINUX)
From: Tianyu Lan  Sent: Tuesday, November 23, 2021 6:31 AM
> 
> In Isolation VM, all memory shared with the host needs to be marked
> visible to the host via a hypercall. vmbus_establish_gpadl() has already
> done it for the netvsc rx/tx ring buffers. The page buffer used by
> vmbus_sendpacket_pagebuffer() still needs to be handled. Use the DMA API
> to map/unmap this memory during sending/receiving packets, and the
> Hyper-V swiotlb bounce buffer DMA address will be returned. The swiotlb
> bounce buffer has been masked to be visible to the host during boot up.
> 
> Allocate the rx/tx ring buffers via dma_alloc_noncontiguous() in Isolation
> VM. After calling vmbus_establish_gpadl(), which marks these pages visible
> to the host, map these pages into unencrypted address space via
> dma_vmap_noncontiguous().
> to host, map these pages unencrypted addes space via dma_vmap_noncontiguous().
> 

The big unresolved topic is how best to do the allocation and mapping of the big
netvsc send and receive buffers.  Let me summarize and make a recommendation.

Background
==
1.  Each Hyper-V synthetic network device requires a large pre-allocated receive
 buffer (defaults to 16 Mbytes) and a similar send buffer (defaults to 1 Mbyte).
2.  The buffers are allocated in guest memory and shared with the Hyper-V host.
 As such, in the Hyper-V SNP environment, the memory must be unencrypted
 and accessed in the Hyper-V guest with shared_gpa_boundary (i.e., VTOM)
 added to the physical memory address.
3.  The buffers need *not* be contiguous in guest physical memory, but must be
 contiguously mapped in guest kernel virtual space.
4.  Network devices may come and go during the life of the VM, so allocation of
 these buffers and their mappings may be done after Linux has been running
 for a long time.
5.  Performance of the allocation and mapping process is not an issue since it
 is done only on synthetic network device add/remove.
6.  So the primary goals are an appropriate logical abstraction, code that is
 simple and straightforward, and efficient memory usage.

Approaches
==
During the development of these patches, four approaches have been
implemented:

1.  Two virtual mappings:  One from vmalloc() to allocate the guest memory, and
 the second from vmap_pfn() after adding the shared_gpa_boundary.   This is
 implemented in Hyper-V or netvsc specific code, with no use of DMA APIs.
 No separate list of physical pages is maintained, so for creating the second
 mapping, the PFN list is assembled temporarily by doing virt-to-phys()
 page-by-page on the vmalloc mapping, and then discarded because it is no
 longer needed.  [v4 of the original patch series.]

2.  Two virtual mappings as in (1) above, but implemented via new DMA calls
 dma_map_decrypted() and dma_unmap_encrypted().  [v3 of the original
 patch series.]

3.  Two virtual mappings as in (1) above, but implemented via DMA noncontiguous
  allocation and mapping calls, as enhanced to allow for custom map/unmap
  implementations.  A list of physical pages is maintained in the dma_sgt_handle
  as expected by the DMA noncontiguous API.  [New split-off patch series v1 & v2]

4.  Single virtual mapping from vmap_pfn().  The netvsc driver allocates physical
  memory via alloc_pages() with as much contiguity as possible, and maintains a
  list of physical pages and ranges.   The single virtual map is set up with
  vmap_pfn() after adding shared_gpa_boundary.  [v5 of the original patch series.]

Both implementations using DMA APIs use very little of the existing DMA
machinery.  Both require extensions to the DMA APIs, and custom ops functions.
While in some sense the netvsc send and receive buffers involve DMA, they
do not require any DMA actions on a per-I/O basis.  It seems better to me to
not try to fit these two buffers into the DMA model as a one-off.  Let's just
use Hyper-V specific code to allocate and map them, as is done with the
Hyper-V VMbus channel ring buffers.

That leaves approaches (1) and (4) above.  Between those two, (1) is
simpler even though there are two virtual mappings.  Using alloc_pages() as
in (4) is messy and there's no real benefit to using higher order allocations.
(4) also requires maintaining a separate list of PFNs and ranges, which offsets
some of the benefit of having only one virtual mapping active at any point in
time.

I don't think there's a clear "right" answer, so it's a judgment call.  We've
explored what other approaches would look like, and I'd say let's go with
(1) as the simpler approach.  Thoughts?
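
For concreteness, a rough sketch of approach (1), with an illustrative
function name that is not taken from any posted patch:

/*
 * Create the second, unencrypted mapping of a vzalloc()'d buffer by
 * re-deriving the PFN list and adding shared_gpa_boundary (vTOM).
 * The PFN list is only needed transiently and is freed afterward.
 */
static void *netvsc_remap_buf(void *buf, unsigned long size)
{
	unsigned long *pfns;
	void *vaddr;
	int i;

	pfns = kcalloc(size / PAGE_SIZE, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return NULL;

	for (i = 0; i < size / PAGE_SIZE; i++)
		pfns[i] = vmalloc_to_pfn(buf + i * PAGE_SIZE) +
			  (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);

	vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
	kfree(pfns);

	return vaddr;	/* release with vunmap(); vfree() the vzalloc buffer separately */
}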

Michael



RE: [PATCH V2 5/6] net: netvsc: Add Isolation VM support for netvsc driver

2021-11-23 Thread Michael Kelley (LINUX)
From: Tianyu Lan  Sent: Tuesday, November 23, 2021 6:31 AM
> 
> In Isolation VM, all memory shared with the host needs to be marked
> visible to the host via a hypercall. vmbus_establish_gpadl() has already
> done it for the netvsc rx/tx ring buffers. The page buffer used by
> vmbus_sendpacket_pagebuffer() still needs to be handled. Use the DMA API
> to map/unmap this memory during sending/receiving packets, and the
> Hyper-V swiotlb bounce buffer DMA address will be returned. The swiotlb
> bounce buffer has been masked to be visible to the host during boot up.
> 
> Allocate the rx/tx ring buffers via dma_alloc_noncontiguous() in Isolation
> VM. After calling vmbus_establish_gpadl(), which marks these pages visible
> to the host, map these pages into unencrypted address space via
> dma_vmap_noncontiguous().
> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/net/hyperv/hyperv_net.h   |   5 +
>  drivers/net/hyperv/netvsc.c   | 192 +++---
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/linux/hyperv.h|   6 +
>  4 files changed, 190 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..31c77a00d01e 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
> 
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + struct sg_table *recv_sgt;
>   u32 recv_buf_size; /* allocated bytes */
>   struct vmbus_gpadl recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1084,7 @@ struct netvsc_device {
> 
>   /* Send buffer allocated by us */
>   void *send_buf;
> + struct sg_table *send_sgt;
>   u32 send_buf_size;
>   struct vmbus_gpadl send_buf_gpadl_handle;
>   u32 send_section_cnt;
> @@ -1731,4 +1734,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX2000/* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 396bc1c204e6..9cdc71930830 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -146,15 +147,39 @@ static struct netvsc_device *alloc_net_device(void)
>   return net_device;
>  }
> 
> +static struct hv_device *netvsc_channel_to_device(struct vmbus_channel 
> *channel)
> +{
> + struct vmbus_channel *primary = channel->primary_channel;
> +
> + return primary ? primary->device_obj : channel->device_obj;
> +}
> +
>  static void free_netvsc_device(struct rcu_head *head)
>  {
>   struct netvsc_device *nvdev
>   = container_of(head, struct netvsc_device, rcu);
> + struct hv_device *dev =
> + netvsc_channel_to_device(nvdev->chan_table[0].channel);
>   int i;
> 
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_sgt) {
> + dma_vunmap_noncontiguous(&dev->device, nvdev->recv_buf);
> + dma_free_noncontiguous(&dev->device, nvdev->recv_buf_size,
> +nvdev->recv_sgt, DMA_FROM_DEVICE);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_sgt) {
> + dma_vunmap_noncontiguous(&dev->device, nvdev->send_buf);
> + dma_free_noncontiguous(&dev->device, nvdev->send_buf_size,
> +nvdev->send_sgt, DMA_TO_DEVICE);
> + } else {
> + vfree(nvdev->send_buf);
> + }
> +
>   kfree(nvdev->send_section_map);
> 
>   for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -348,7 +373,21 @@ static int netvsc_init_buf(struct hv_device *device,
>   buf_size = min_t(unsigned int, buf_size,
>NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
> 
> - net_device->recv_buf = vzalloc(buf_size);
> + if (hv_isolation_type_snp()) {
> + net_device->recv_sgt =
> + dma_alloc_noncontiguous(&device->device, buf_size,
> + DMA_FROM_DEVICE, GFP_KERNEL, 0);
> + if (!net_device->recv_sgt) {
> + pr_err("Fail to allocate recv buffer buf_size %d.\n.", 
> buf_size);
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> +
> + net_device->recv_buf = (void *)net_device->recv_sgt->sgl->dma_address;

Use sg_dma_address() macro.
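i.e., something like:

	net_device->recv_buf = (void *)sg_dma_address(net_device->recv_sgt->sgl);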

> + } else {
> + net_device->recv_buf = vzalloc(buf_size);
> + }
> +
>   if 

RE: [PATCH V2 4/6] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-11-23 Thread Michael Kelley (LINUX)
From: Tianyu Lan  Sent: Tuesday, November 23, 2021 6:31 AM
> 
> hyperv Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory and so enable swiotlb force
> mode to use swiotlb bounce buffer for DMA transaction.
> 
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Hyper-V initalizes swiotlb bounce buffer and default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions which depends on the pci_xen_swiotlb_
> init(). Make pci_xen_swiotlb_init() depends on the hyperv_swiotlb
> _detect() to keep the order.
> 
> Swiotlb bounce buffer code calls set_memory_decrypted()
> to mark bounce buffer visible to host and map it in extra
> address space via memremap. Populate the shared_gpa_boundary
> (vTOM) via swiotlb_unencrypted_base variable.
> 
> The map function memremap() can't work in the early place
> hyperv_iommu_swiotlb_init() and so call swiotlb_update_mem_attributes()
> in the hyperv_iommu_swiotlb_later_init().
> 
> Add Hyper-V dma ops and provide alloc/free and vmap/vunmap noncontiguous
> callback to handle request of  allocating and mapping noncontiguous dma
> memory in vmbus device driver. Netvsc driver will use this. Set dma_ops_
> bypass flag for hv device to use dma direct functions during mapping/unmapping
> dma page.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v1:
>   * Remove hv isolation check in the sev_setup_arch()
> 
>  arch/x86/mm/mem_encrypt.c  |   1 +
>  arch/x86/xen/pci-swiotlb-xen.c |   3 +-
>  drivers/hv/Kconfig |   1 +
>  drivers/hv/vmbus_drv.c |   6 ++
>  drivers/iommu/hyperv-iommu.c   | 164 +
>  include/linux/hyperv.h |  10 ++
>  6 files changed, 184 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 35487305d8af..e48c73b3dd41 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

There is no longer any need to add this #include since code changes to this
file in a previous version of the patch are now gone.

> 
>  #include "mm_internal.h"
> 
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 46df59aeaa06..30fd0600b008 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include 
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
> 
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index dd12af20e467..d43b4cd88f57 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -9,6 +9,7 @@ config HYPERV
>   select PARAVIRT
>   select X86_HV_CALLBACK_VECTOR if X86
>   select VMAP_PFN
> + select DMA_OPS_BYPASS
>   help
> Select this option to run Linux as a Hyper-V client operating
> system.
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 392c1ac4f819..32dc193e31cd 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "hyperv_vmbus.h"
> 
> @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t 
> *type,
>   return child_device_obj;
>  }
> 
> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
>  /*
>   * vmbus_device_register - Register the child device
>   */
> @@ -2118,6 +2120,10 @@ int vmbus_device_register(struct hv_device 
> *child_device_obj)
>   }
>   hv_debug_add_dev_dir(child_device_obj);
> 
> + child_device_obj->device.dma_ops_bypass = true;
> + child_device_obj->device.dma_ops = &hyperv_iommu_dma_ops;
> + child_device_obj->device.dma_mask = &vmbus_dma_mask;
> + child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
>   return 0;
> 
>  err_kset_unregister:
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index e285a220c913..ebcb628e7e8f 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -13,14 +13,21 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include 

RE: [PATCH V2 1/6] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-11-23 Thread Michael Kelley (LINUX)
From: Tianyu Lan  Sent: Tuesday, November 23, 2021 6:31 AM
> 
> In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
> extra address space which is above shared_gpa_boundary (E.G 39 bit
> address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access
> physical address will be original physical address + shared_gpa_boundary.
> The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of
> memory(vTOM). Memory addresses below vTOM are automatically treated as
> private while memory above vTOM is treated as shared.
> 
> Expose swiotlb_unencrypted_base for platforms to set unencrypted
> memory base offset and platform calls swiotlb_update_mem_attributes()
> to remap swiotlb mem to unencrypted address space. memremap() can
> not be called in the early stage and so put remapping code into
> swiotlb_update_mem_attributes(). Store remap address and use it to copy
> data from/to swiotlb bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v1:
>   * Rework comment in the swiotlb_init_io_tlb_mem()
>   * Make swiotlb_init_io_tlb_mem() back to return void.
> ---
>  include/linux/swiotlb.h |  6 +
>  kernel/dma/swiotlb.c| 53 +
>  2 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 569272871375..f6c3638255d5 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
>   * @end: The end address of the swiotlb memory pool. Used to do a quick
>   *   range check to see if the memory was in fact allocated by this
>   *   API.
> + * @vaddr:   The vaddr of the swiotlb memory pool. The swiotlb memory pool
> + *   may be remapped in the memory encrypted case and store virtual
> + *   address for bounce buffer operation.
>   * @nslabs:  The number of IO TLB blocks (in groups of 64) between @start and
>   *   @end. For default swiotlb, this is command line adjustable via
>   *   setup_io_tlb_npages.
> @@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
>  struct io_tlb_mem {
>   phys_addr_t start;
>   phys_addr_t end;
> + void *vaddr;
>   unsigned long nslabs;
>   unsigned long used;
>   unsigned int index;
> @@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device 
> *dev)
>  }
>  #endif /* CONFIG_DMA_RESTRICTED_POOL */
> 
> +extern phys_addr_t swiotlb_unencrypted_base;
> +
>  #endif /* __LINUX_SWIOTLB_H */
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 8e840fbbed7c..c303fdeba82f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
> 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
> 
>  struct io_tlb_mem io_tlb_default_mem;
> 
> +phys_addr_t swiotlb_unencrypted_base;
> +
>  /*
>   * Max segment that we can provide which (if pages are contingous) will
>   * not be bounced (unless SWIOTLB_FORCE is set).
> @@ -155,6 +158,31 @@ static inline unsigned long nr_slots(u64 val)
>   return DIV_ROUND_UP(val, IO_TLB_SIZE);
>  }
> 
> +/*
> + * Remap swioltb memory in the unencrypted physical address space
> + * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
> + * Isolation VMs).
> + */
> +void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
> +{
> + void *vaddr;
> +
> + if (swiotlb_unencrypted_base) {
> + phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
> +
> + vaddr = memremap(paddr, bytes, MEMREMAP_WB);
> + if (!vaddr) {
> + pr_err("Failed to map the unencrypted memory %llx size 
> %lx.\n",
> +paddr, bytes);
> + return NULL;
> + }
> +
> + return vaddr;
> + }
> +
> + return phys_to_virt(mem->start);
> +}
> +
>  /*
>   * Early SWIOTLB allocation may be too early to allow an architecture to
>   * perform the desired operations.  This function allows the architecture to
> @@ -172,7 +200,14 @@ void __init swiotlb_update_mem_attributes(void)
>   vaddr = phys_to_virt(mem->start);
>   bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
>   set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
> - memset(vaddr, 0, bytes);
> +
> + mem->vaddr = swiotlb_mem_remap(mem, bytes);
> + if (!mem->vaddr) {
> + pr_err("Fail to remap swiotlb mem.\n");
> + return;
> + }
> +
> + memset(mem->vaddr, 0, bytes);
>  }

In the error case, do you want to leave mem->vaddr as NULL?  Or is it
better to leave it as the virtual address of mem->start?  Your code leaves it
as NULL.

The interaction between swiotlb_update_mem_attributes() and the helper
function swiotlb_mem_remap() seems kind of clunky.  phys_to_virt() gets called
twice, for example, when swiotlb_unencrypted_base is not set: once in the
caller and once again inside the helper.
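
One possible shape that avoids the double phys_to_virt() -- just a sketch, with
swiotlb_mem_remap() doing only the remap and the caller keeping the direct-map
address otherwise:

	vaddr = phys_to_virt(mem->start);
	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
	set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);

	/* Remap only when swiotlb_unencrypted_base is set */
	mem->vaddr = swiotlb_unencrypted_base ? swiotlb_mem_remap(mem, bytes)
					      : vaddr;
	if (!mem->vaddr) {
		pr_err("Fail to remap swiotlb mem.\n");
		return;
	}

	memset(mem->vaddr, 0, bytes);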

RE: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for netvsc driver

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() stills need to be handled. Use DMA API to map/umap
> these memory during sending/receiving packet and Hyper-V swiotlb
> bounce buffer dma address will be returned. The swiotlb bounce buffer
> has been masked to be visible to host during boot up.
> 
> Allocate rx/tx ring buffer via alloc_pages() in Isolation VM and map
> these pages via vmap(). After calling vmbus_establish_gpadl() which
> marks these pages visible to host, unmap these pages to release the
> virtual address mapped with physical address below shared_gpa_boundary
> and map them in the extra address space via vmap_pfn().
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v4:
>   * Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
>   * Map pages after calling vmbus_establish_gpadl().
>   * set dma_set_min_align_mask for netvsc driver.
> 
> Change since v3:
>   * Add comment to explain why not to use dma_map_sg()
>   * Fix some error handle.
> ---
>  drivers/net/hyperv/hyperv_net.h   |   7 +
>  drivers/net/hyperv/netvsc.c   | 287 +-
>  drivers/net/hyperv/netvsc_drv.c   |   1 +
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/linux/hyperv.h|   5 +
>  5 files changed, 296 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..87e8c74398a5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,8 @@ struct netvsc_device {
> 
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + struct page **recv_pages;
> + u32 recv_page_count;
>   u32 recv_buf_size; /* allocated bytes */
>   struct vmbus_gpadl recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1085,8 @@ struct netvsc_device {
> 
>   /* Send buffer allocated by us */
>   void *send_buf;
> + struct page **send_pages;
> + u32 send_page_count;
>   u32 send_buf_size;
>   struct vmbus_gpadl send_buf_gpadl_handle;
>   u32 send_section_cnt;
> @@ -1731,4 +1736,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX2000/* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 1f87e570ed2b..7d5254bf043e 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head *head)
>  {
>   struct netvsc_device *nvdev
>   = container_of(head, struct netvsc_device, rcu);
> + unsigned int alloc_unit;
>   int i;
> 
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_pages) {
> + alloc_unit = (nvdev->recv_buf_size /
> + nvdev->recv_page_count) >> PAGE_SHIFT;
> +
> + vunmap(nvdev->recv_buf);
> + for (i = 0; i < nvdev->recv_page_count; i++)
> + __free_pages(nvdev->recv_pages[i], alloc_unit);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_pages) {
> + alloc_unit = (nvdev->send_buf_size /
> + nvdev->send_page_count) >> PAGE_SHIFT;
> +
> + vunmap(nvdev->send_buf);
> + for (i = 0; i < nvdev->send_page_count; i++)
> + __free_pages(nvdev->send_pages[i], alloc_unit);
> + } else {
> + vfree(nvdev->send_buf);
> + }
> +
>   kfree(nvdev->send_section_map);
> 
>   for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device 
> *net_device, u32 q_idx)
>   return nvchan->mrc.slots ? 0 : -ENOMEM;
>  }
> 
> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
> +  unsigned long size)
> +{
> + struct page *page, **pages, **vmap_pages;
> + unsigned long pg_count = size >> PAGE_SHIFT;
> + int alloc_unit = MAX_ORDER_NR_PAGES;
> + int i, j, vmap_page_index = 0;
> + void *vaddr;
> +
> + if (pg_count < alloc_unit)
> + alloc_unit = 1;
> +
> + /* vmap() accepts page array with PAGE_SIZE as 

RE: [PATCH V5 11/12] scsi: storvsc: Add Isolation VM support for storvsc driver

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> mpb_desc() still needs to be handled. Use DMA API(scsi_dma_map/unmap)
> to map these memory during sending/receiving packet and return swiotlb
> bounce buffer dma address. In Isolation VM, swiotlb  bounce buffer is
> marked to be visible to host and the swiotlb force mode is enabled.
> 
> Set device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order to
> keep the original data offset in the bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v4:
>   * use scsi_dma_map/unmap() instead of dma_map/unmap_sg()
>   * Add deleted comments back.
>   * Fix error calculation of  hvpnfs_to_add
> 
> Change since v3:
>   * Rplace dma_map_page with dma_map_sg()
>   * Use for_each_sg() to populate payload->range.pfn_array.
>   * Remove storvsc_dma_map macro
> ---
>  drivers/hv/vmbus_drv.c |  1 +
>  drivers/scsi/storvsc_drv.c | 24 +++-
>  include/linux/hyperv.h |  1 +
>  3 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index b0be287e9a32..9c53f823cde1 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2121,6 +2121,7 @@ int vmbus_device_register(struct hv_device 
> *child_device_obj)
>   hv_debug_add_dev_dir(child_device_obj);
> 
>   child_device_obj->device.dma_mask = &vmbus_dma_mask;
> + child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
>   return 0;
> 
>  err_kset_unregister:
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index ebbbc1299c62..d10b450bcf0c 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -21,6 +21,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -1322,6 +1324,7 @@ static void storvsc_on_channel_callback(void *context)
>   continue;
>   }
>   request = (struct storvsc_cmd_request 
> *)scsi_cmd_priv(scmnd);
> + scsi_dma_unmap(scmnd);
>   }
> 
>   storvsc_on_receive(stor_device, packet, request);
> @@ -1735,7 +1738,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, 
> struct scsi_cmnd *scmnd)
>   struct hv_host_device *host_dev = shost_priv(host);
>   struct hv_device *dev = host_dev->dev;
>   struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
> - int i;
>   struct scatterlist *sgl;
>   unsigned int sg_count;
>   struct vmscsi_request *vm_srb;
> @@ -1817,10 +1819,11 @@ static int storvsc_queuecommand(struct Scsi_Host 
> *host, struct scsi_cmnd *scmnd)
>   payload_sz = sizeof(cmd_request->mpb);
> 
>   if (sg_count) {
> - unsigned int hvpgoff, hvpfns_to_add;
>   unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
>   unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
> - u64 hvpfn;
> + struct scatterlist *sg;
> + unsigned long hvpfn, hvpfns_to_add;
> + int j, i = 0;
> 
>   if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
> 
> @@ -1834,8 +1837,11 @@ static int storvsc_queuecommand(struct Scsi_Host 
> *host, struct scsi_cmnd *scmnd)
>   payload->range.len = length;
>   payload->range.offset = offset_in_hvpg;
> 
> + sg_count = scsi_dma_map(scmnd);
> + if (sg_count < 0)
> + return SCSI_MLQUEUE_DEVICE_BUSY;
> 
> - for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
> + for_each_sg(sgl, sg, sg_count, j) {
>   /*
>* Init values for the current sgl entry. hvpgoff
>* and hvpfns_to_add are in units of Hyper-V size

Nit:  The above comment is now out-of-date because hvpgoff has
been removed.

> @@ -1845,10 +1851,9 @@ static int storvsc_queuecommand(struct Scsi_Host 
> *host, struct scsi_cmnd *scmnd)
>* even on other than the first sgl entry, provided
>* they are a multiple of PAGE_SIZE.
>*/
> - hvpgoff = HVPFN_DOWN(sgl->offset);
> - hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
> - hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
> - hvpgoff;
> + hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> + hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
> +  sg_dma_len(sg)) - hvpfn;

Good.  This looks correct now.

> 
>  

RE: [PATCH V5 10/12] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> hyperv Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory and so enable swiotlb force
> mode to use swiotlb bounce buffer for DMA transaction.
> 
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Hyper-V initalizes swiotlb bounce buffer and default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions which depends on the pci_xen_swiotlb_
> init(). Make pci_xen_swiotlb_init() depends on the hyperv_swiotlb
> _detect() to keep the order.
> 
> Swiotlb bounce buffer code calls set_memory_decrypted()
> to mark bounce buffer visible to host and map it in extra
> address space via memremap. Populate the shared_gpa_boundary
> (vTOM) via swiotlb_unencrypted_base variable.
> 
> The map function memremap() can't work in the early place
> hyperv_iommu_swiotlb_init() and so initialize swiotlb bounce
> buffer in the hyperv_iommu_swiotlb_later_init().
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v4:
>* Use swiotlb_unencrypted_base variable to pass shared_gpa_
>  boundary and map bounce buffer inside swiotlb code.
> 
> Change since v3:
>* Get hyperv bounce bufffer size via default swiotlb
>bounce buffer size function and keep default size as
>same as the one in the AMD SEV VM.
> ---
>  arch/x86/include/asm/mshyperv.h |  2 ++
>  arch/x86/mm/mem_encrypt.c   |  3 +-
>  arch/x86/xen/pci-swiotlb-xen.c  |  3 +-
>  drivers/hv/vmbus_drv.c  |  3 ++
>  drivers/iommu/hyperv-iommu.c| 60 +
>  include/linux/hyperv.h  |  1 +
>  6 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 165423e8b67a..2d22f29f90c9 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -182,6 +182,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, 
> int vcpu, int vector,
>   struct hv_interrupt_entry *entry);
>  int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry 
> *entry);
>  int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool 
> visible);
> +void *hv_map_memory(void *addr, unsigned long size);
> +void hv_unmap_memory(void *addr);

Aren't these two declarations now spurious?

>  void hv_ghcb_msr_write(u64 msr, u64 value);
>  void hv_ghcb_msr_read(u64 msr, u64 *value);
>  #else /* CONFIG_HYPERV */
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index ff08dc463634..e2db0b8ed938 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include "mm_internal.h"
> 
> @@ -202,7 +203,7 @@ void __init sev_setup_arch(void)
>   phys_addr_t total_mem = memblock_phys_mem_size();
>   unsigned long size;
> 
> - if (!sev_active())
> + if (!sev_active() && !hv_is_isolation_supported())
>   return;
> 
>   /*
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 54f9aa7e8457..43bd031aa332 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include 
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
> 
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 392c1ac4f819..b0be287e9a32 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
> 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t 
> *type,
>   return child_device_obj;
>  }
> 
> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
>  /*
>   * vmbus_device_register - Register the child device
>   */
> @@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device 
> *child_device_obj)
>   }
>   hv_debug_add_dev_dir(child_device_obj);
> 
> + child_device_obj->device.dma_mask = &vmbus_dma_mask;
>   return 0;
> 
>  err_kset_unregister:
> diff --git 

RE: [PATCH V5 09/12] x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
> extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Expose swiotlb_unencrypted_base for platforms to set unencrypted
> memory base offset and call memremap() to map bounce buffer in the
> swiotlb code, store map address and use the address to copy data
> from/to swiotlb bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v4:
>   * Expose swiotlb_unencrypted_base to set unencrypted memory
> offset.
>   * Use memremap() to map bounce buffer if swiotlb_unencrypted_
> base is set.
> 
> Change since v1:
>   * Make swiotlb_init_io_tlb_mem() return error code and return
>   error when dma_map_decrypted() fails.
> ---
>  include/linux/swiotlb.h |  6 ++
>  kernel/dma/swiotlb.c| 41 +++--
>  2 files changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index b0cb2a9973f4..4998ed44ae3d 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -72,6 +72,9 @@ extern enum swiotlb_force swiotlb_force;
>   * @end: The end address of the swiotlb memory pool. Used to do a quick
>   *   range check to see if the memory was in fact allocated by this
>   *   API.
> + * @vaddr:   The vaddr of the swiotlb memory pool. The swiotlb
> + *   memory pool may be remapped in the memory encrypted case and 
> store
> + *   virtual address for bounce buffer operation.
>   * @nslabs:  The number of IO TLB blocks (in groups of 64) between @start and
>   *   @end. For default swiotlb, this is command line adjustable via
>   *   setup_io_tlb_npages.
> @@ -91,6 +94,7 @@ extern enum swiotlb_force swiotlb_force;
>  struct io_tlb_mem {
>   phys_addr_t start;
>   phys_addr_t end;
> + void *vaddr;
>   unsigned long nslabs;
>   unsigned long used;
>   unsigned int index;
> @@ -185,4 +189,6 @@ static inline bool is_swiotlb_for_alloc(struct device 
> *dev)
>  }
>  #endif /* CONFIG_DMA_RESTRICTED_POOL */
> 
> +extern phys_addr_t swiotlb_unencrypted_base;
> +
>  #endif /* __LINUX_SWIOTLB_H */
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 87c40517e822..9e30cc4bd872 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
> 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
> 
>  struct io_tlb_mem io_tlb_default_mem;
> 
> +phys_addr_t swiotlb_unencrypted_base;
> +
>  /*
>   * Max segment that we can provide which (if pages are contingous) will
>   * not be bounced (unless SWIOTLB_FORCE is set).
> @@ -175,7 +178,7 @@ void __init swiotlb_update_mem_attributes(void)
>   memset(vaddr, 0, bytes);
>  }
> 
> -static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t 
> start,
> +static int swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   unsigned long nslabs, bool late_alloc)
>  {
>   void *vaddr = phys_to_virt(start);
> @@ -196,13 +199,34 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem 
> *mem, phys_addr_t start,
>   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>   mem->slots[i].alloc_size = 0;
>   }
> +
> + if (set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT))
> + return -EFAULT;
> +
> + /*
> +  * Map memory in the unencrypted physical address space when requested
> +  * (e.g. for Hyper-V AMD SEV-SNP Isolation VMs).
> +  */
> + if (swiotlb_unencrypted_base) {
> + phys_addr_t paddr = __pa(vaddr) + swiotlb_unencrypted_base;

Nit:  Use "start" instead of "__pa(vaddr)" since "start" is already the needed
physical address.
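i.e.:

	phys_addr_t paddr = start + swiotlb_unencrypted_base;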

> +
> + vaddr = memremap(paddr, bytes, MEMREMAP_WB);
> + if (!vaddr) {
> + pr_err("Failed to map the unencrypted memory.\n");
> + return -ENOMEM;
> + }
> + }
> +
>   memset(vaddr, 0, bytes);
> + mem->vaddr = vaddr;
> + return 0;
>  }
> 
>  int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int 
> verbose)
>  {
>   struct io_tlb_mem *mem = &io_tlb_default_mem;
>   size_t alloc_size;
> + int ret;
> 
>   if (swiotlb_force == SWIOTLB_NO_FORCE)
>   return 0;
> @@ -217,7 +241,11 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned 
> long nslabs, int verbose)
>   

RE: [PATCH V5 07/12] Drivers: hv: vmbus: Add SNP support for VMbus channel initiate message

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
> with host in Isolation VM and so it's necessary to use hvcall to set
> them visible to host. In Isolation VM with AMD SEV SNP, the access
> address should be in the extra space which is above shared gpa
> boundary. So remap these pages into the extra address(pa +
> shared_gpa_boundary).
> 
> Introduce monitor_pages_original[] in the struct vmbus_connection
> to store monitor page virtual address returned by hv_alloc_hyperv_
> zeroed_page() and free monitor page via monitor_pages_original in
> the vmbus_disconnect(). The monitor_pages[] is to used to access
> monitor page and it is initialized to be equal with monitor_pages_
> original. The monitor_pages[] will be overridden in the isolation VM
> with va of extra address. Introduce monitor_pages_pa[] to store
> monitor pages' physical address and use it to populate pa in the
> initiate msg.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v4:
>   * Introduce monitor_pages_pa[] to store monitor pages' physical
> address and use it to populate pa in the initiate msg.
>   * Move code of mapping moniter pages in extra address into
> vmbus_connect().
> 
> Change since v3:
>   * Rename monitor_pages_va with monitor_pages_original
>   * free monitor page via monitor_pages_original and
> monitor_pages is used to access monitor page.
> 
> Change since v1:
> * Not remap monitor pages in the non-SNP isolation VM.
> ---
>  drivers/hv/connection.c   | 90 ---
>  drivers/hv/hyperv_vmbus.h |  2 +
>  2 files changed, 86 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 8820ae68f20f..edd8f7dd169f 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -19,6 +19,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
> 
>  #include "hyperv_vmbus.h"
> @@ -102,8 +104,9 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
>   vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID;
>   }
> 
> - msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
> - msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> + msg->monitor_page1 = vmbus_connection.monitor_pages_pa[0];
> + msg->monitor_page2 = vmbus_connection.monitor_pages_pa[1];
> +
>   msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
> 
>   /*
> @@ -216,6 +219,65 @@ int vmbus_connect(void)
>   goto cleanup;
>   }
> 
> + vmbus_connection.monitor_pages_original[0]
> + = vmbus_connection.monitor_pages[0];
> + vmbus_connection.monitor_pages_original[1]
> + = vmbus_connection.monitor_pages[1];
> + vmbus_connection.monitor_pages_pa[0]
> + = virt_to_phys(vmbus_connection.monitor_pages[0]);
> + vmbus_connection.monitor_pages_pa[1]
> + = virt_to_phys(vmbus_connection.monitor_pages[1]);
> +
> + if (hv_is_isolation_supported()) {
> + vmbus_connection.monitor_pages_pa[0] +=
> + ms_hyperv.shared_gpa_boundary;
> + vmbus_connection.monitor_pages_pa[1] +=
> + ms_hyperv.shared_gpa_boundary;
> +
> + ret = set_memory_decrypted((unsigned long)
> +vmbus_connection.monitor_pages[0],
> +1);
> + ret |= set_memory_decrypted((unsigned long)
> + vmbus_connection.monitor_pages[1],
> + 1);
> + if (ret)
> + goto cleanup;
> +
> + /*
> +  * Isolation VM with AMD SNP needs to access monitor page via
> +  * address space above shared gpa boundary.
> +  */
> + if (hv_isolation_type_snp()) {
> + vmbus_connection.monitor_pages[0]
> + = memremap(vmbus_connection.monitor_pages_pa[0],
> +HV_HYP_PAGE_SIZE,
> +MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[0]) {
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> +
> + vmbus_connection.monitor_pages[1]
> + = memremap(vmbus_connection.monitor_pages_pa[1],
> +HV_HYP_PAGE_SIZE,
> +MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[1]) {
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> + }
> +
> + /*
> 

RE: [PATCH V5 05/12] x86/hyperv: Add Write/Read MSR registers via ghcb page

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> Hyperv provides GHCB protocol to write Synthetic Interrupt
> Controller MSR registers in Isolation VM with AMD SEV SNP
> and these registers are emulated by hypervisor directly.
> Hyperv requires to write SINTx MSR registers twice. First
> writes MSR via GHCB page to communicate with hypervisor
> and then writes wrmsr instruction to talk with paravisor
> which runs in VMPL0. Guest OS ID MSR also needs to be set
> via GHCB page.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v4:
>* Remove hv_get_simp(), hv_get_siefp()  hv_get_synint_*()
>  helper function. Move the logic into hv_get/set_register().
> 
> Change since v3:
>  * Pass old_msg_type to hv_signal_eom() as parameter.
>* Use HV_REGISTER_* marcro instead of HV_X64_MSR_*
>* Add hv_isolation_type_snp() weak function.
>* Add maros to set syinc register in ARM code.
> 
> Change since v1:
>  * Introduce sev_es_ghcb_hv_call_simple() and share code
>  between SEV and Hyper-V code.
> 
> Fix for hyperv: Add Write/Read MSR registers via ghcb page
> ---
>  arch/x86/hyperv/hv_init.c   |  36 +++
>  arch/x86/hyperv/ivm.c   | 103 
>  arch/x86/include/asm/mshyperv.h |  56 -
>  arch/x86/include/asm/sev.h  |   6 ++
>  arch/x86/kernel/sev-shared.c|  63 +++
>  drivers/hv/hv.c |  77 +++-
>  drivers/hv/hv_common.c  |   6 ++
>  include/asm-generic/mshyperv.h  |   2 +
>  8 files changed, 266 insertions(+), 83 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index d57df6825527..a16a83e46a30 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -37,7 +37,7 @@ EXPORT_SYMBOL_GPL(hv_current_partition_id);
>  void *hv_hypercall_pg;
>  EXPORT_SYMBOL_GPL(hv_hypercall_pg);
> 
> -void __percpu **hv_ghcb_pg;
> +union hv_ghcb __percpu **hv_ghcb_pg;
> 
>  /* Storage to save the hypercall page temporarily for hibernation */
>  static void *hv_hypercall_pg_saved;
> @@ -406,7 +406,7 @@ void __init hyperv_init(void)
>   }
> 
>   if (hv_isolation_type_snp()) {
> - hv_ghcb_pg = alloc_percpu(void *);
> + hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
>   if (!hv_ghcb_pg)
>   goto free_vp_assist_page;
>   }
> @@ -424,6 +424,9 @@ void __init hyperv_init(void)
>   guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);
> 
> + /* Hyper-V requires to write guest os id via ghcb in SNP IVM. */
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
> +
>   hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
>   VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
>   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> @@ -501,6 +504,7 @@ void __init hyperv_init(void)
> 
>  clean_guest_os_id:
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
>   cpuhp_remove_state(cpuhp);
>  free_ghcb_page:
>   free_percpu(hv_ghcb_pg);
> @@ -522,6 +526,7 @@ void hyperv_cleanup(void)
> 
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
> 
>   /*
>* Reset hypercall page reference before reset the page,
> @@ -592,30 +597,3 @@ bool hv_is_hyperv_initialized(void)
>   return hypercall_msr.enable;
>  }
>  EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
> -
> -enum hv_isolation_type hv_get_isolation_type(void)
> -{
> - if (!(ms_hyperv.priv_high & HV_ISOLATION))
> - return HV_ISOLATION_TYPE_NONE;
> - return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
> -}
> -EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> -
> -bool hv_is_isolation_supported(void)
> -{
> - if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> - return false;
> -
> - if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> - return false;
> -
> - return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
> -}
> -
> -DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
> -
> -bool hv_isolation_type_snp(void)
> -{
> - return static_branch_unlikely(&isolation_type_snp);
> -}
> -EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 79e7fb83472a..5439723446c9 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -6,12 +6,115 @@
>   *  Tianyu Lan 
>   */
> 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
> +
> +union hv_ghcb {
> + struct ghcb ghcb;
> +} __packed __aligned(HV_HYP_PAGE_SIZE);
> +
> +void hv_ghcb_msr_write(u64 msr, u64 value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long 

RE: [PATCH V5 04/12] Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in Isolation VM

2021-09-15 Thread Michael Kelley
From: Tianyu Lan  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> Mark vmbus ring buffer visible with set_memory_decrypted() when
> establish gpadl handle.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change sincv v4
>   * Change gpadl handle in netvsc and uio driver from u32 to
> struct vmbus_gpadl.
>   * Change vmbus_establish_gpadl()'s gpadl_handle parameter
> to vmbus_gpadl data structure.
> 
> Change since v3:
>   * Change vmbus_teardown_gpadl() parameter and put gpadl handle,
> buffer and buffer size in the struct vmbus_gpadl.
> ---
>  drivers/hv/channel.c| 54 -
>  drivers/net/hyperv/hyperv_net.h |  5 +--
>  drivers/net/hyperv/netvsc.c | 17 ++-
>  drivers/uio/uio_hv_generic.c| 20 ++--
>  include/linux/hyperv.h  | 12 ++--
>  5 files changed, 71 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index f3761c73b074..cf419eb1de77 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> 
> @@ -456,7 +457,7 @@ static int create_gpadl_header(enum hv_gpadl_type type, 
> void *kbuffer,
>  static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
>  enum hv_gpadl_type type, void *kbuffer,
>  u32 size, u32 send_offset,
> -u32 *gpadl_handle)
> +struct vmbus_gpadl *gpadl)
>  {
>   struct vmbus_channel_gpadl_header *gpadlmsg;
>   struct vmbus_channel_gpadl_body *gpadl_body;
> @@ -474,6 +475,15 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   if (ret)
>   return ret;
> 
> + ret = set_memory_decrypted((unsigned long)kbuffer,
> +HVPFN_UP(size));

This should be PFN_UP, not HVPFN_UP.  The numpages parameter to
set_memory_decrypted() is in guest size pages, not Hyper-V size pages.
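i.e.:

	ret = set_memory_decrypted((unsigned long)kbuffer, PFN_UP(size));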

> + if (ret) {
> + dev_warn(&channel->device_obj->device,
> +  "Failed to set host visibility for new GPADL %d.\n",
> +  ret);
> + return ret;
> + }
> +
> + init_completion(&msginfo->waitevent);
>   msginfo->waiting_channel = channel;
> 
> @@ -537,7 +547,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   }
> 
>   /* At this point, we received the gpadl created msg */
> - *gpadl_handle = gpadlmsg->gpadl;
> + gpadl->gpadl_handle = gpadlmsg->gpadl;
> + gpadl->buffer = kbuffer;
> + gpadl->size = size;
> +
> 
>  cleanup:
>   spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> @@ -549,6 +562,11 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   }
> 
>   kfree(msginfo);
> +
> + if (ret)
> + set_memory_encrypted((unsigned long)kbuffer,
> +  HVPFN_UP(size));

Should be PFN_UP as noted on the previous call to set_memory_decrypted().

> +
>   return ret;
>  }
> 
> @@ -561,10 +579,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   * @gpadl_handle: some funky thing
>   */
>  int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
> -   u32 size, u32 *gpadl_handle)
> +   u32 size, struct vmbus_gpadl *gpadl)
>  {
>   return __vmbus_establish_gpadl(channel, HV_GPADL_BUFFER, kbuffer, size,
> -0U, gpadl_handle);
> +0U, gpadl);
>  }
>  EXPORT_SYMBOL_GPL(vmbus_establish_gpadl);
> 
> @@ -639,6 +657,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   struct vmbus_channel_open_channel *open_msg;
>   struct vmbus_channel_msginfo *open_info = NULL;
>   struct page *page = newchannel->ringbuffer_page;
> + struct vmbus_gpadl gpadl;

I think this local variable was needed in a previous version of the patch, but
is now unused and should be deleted.

>   u32 send_pages, recv_pages;
>   unsigned long flags;
>   int err;
> @@ -675,7 +694,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   goto error_clean_ring;
> 
>   /* Establish the gpadl for the ring buffer */
> - newchannel->ringbuffer_gpadlhandle = 0;
> + newchannel->ringbuffer_gpadlhandle.gpadl_handle = 0;
> 
>   err = __vmbus_establish_gpadl(newchannel, HV_GPADL_RING,
> page_address(newchannel->ringbuffer_page),
> @@ -701,7 +720,8 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   open_msg->header.msgtype = CHANNELMSG_OPENCHANNEL;
>   open_msg->openid = newchannel->offermsg.child_relid;
>   open_msg->child_relid = newchannel->offermsg.child_relid;
> - open_msg->ringbuffer_gpadlhandle = newchannel->ringbuffer_gpadlhandle;
> + 

RE: [PATCH V4 08/13] hyperv/vmbus: Initialize VMbus ring buffer for Isolation VM

2021-09-02 Thread Michael Kelley
From: Tianyu Lan  Sent: Thursday, September 2, 2021 6:36 AM
> 
> On 9/2/2021 8:23 AM, Michael Kelley wrote:
> >> +  } else {
> >> +  pages_wraparound = kcalloc(page_cnt * 2 - 1,
> >> + sizeof(struct page *),
> >> + GFP_KERNEL);
> >> +
> >> +  pages_wraparound[0] = pages;
> >> +  for (i = 0; i < 2 * (page_cnt - 1); i++)
> >> +  pages_wraparound[i + 1] =
> >> +  &pages[i % (page_cnt - 1) + 1];
> >> +
> >> +  ring_info->ring_buffer = (struct hv_ring_buffer *)
> >> +  vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP,
> >> +  PAGE_KERNEL);
> >> +
> >> +  kfree(pages_wraparound);
> >> +  if (!ring_info->ring_buffer)
> >> +  return -ENOMEM;
> >> +  }
> > With this patch, the code is a big "if" statement with two halves -- one
> > when SNP isolation is in effect, and the other when not.  The SNP isolation
> > case does the work using PFNs with the shared_gpa_boundary added,
> > while the other case does the same work but using struct page.  Perhaps
> > I'm missing something, but can both halves be combined and always
> > do the work using PFNs?  The only difference is whether to add the
> > shared_gpa_boundary, and whether to zero the memory when done.
> > So get the starting PFN, then have an "if" statement for whether to
> > add the shared_gpa_boundary.  Then everything else is the same.
> > At the end, use an "if" statement to decide whether to zero the
> > memory.  It would really be better to have the logic in this algorithm
> > coded only once.
> >
> 
> Hi Michael:
>   I have tried this before. But vmap_pfn() only works for those pfns out
> of normal memory. Please see vmap_pfn_apply() for detail and
> return error when the PFN is valid.
> 

Indeed.  This ties into the discussion with Christoph about coming up
with generalized helper functions to assist in handling the
shared_gpa_boundary.   Having a single implementation here in
hv_ringbuffer_init() would be a good goal as well.

Michael



RE: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support

2021-09-02 Thread Michael Kelley
From: Christoph Hellwig  Sent: Thursday, September 2, 2021 1:00 AM
> 
> On Tue, Aug 31, 2021 at 05:16:19PM +, Michael Kelley wrote:
> > As a quick overview, I think there are four places where the
> > shared_gpa_boundary must be applied to adjust the guest physical
> > address that is used.  Each requires mapping a corresponding
> > virtual address range.  Here are the four places:
> >
> > 1)  The so-called "monitor pages" that are a core communication
> > mechanism between the guest and Hyper-V.  These are two single
> > pages, and the mapping is handled by calling memremap() for
> > each of the two pages.  See Patch 7 of Tianyu's series.
> 
> Ah, interesting.
> 
> > 3)  The network driver send and receive buffers.  vmap_phys_range()
> > should work here.
> 
> Actually it won't.  The problem with these buffers is that they are
> physically non-contiguous allocations.  

Indeed you are right.  These buffers are allocated with vzalloc().

> We really have two sensible options:
> 
>  1) use vmap_pfn as in the current series.  But in that case I think
> we should get rid of the other mapping created by vmalloc.  I
> though a bit about finding a way to apply the offset in vmalloc
> itself, but I think it would be too invasive to the normal fast
> path.  So the other sub-option would be to allocate the pages
> manually (maybe even using high order allocations to reduce TLB
> pressure) and then remap them

What's the benefit of getting rid of the other mapping created by
vmalloc if it isn't referenced?  Just page table space?  The default sizes
are a 16 Meg receive buffer and a 1 Meg send buffer for each VMbus
channel used by netvsc, and usually the max number of channels
is 8.  So there's 128 Meg of virtual space to be saved on the receive
buffers,  which could be worth it.

Allocating the pages manually is also an option, but we have to
be careful about high order allocations.  While typically these buffers
are allocated during system boot, these synthetic NICs can be hot
added and removed while the VM is running.   The channel count
can also be changed while the VM is running.  So multiple 16 Meg
receive buffer allocations may need to be done after the system has
been running a long time.

>  2) do away with the contiguous kernel mapping entirely.  This means
> the simple memcpy calls become loops over kmap_local_pfn.  As
> I just found out for the send side that would be pretty easy,
> but the receive side would be more work.  We'd also need to check
> the performance implications.

Doing away with the contiguous kernel mapping entirely seems like
it would result in fairly messy code to access the buffer.  What's the
benefit of doing away with the mapping?  I'm not an expert on the
netvsc driver, but decoding the incoming packets is already fraught
with complexities because of the nature of the protocol with Hyper-V.
The contiguous kernel mapping at least keeps the basics sane.
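
To make the tradeoff concrete, each of the simple memcpy calls would turn into
a per-page loop along these lines (sketch only; "pfns", "offset", "src" and
"len" stand for whatever per-page bookkeeping the driver would have to keep):

	while (len) {
		unsigned int chunk = min_t(unsigned int, len,
					   PAGE_SIZE - offset_in_page(offset));
		void *dst = kmap_local_pfn(pfns[offset >> PAGE_SHIFT]);

		memcpy(dst + offset_in_page(offset), src, chunk);
		kunmap_local(dst);

		src += chunk;
		offset += chunk;
		len -= chunk;
	}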

> 
> > 4) The swiotlb memory used for bounce buffers.  vmap_phys_range()
> > should work here as well.
> 
> Or memremap if it works for 1.
> 
> > Case #2 above does unusual mapping.  The ring buffer consists of a ring
> > buffer header page, followed by one or more pages that are the actual
> > ring buffer.  The pages making up the actual ring buffer are mapped
> > twice in succession.  For example, if the ring buffer has 4 pages
> > (one header page and three ring buffer pages), the contiguous
> > virtual mapping must cover these seven pages:  0, 1, 2, 3, 1, 2, 3.
> > The duplicate contiguous mapping allows the code that is reading
> > or writing the actual ring buffer to not be concerned about wrap-around
> > because writing off the end of the ring buffer is automatically
> > wrapped-around by the mapping.  The amount of data read or
> > written in one batch never exceeds the size of the ring buffer, and
> > after a batch is read or written, the read or write indices are adjusted
> > to put them back into the range of the first mapping of the actual
> > ring buffer pages.  So there's method to the madness, and the
> > technique works pretty well.  But this kind of mapping is not
> > amenable to using vmap_phys_range().
> 
> Hmm.  Can you point me to where this is mapped?  Especially for the
> classic non-isolated case where no vmap/vmalloc mapping is involved
> at all?

The existing code is in hv_ringbuffer_init() in drivers/hv/ring_buffer.c.
The code hasn't changed in a while, so any recent upstream code tree
is valid to look at.  The memory pages are typically allocated
in vmbus_alloc_ring() in drivers/hv/channel.c.
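
For reference, the relevant part of hv_ringbuffer_init() is roughly the
following (trimmed, error handling omitted).  The data pages are mapped twice
in succession -- 0, 1 .. N-1, 1 .. N-1 -- so that accesses running off the end
of the ring land back at its beginning:

	pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
				   GFP_KERNEL);

	pages_wraparound[0] = pages;		/* header page */
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];

	ring_info->ring_buffer = (struct hv_ring_buffer *)
		vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);

	kfree(pages_wraparound);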

Michael



RE: [PATCH V4 12/13] hv_netvsc: Add Isolation VM support for netvsc driver

2021-09-01 Thread Michael Kelley
From: Michael Kelley  Sent: Wednesday, September 1, 2021 7:34 PM

[snip]

> > +int netvsc_dma_map(struct hv_device *hv_dev,
> > +  struct hv_netvsc_packet *packet,
> > +  struct hv_page_buffer *pb)
> > +{
> > +   u32 page_count =  packet->cp_partial ?
> > +   packet->page_buf_cnt - packet->rmsg_pgcnt :
> > +   packet->page_buf_cnt;
> > +   dma_addr_t dma;
> > +   int i;
> > +
> > +   if (!hv_is_isolation_supported())
> > +   return 0;
> > +
> > +   packet->dma_range = kcalloc(page_count,
> > +   sizeof(*packet->dma_range),
> > +   GFP_KERNEL);
> > +   if (!packet->dma_range)
> > +   return -ENOMEM;
> > +
> > +   for (i = 0; i < page_count; i++) {
> > +   char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
> > ++ pb[i].offset);
> > +   u32 len = pb[i].len;
> > +
> > +   dma = dma_map_single(&hv_dev->device, src, len,
> > +DMA_TO_DEVICE);
> > +   if (dma_mapping_error(&hv_dev->device, dma)) {
> > +   kfree(packet->dma_range);
> > +   return -ENOMEM;
> > +   }
> > +
> > +   packet->dma_range[i].dma = dma;
> > +   packet->dma_range[i].mapping_size = len;
> > +   pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
> > +   pb[i].offset = offset_in_hvpage(dma);
> > +   pb[i].len = len;
> > +   }
> 
> Just to confirm, this driver does *not* set the DMA min_align_mask
> like storvsc does.  So after the call to dma_map_single(), the offset
> in the page could be different.  That's why you are updating
> the pb[i].offset value.  Alternatively, you could set the DMA
> min_align_mask, which would ensure the offset is unchanged.
> I'm OK with either approach, though perhaps a comment is
> warranted to explain, as this is a subtle issue.
> 

On second thought, I don't think either approach is OK.  The default
alignment in the swiotlb is 2K, and if the length of the data in the
buffer was 3K, the data could cross a page boundary in the bounce
buffer when it originally did not.  This would break the above code
which can only deal with one page at a time.  So I think the netvsc
driver also must set the DMA min_align_mask to 4K, which will
preserve the offset.
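
Something along the lines of what storvsc does, i.e. (sketch; the exact
placement in the netvsc probe path is to be decided):

	/* Keep the Hyper-V page offset intact when bouncing through swiotlb */
	dma_set_min_align_mask(&hv_dev->device, HV_HYP_PAGE_SIZE - 1);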

Michael



RE: [PATCH V4 05/13] hyperv: Add Write/Read MSR registers via ghcb page

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> Hyperv provides GHCB protocol to write Synthetic Interrupt
> Controller MSR registers in Isolation VM with AMD SEV SNP
> and these registers are emulated by hypervisor directly.
> Hyperv requires to write SINTx MSR registers twice. First
> writes MSR via GHCB page to communicate with hypervisor
> and then writes wrmsr instruction to talk with paravisor
> which runs in VMPL0. Guest OS ID MSR also needs to be set
> via GHCB page.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v1:
>  * Introduce sev_es_ghcb_hv_call_simple() and share code
>between SEV and Hyper-V code.
> Change since v3:
>  * Pass old_msg_type to hv_signal_eom() as parameter.
>* Use HV_REGISTER_* marcro instead of HV_X64_MSR_*
>* Add hv_isolation_type_snp() weak function.
>* Add maros to set syinc register in ARM code.
> ---
>  arch/arm64/include/asm/mshyperv.h |  23 ++
>  arch/x86/hyperv/hv_init.c |  36 ++
>  arch/x86/hyperv/ivm.c | 112 ++
>  arch/x86/include/asm/mshyperv.h   |  80 -
>  arch/x86/include/asm/sev.h|   3 +
>  arch/x86/kernel/sev-shared.c  |  63 ++---
>  drivers/hv/hv.c   | 112 --
>  drivers/hv/hv_common.c|   6 ++
>  include/asm-generic/mshyperv.h|   4 +-
>  9 files changed, 345 insertions(+), 94 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mshyperv.h 
> b/arch/arm64/include/asm/mshyperv.h
> index 20070a847304..ced83297e009 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -41,6 +41,29 @@ static inline u64 hv_get_register(unsigned int reg)
>   return hv_get_vpreg(reg);
>  }
> 
> +#define hv_get_simp(val) { val = hv_get_register(HV_REGISTER_SIMP); }
> +#define hv_set_simp(val) hv_set_register(HV_REGISTER_SIMP, val)
> +
> +#define hv_get_siefp(val){ val = hv_get_register(HV_REGISTER_SIEFP); }
> +#define hv_set_siefp(val)hv_set_register(HV_REGISTER_SIEFP, val)
> +
> +#define hv_get_synint_state(int_num, val) {  \
> + val = hv_get_register(HV_REGISTER_SINT0 + int_num); \
> + }
> +
> +#define hv_set_synint_state(int_num, val)\
> + hv_set_register(HV_REGISTER_SINT0 + int_num, val)
> +
> +#define hv_get_synic_state(val) {\
> + val = hv_get_register(HV_REGISTER_SCONTROL);\
> + }
> +
> +#define hv_set_synic_state(val)  \
> + hv_set_register(HV_REGISTER_SCONTROL, val)
> +
> +#define hv_signal_eom(old_msg_type)   \
> + hv_set_register(HV_REGISTER_EOM, 0)
> +
>  /* SMCCC hypercall parameters */
>  #define HV_SMCCC_FUNC_NUMBER 1
>  #define HV_FUNC_ID   ARM_SMCCC_CALL_VAL( \
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index b1aa42f60faa..be6210a3fd2f 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -37,7 +37,7 @@ EXPORT_SYMBOL_GPL(hv_current_partition_id);
>  void *hv_hypercall_pg;
>  EXPORT_SYMBOL_GPL(hv_hypercall_pg);
> 
> -void __percpu **hv_ghcb_pg;
> +union hv_ghcb __percpu **hv_ghcb_pg;
> 
>  /* Storage to save the hypercall page temporarily for hibernation */
>  static void *hv_hypercall_pg_saved;
> @@ -406,7 +406,7 @@ void __init hyperv_init(void)
>   }
> 
>   if (hv_isolation_type_snp()) {
> - hv_ghcb_pg = alloc_percpu(void *);
> + hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
>   if (!hv_ghcb_pg)
>   goto free_vp_assist_page;
>   }
> @@ -424,6 +424,9 @@ void __init hyperv_init(void)
>   guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);
> 
> + /* Hyper-V requires to write guest os id via ghcb in SNP IVM. */
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
> +
>   hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
>   VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
>   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> @@ -501,6 +504,7 @@ void __init hyperv_init(void)
> 
>  clean_guest_os_id:
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
>   cpuhp_remove_state(cpuhp);
>  free_ghcb_page:
>   free_percpu(hv_ghcb_pg);
> @@ -522,6 +526,7 @@ void hyperv_cleanup(void)
> 
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
> 
>   /*
>* Reset hypercall page reference before reset the page,
> @@ -592,30 +597,3 @@ bool hv_is_hyperv_initialized(void)
>   return hypercall_msr.enable;
>  }
>  EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
> -
> -enum hv_isolation_type hv_get_isolation_type(void)
> -{
> - if (!(ms_hyperv.priv_high & HV_ISOLATION))
> - return 

RE: [PATCH V4 12/13] hv_netvsc: Add Isolation VM support for netvsc driver

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() stills need to be handled. Use DMA API to map/umap
> these memory during sending/receiving packet and Hyper-V swiotlb
> bounce buffer dma adress will be returned. The swiotlb bounce buffer
> has been masked to be visible to host during boot up.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * Add comment to explain why not to use dma_map_sg()
>   * Fix some error handle.
> ---
>  arch/x86/hyperv/ivm.c |   1 +
>  drivers/net/hyperv/hyperv_net.h   |   5 ++
>  drivers/net/hyperv/netvsc.c   | 135 +-
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/linux/hyperv.h|   5 ++
>  5 files changed, 145 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 84563b3c9f3a..08d8e01de017 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -317,6 +317,7 @@ void *hv_map_memory(void *addr, unsigned long size)
> 
>   return vaddr;
>  }
> +EXPORT_SYMBOL_GPL(hv_map_memory);
> 
>  void hv_unmap_memory(void *addr)
>  {
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index aa7c9962dbd8..862419912bfb 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
> 
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + void *recv_original_buf;
>   u32 recv_buf_size; /* allocated bytes */
>   u32 recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1084,7 @@ struct netvsc_device {
> 
>   /* Send buffer allocated by us */
>   void *send_buf;
> + void *send_original_buf;
>   u32 send_buf_size;
>   u32 send_buf_gpadl_handle;
>   u32 send_section_cnt;
> @@ -1731,4 +1734,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX2000/* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index f19b6a63..edd336b08c2c 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
>   int i;
> 
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_original_buf) {
> + vunmap(nvdev->recv_buf);

In patch 11, you have added a hv_unmap_memory()
function as the inverse of hv_map_memory().  Since this
buffer was mapped with hv_map_memory() and you have
added that function, the cleanup should use
hv_unmap_memory() rather than calling vunmap() directly.
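
Something like this (untested sketch, using the patch 11 prototype):

	if (nvdev->recv_original_buf) {
		hv_unmap_memory(nvdev->recv_buf);
		vfree(nvdev->recv_original_buf);
	} else {
		vfree(nvdev->recv_buf);
	}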

> + vfree(nvdev->recv_original_buf);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_original_buf) {
> + vunmap(nvdev->send_buf);

Same here.

> + vfree(nvdev->send_original_buf);
> + } else {
> + vfree(nvdev->send_buf);
> + }
> +
>   kfree(nvdev->send_section_map);
> 
>   for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -347,6 +360,7 @@ static int netvsc_init_buf(struct hv_device *device,
>   unsigned int buf_size;
>   size_t map_words;
>   int i, ret = 0;
> + void *vaddr;
> 
>   /* Get receive buffer area. */
>   buf_size = device_info->recv_sections * device_info->recv_section_size;
> @@ -382,6 +396,17 @@ static int netvsc_init_buf(struct hv_device *device,
>   goto cleanup;
>   }
> 
> + if (hv_isolation_type_snp()) {
> + vaddr = hv_map_memory(net_device->recv_buf, buf_size);

Since the netvsc driver is architecture neutral, this code also needs
to compile for ARM64.  A stub will be needed for hv_map_memory()
on the ARM64 side.  Same for hv_unmap_memory() as suggested
above.  Or better, move hv_map_memory() and hv_unmap_memory()
to an architecture neutral module such as hv_common.c.

Or if Christoph's approach of creating the vmap_phys_addr() helper
comes to fruition, that's an even better approach since it will already
handle multiple architectures.
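
If the hv_common.c route is taken, __weak stubs roughly like this
(sketch only) would keep non-x86 builds happy, with the real x86
implementation in ivm.c overriding them:

	/* drivers/hv/hv_common.c -- illustrative only */
	void * __weak hv_map_memory(void *addr, unsigned long size)
	{
		return NULL;
	}
	EXPORT_SYMBOL_GPL(hv_map_memory);

	void __weak hv_unmap_memory(void *addr)
	{
	}
	EXPORT_SYMBOL_GPL(hv_unmap_memory);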

> + if (!vaddr) {
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> +
> + net_device->recv_original_buf = net_device->recv_buf;
> + net_device->recv_buf = vaddr;
> + }
> +
>   /* Notify the 

RE: [PATCH V4 13/13] hv_storvsc: Add Isolation VM support for storvsc driver

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 

Per previous comment, the Subject line tag should be "scsi: storvsc: "

> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> mpb_desc() still needs to be handled. Use DMA API(dma_map_sg) to map
> these memory during sending/receiving packet and return swiotlb bounce
> buffer dma address. In Isolation VM, swiotlb  bounce buffer is marked
> to be visible to host and the swiotlb force mode is enabled.
> 
> Set device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order to
> keep the original data offset in the bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * Rplace dma_map_page with dma_map_sg()
>   * Use for_each_sg() to populate payload->range.pfn_array.
>   * Remove storvsc_dma_map macro
> ---
>  drivers/hv/vmbus_drv.c |  1 +
>  drivers/scsi/storvsc_drv.c | 41 +++---
>  include/linux/hyperv.h |  1 +
>  3 files changed, 18 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index f068e22a5636..270d526fd9de 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2124,6 +2124,7 @@ int vmbus_device_register(struct hv_device 
> *child_device_obj)
>   hv_debug_add_dev_dir(child_device_obj);
> 
>   child_device_obj->device.dma_mask = _dma_mask;
> + child_device_obj->device.dma_parms = _device_obj->dma_parms;
>   return 0;
> 
>  err_kset_unregister:
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 328bb961c281..4f1793be1fdc 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -21,6 +21,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -1312,6 +1314,9 @@ static void storvsc_on_channel_callback(void *context)
>   continue;
>   }
>   request = (struct storvsc_cmd_request 
> *)scsi_cmd_priv(scmnd);
> + if (scsi_sg_count(scmnd))
> + dma_unmap_sg(>device, 
> scsi_sglist(scmnd),
> +  scsi_sg_count(scmnd), 
> scmnd->sc_data_direction);

Use scsi_dma_unmap(), which does exactly what you have written
above. :-)
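
i.e., just:

	scsi_dma_unmap(scmnd);

since scsi_dma_unmap() already does the scsi_sg_count() check internally.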

>   }
> 
>   storvsc_on_receive(stor_device, packet, request);
> @@ -1725,7 +1730,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, 
> struct scsi_cmnd *scmnd)
>   struct hv_host_device *host_dev = shost_priv(host);
>   struct hv_device *dev = host_dev->dev;
>   struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
> - int i;
>   struct scatterlist *sgl;
>   unsigned int sg_count;
>   struct vmscsi_request *vm_srb;
> @@ -1807,10 +1811,11 @@ static int storvsc_queuecommand(struct Scsi_Host 
> *host, struct scsi_cmnd *scmnd)
>   payload_sz = sizeof(cmd_request->mpb);
> 
>   if (sg_count) {
> - unsigned int hvpgoff, hvpfns_to_add;
>   unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
>   unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
> - u64 hvpfn;
> + struct scatterlist *sg;
> + unsigned long hvpfn, hvpfns_to_add;
> + int j, i = 0;
> 
>   if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
> 
> @@ -1824,31 +1829,16 @@ static int storvsc_queuecommand(struct Scsi_Host 
> *host, struct scsi_cmnd *scmnd)
>   payload->range.len = length;
>   payload->range.offset = offset_in_hvpg;
> 
> + if (dma_map_sg(>device, sgl, sg_count,
> + scmnd->sc_data_direction) == 0)
> + return SCSI_MLQUEUE_DEVICE_BUSY;
> 
> - for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
> - /*
> -  * Init values for the current sgl entry. hvpgoff
> -  * and hvpfns_to_add are in units of Hyper-V size
> -  * pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE
> -  * case also handles values of sgl->offset that are
> -  * larger than PAGE_SIZE. Such offsets are handled
> -  * even on other than the first sgl entry, provided
> -  * they are a multiple of PAGE_SIZE.
> -  */

Any reason not to keep this comment?  It's still correct and
mentions important cases that must be handled.

> - hvpgoff = HVPFN_DOWN(sgl->offset);
> - hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
> - hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
> - 

RE: [PATCH V4 11/13] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> hyperv Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory and so enable swiotlb force
> mode to use swiotlb bounce buffer for DMA transaction.
> 
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Swiotlb bounce buffer code calls dma_map_decrypted()
> to mark bounce buffer visible to host and map it in extra
> address space. Populate dma memory decrypted ops with hv
> map/unmap function.
> 
> Hyper-V initalizes swiotlb bounce buffer and default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions which depends on the pci_xen_swiotlb_
> init(). Make pci_xen_swiotlb_init() depends on the hyperv_swiotlb
> _detect() to keep the order.
> 
> The map function vmap_pfn() can't work in the early place
> hyperv_iommu_swiotlb_init() and so initialize swiotlb bounce
> buffer in the hyperv_iommu_swiotlb_later_init().
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>* Get hyperv bounce bufffer size via default swiotlb
>bounce buffer size function and keep default size as
>same as the one in the AMD SEV VM.
> ---
>  arch/x86/hyperv/ivm.c   | 28 +++
>  arch/x86/include/asm/mshyperv.h |  2 ++
>  arch/x86/mm/mem_encrypt.c   |  3 +-
>  arch/x86/xen/pci-swiotlb-xen.c  |  3 +-
>  drivers/hv/vmbus_drv.c  |  3 ++
>  drivers/iommu/hyperv-iommu.c| 61 +
>  include/linux/hyperv.h  |  1 +
>  7 files changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index e761c67e2218..84563b3c9f3a 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -294,3 +294,31 @@ int hv_set_mem_host_visibility(unsigned long addr, int 
> numpages, bool visible)
> 
>   return __hv_set_mem_host_visibility((void *)addr, numpages, visibility);
>  }
> +
> +/*
> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
> + */
> +void *hv_map_memory(void *addr, unsigned long size)
> +{
> + unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
> +   sizeof(unsigned long), GFP_KERNEL);

Should be PAGE_SIZE, not HV_HYP_PAGE_SIZE, since this code
only manipulates guest page tables.  There's no communication with
Hyper-V that requires HV_HYP_PAGE_SIZE.

> + void *vaddr;
> + int i;
> +
> + if (!pfns)
> + return NULL;
> +
> + for (i = 0; i < size / PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(addr + i * PAGE_SIZE) +

Use virt_to_pfn(), not virt_to_hvpfn(), for the same reason.
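
Putting both of the above together, the allocation and loop would read
roughly as follows (sketch; if virt_to_pfn() isn't available on this
path, PHYS_PFN(virt_to_phys(...)) gives the same result):

	pfns = kcalloc(size / PAGE_SIZE, sizeof(unsigned long), GFP_KERNEL);
	...
	for (i = 0; i < size / PAGE_SIZE; i++)
		pfns[i] = virt_to_pfn(addr + i * PAGE_SIZE) +
			  (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);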

> + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
> + kfree(pfns);
> +
> + return vaddr;
> +}
> +
> +void hv_unmap_memory(void *addr)
> +{
> + vunmap(addr);
> +}
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index b77f4caee3ee..627fcf8d443c 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -252,6 +252,8 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct 
> hv_interrupt_entry *entry);
>  int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  enum hv_mem_host_visibility visibility);
>  int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool 
> visible);
> +void *hv_map_memory(void *addr, unsigned long size);
> +void hv_unmap_memory(void *addr);
>  void hv_sint_wrmsrl_ghcb(u64 msr, u64 value);
>  void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
>  void hv_signal_eom_ghcb(void);
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index ff08dc463634..e2db0b8ed938 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include "mm_internal.h"
> 
> @@ -202,7 +203,7 @@ void __init sev_setup_arch(void)
>   phys_addr_t total_mem = memblock_phys_mem_size();
>   unsigned long size;
> 
> - if (!sev_active())
> + if (!sev_active() && !hv_is_isolation_supported())
>   return;
> 
>   /*
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 54f9aa7e8457..43bd031aa332 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ 

RE: [PATCH V4 08/13] hyperv/vmbus: Initialize VMbus ring buffer for Isolation VM

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 

Subject tag should be "Drivers: hv: vmbus: "

> VMbus ring buffer are shared with host and it's need to
> be accessed via extra address space of Isolation VM with
> AMD SNP support. This patch is to map the ring buffer
> address in extra address space via vmap_pfn(). Hyperv set
> memory host visibility hvcall smears data in the ring buffer
> and so reset the ring buffer memory to zero after mapping.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * Remove hv_ringbuffer_post_init(), merge map
>   operation for Isolation VM into hv_ringbuffer_init()
>   * Call hv_ringbuffer_init() after __vmbus_establish_gpadl().
> ---
>  drivers/hv/Kconfig   |  1 +
>  drivers/hv/channel.c | 19 +++---
>  drivers/hv/ring_buffer.c | 56 ++--
>  3 files changed, 54 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index d1123ceb38f3..dd12af20e467 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -8,6 +8,7 @@ config HYPERV
>   || (ARM64 && !CPU_BIG_ENDIAN))
>   select PARAVIRT
>   select X86_HV_CALLBACK_VECTOR if X86
> + select VMAP_PFN
>   help
> Select this option to run Linux as a Hyper-V client operating
> system.
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index 82650beb3af0..81f8629e4491 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -679,15 +679,6 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   if (!newchannel->max_pkt_size)
>   newchannel->max_pkt_size = VMBUS_DEFAULT_MAX_PKT_SIZE;
> 
> - err = hv_ringbuffer_init(>outbound, page, send_pages, 0);
> - if (err)
> - goto error_clean_ring;
> -
> - err = hv_ringbuffer_init(>inbound, [send_pages],
> -  recv_pages, newchannel->max_pkt_size);
> - if (err)
> - goto error_clean_ring;
> -
>   /* Establish the gpadl for the ring buffer */
>   newchannel->ringbuffer_gpadlhandle = 0;
> 
> @@ -699,6 +690,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   if (err)
>   goto error_clean_ring;
> 
> + err = hv_ringbuffer_init(>outbound,
> +  page, send_pages, 0);
> + if (err)
> + goto error_free_gpadl;
> +
> + err = hv_ringbuffer_init(>inbound, [send_pages],
> +  recv_pages, newchannel->max_pkt_size);
> + if (err)
> + goto error_free_gpadl;
> +
>   /* Create and init the channel open message */
>   open_info = kzalloc(sizeof(*open_info) +
>  sizeof(struct vmbus_channel_open_channel),
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 2aee356840a2..24d64d18eb65 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -17,6 +17,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "hyperv_vmbus.h"
> 
> @@ -183,8 +185,10 @@ void hv_ringbuffer_pre_init(struct vmbus_channel 
> *channel)
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  struct page *pages, u32 page_cnt, u32 max_pkt_size)
>  {
> - int i;
>   struct page **pages_wraparound;
> + unsigned long *pfns_wraparound;
> + u64 pfn;
> + int i;
> 
>   BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
> 
> @@ -192,23 +196,49 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info 
> *ring_info,
>* First page holds struct hv_ring_buffer, do wraparound mapping for
>* the rest.
>*/
> - pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
> -GFP_KERNEL);
> - if (!pages_wraparound)
> - return -ENOMEM;
> + if (hv_isolation_type_snp()) {
> + pfn = page_to_pfn(pages) +
> + HVPFN_DOWN(ms_hyperv.shared_gpa_boundary);

Use PFN_DOWN, not HVPFN_DOWN.  This is all done in units of guest page
size, not Hyper-V page size.
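
i.e.:

	pfn = page_to_pfn(pages) + PFN_DOWN(ms_hyperv.shared_gpa_boundary);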

> 
> - pages_wraparound[0] = pages;
> - for (i = 0; i < 2 * (page_cnt - 1); i++)
> - pages_wraparound[i + 1] = [i % (page_cnt - 1) + 1];
> + pfns_wraparound = kcalloc(page_cnt * 2 - 1,
> + sizeof(unsigned long), GFP_KERNEL);
> + if (!pfns_wraparound)
> + return -ENOMEM;
> 
> - ring_info->ring_buffer = (struct hv_ring_buffer *)
> - vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);
> + pfns_wraparound[0] = pfn;
> + for (i = 0; i < 2 * (page_cnt - 1); i++)
> + pfns_wraparound[i + 1] = pfn + i % (page_cnt - 1) + 1;
> 
> - kfree(pages_wraparound);
> + ring_info->ring_buffer = (struct hv_ring_buffer *)
> + vmap_pfn(pfns_wraparound, page_cnt * 2 - 1,
> +  

RE: [PATCH V4 07/13] hyperv/Vmbus: Add SNP support for VMbus channel initiate message

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 

Subject line tag should be "Drivers: hv: vmbus:"

> The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
> with host in Isolation VM and so it's necessary to use hvcall to set
> them visible to host. In Isolation VM with AMD SEV SNP, the access
> address should be in the extra space which is above shared gpa
> boundary. So remap these pages into the extra address(pa +
> shared_gpa_boundary).
> 
> Introduce monitor_pages_original[] in the struct vmbus_connection
> to store monitor page virtual address returned by hv_alloc_hyperv_
> zeroed_page() and free monitor page via monitor_pages_original in
> the vmbus_disconnect(). The monitor_pages[] is to used to access
> monitor page and it is initialized to be equal with monitor_pages_
> original. The monitor_pages[] will be overridden in the isolation VM
> with va of extra address.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * Rename monitor_pages_va with monitor_pages_original
>   * free monitor page via monitor_pages_original and
> monitor_pages is used to access monitor page.
> 
> Change since v1:
> * Not remap monitor pages in the non-SNP isolation VM.
> ---
>  drivers/hv/connection.c   | 75 ---
>  drivers/hv/hyperv_vmbus.h |  1 +
>  2 files changed, 72 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 6d315c1465e0..9a48d8115c87 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "hyperv_vmbus.h"
> @@ -104,6 +105,12 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
> 
>   msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
>   msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> +
> + if (hv_isolation_type_snp()) {
> + msg->monitor_page1 += ms_hyperv.shared_gpa_boundary;
> + msg->monitor_page2 += ms_hyperv.shared_gpa_boundary;
> + }
> +
>   msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
> 
>   /*
> @@ -148,6 +155,35 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
>   return -ECONNREFUSED;
>   }
> 
> +
> + if (hv_is_isolation_supported()) {
> + if (hv_isolation_type_snp()) {
> + vmbus_connection.monitor_pages[0]
> + = memremap(msg->monitor_page1, HV_HYP_PAGE_SIZE,
> +MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[0])
> + return -ENOMEM;
> +
> + vmbus_connection.monitor_pages[1]
> + = memremap(msg->monitor_page2, HV_HYP_PAGE_SIZE,
> +MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[1]) {
> + memunmap(vmbus_connection.monitor_pages[0]);
> + return -ENOMEM;
> + }
> + }
> +
> + /*
> +  * Set memory host visibility hvcall smears memory
> +  * and so zero monitor pages here.
> +  */
> + memset(vmbus_connection.monitor_pages[0], 0x00,
> +HV_HYP_PAGE_SIZE);
> + memset(vmbus_connection.monitor_pages[1], 0x00,
> +HV_HYP_PAGE_SIZE);
> +
> + }

I still find it somewhat confusing to have the handling of the
shared_gpa_boundary and memory mapping in the function for
negotiating the VMbus version.  I think the code works as written,
but it would seem cleaner and easier to understand to precompute
the physical addresses and do all the mapping and memory zero'ing
in a single place in vmbus_connect().  Then the negotiate version
function can focus on doing only the version negotiation.
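
Roughly what I have in mind, as an untested sketch (the
monitor_pages_pa[] field name is just a placeholder, and error
handling/unwind is omitted):

	/*
	 * In vmbus_connect(), after the monitor pages are allocated and
	 * their visibility to the host has been set.
	 */
	vmbus_connection.monitor_pages_original[0] =
		vmbus_connection.monitor_pages[0];
	vmbus_connection.monitor_pages_original[1] =
		vmbus_connection.monitor_pages[1];
	vmbus_connection.monitor_pages_pa[0] =
		virt_to_phys(vmbus_connection.monitor_pages[0]);
	vmbus_connection.monitor_pages_pa[1] =
		virt_to_phys(vmbus_connection.monitor_pages[1]);

	if (hv_is_isolation_supported()) {
		vmbus_connection.monitor_pages_pa[0] +=
			ms_hyperv.shared_gpa_boundary;
		vmbus_connection.monitor_pages_pa[1] +=
			ms_hyperv.shared_gpa_boundary;

		if (hv_isolation_type_snp()) {
			vmbus_connection.monitor_pages[0] =
				memremap(vmbus_connection.monitor_pages_pa[0],
					 HV_HYP_PAGE_SIZE, MEMREMAP_WB);
			vmbus_connection.monitor_pages[1] =
				memremap(vmbus_connection.monitor_pages_pa[1],
					 HV_HYP_PAGE_SIZE, MEMREMAP_WB);
		}

		/* The visibility hvcall smears the pages, so re-zero them */
		memset(vmbus_connection.monitor_pages[0], 0x00, HV_HYP_PAGE_SIZE);
		memset(vmbus_connection.monitor_pages[1], 0x00, HV_HYP_PAGE_SIZE);
	}

Then vmbus_negotiate_version() only has to copy the precomputed
monitor_pages_pa[0]/[1] values into msg->monitor_page1/2.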

> +
>   return ret;
>  }
> 
> @@ -159,6 +195,7 @@ int vmbus_connect(void)
>   struct vmbus_channel_msginfo *msginfo = NULL;
>   int i, ret = 0;
>   __u32 version;
> + u64 pfn[2];
> 
>   /* Initialize the vmbus connection */
>   vmbus_connection.conn_state = CONNECTING;
> @@ -216,6 +253,21 @@ int vmbus_connect(void)
>   goto cleanup;
>   }
> 
> + vmbus_connection.monitor_pages_original[0]
> + = vmbus_connection.monitor_pages[0];
> + vmbus_connection.monitor_pages_original[1]
> + = vmbus_connection.monitor_pages[1];
> +
> + if (hv_is_isolation_supported()) {
> + pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
> + pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
> + if (hv_mark_gpa_visibility(2, pfn,
> + VMBUS_PAGE_VISIBLE_READ_WRITE)) {

In Patch 4 of 

RE: [PATCH V4 06/13] hyperv: Add ghcb hvcall support for SNP VM

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 

Subject line tag should probably be "x86/hyperv:" since the majority
of the code added is under arch/x86.

> hyperv provides ghcb hvcall to handle VMBus
> HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE
> msg in SNP Isolation VM. Add such support.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * Add hv_ghcb_hypercall() stub function to avoid
> compile error for ARM.
> ---
>  arch/x86/hyperv/ivm.c  | 71 ++
>  drivers/hv/connection.c|  6 ++-
>  drivers/hv/hv.c|  8 +++-
>  drivers/hv/hv_common.c |  6 +++
>  include/asm-generic/mshyperv.h |  1 +
>  5 files changed, 90 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index f56fe4f73000..e761c67e2218 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -17,10 +17,81 @@
>  #include 
>  #include 
> 
> +#define GHCB_USAGE_HYPERV_CALL   1
> +
>  union hv_ghcb {
>   struct ghcb ghcb;
> + struct {
> + u64 hypercalldata[509];
> + u64 outputgpa;
> + union {
> + union {
> + struct {
> + u32 callcode: 16;
> + u32 isfast  : 1;
> + u32 reserved1   : 14;
> + u32 isnested: 1;
> + u32 countofelements : 12;
> + u32 reserved2   : 4;
> + u32 repstartindex   : 12;
> + u32 reserved3   : 4;
> + };
> + u64 asuint64;
> + } hypercallinput;
> + union {
> + struct {
> + u16 callstatus;
> + u16 reserved1;
> + u32 elementsprocessed : 12;
> + u32 reserved2 : 20;
> + };
> + u64 asunit64;
> + } hypercalloutput;
> + };
> + u64 reserved2;
> + } hypercall;
>  } __packed __aligned(HV_HYP_PAGE_SIZE);
> 
> +u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!hv_ghcb_pg)
> + return -EFAULT;
> +
> + WARN_ON(in_nmi());
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return -EFAULT;
> + }
> +
> + hv_ghcb->ghcb.protocol_version = GHCB_PROTOCOL_MAX;
> + hv_ghcb->ghcb.ghcb_usage = GHCB_USAGE_HYPERV_CALL;
> +
> + hv_ghcb->hypercall.outputgpa = (u64)output;
> + hv_ghcb->hypercall.hypercallinput.asuint64 = 0;
> + hv_ghcb->hypercall.hypercallinput.callcode = control;
> +
> + if (input_size)
> + memcpy(hv_ghcb->hypercall.hypercalldata, input, input_size);
> +
> + VMGEXIT();
> +
> + hv_ghcb->ghcb.ghcb_usage = 0x;
> + memset(hv_ghcb->ghcb.save.valid_bitmap, 0,
> +sizeof(hv_ghcb->ghcb.save.valid_bitmap));
> +
> + local_irq_restore(flags);
> +
> + return hv_ghcb->hypercall.hypercalloutput.callstatus;

The hypercall.hypercalloutput.callstatus value must be saved
in a local variable *before* the call to local_irq_restore().  Then
the local variable is the return value.  Once local_irq_restore()
is called, the GHCB page could get reused.
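
Something like (sketch):

	u64 status;
	...
	status = hv_ghcb->hypercall.hypercalloutput.callstatus;

	local_irq_restore(flags);

	return status;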

> +}
> +
>  void hv_ghcb_msr_write(u64 msr, u64 value)
>  {
>   union hv_ghcb *hv_ghcb;
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 5e479d54918c..6d315c1465e0 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -447,6 +447,10 @@ void vmbus_set_event(struct vmbus_channel *channel)
> 
>   ++channel->sig_events;
> 
> - hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
> + if (hv_isolation_type_snp())
> + hv_ghcb_hypercall(HVCALL_SIGNAL_EVENT, >sig_event,
> + NULL, sizeof(u64));

Better to use "sizeof(channel->sig_event)" instead of explicitly coding
the type.
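
i.e.:

	hv_ghcb_hypercall(HVCALL_SIGNAL_EVENT, &channel->sig_event,
			  NULL, sizeof(channel->sig_event));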

> + else
> + hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
>  }
>  EXPORT_SYMBOL_GPL(vmbus_set_event);
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 97b21256a9db..d4531c64d9d3 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -98,7 +98,13 @@ int hv_post_message(union hv_connection_id connection_id,
>   aligned_msg->payload_size = payload_size;
>   

RE: [PATCH V4 04/13] hyperv: Mark vmbus ring buffer visible to host in Isolation VM

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> Mark vmbus ring buffer visible with set_memory_decrypted() when
> establish gpadl handle.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>* Change vmbus_teardown_gpadl() parameter and put gpadl handle,
>buffer and buffer size in the struct vmbus_gpadl.
> ---
>  drivers/hv/channel.c| 36 -
>  drivers/net/hyperv/hyperv_net.h |  1 +
>  drivers/net/hyperv/netvsc.c | 16 +++
>  drivers/uio/uio_hv_generic.c| 14 +++--
>  include/linux/hyperv.h  |  8 +++-
>  5 files changed, 63 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index f3761c73b074..82650beb3af0 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> 
> @@ -474,6 +475,13 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   if (ret)
>   return ret;
> 
> + ret = set_memory_decrypted((unsigned long)kbuffer,
> +HVPFN_UP(size));
> + if (ret) {
> + pr_warn("Failed to set host visibility for new GPADL %d.\n", 
> ret);
> + return ret;
> + }
> +
>   init_completion(>waitevent);
>   msginfo->waiting_channel = channel;
> 
> @@ -549,6 +557,11 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   }
> 
>   kfree(msginfo);
> +
> + if (ret)
> + set_memory_encrypted((unsigned long)kbuffer,
> +  HVPFN_UP(size));
> +
>   return ret;
>  }
> 
> @@ -639,6 +652,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   struct vmbus_channel_open_channel *open_msg;
>   struct vmbus_channel_msginfo *open_info = NULL;
>   struct page *page = newchannel->ringbuffer_page;
> + struct vmbus_gpadl gpadl;
>   u32 send_pages, recv_pages;
>   unsigned long flags;
>   int err;
> @@ -759,7 +773,10 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>  error_free_info:
>   kfree(open_info);
>  error_free_gpadl:
> - vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle);
> + gpadl.gpadl_handle = newchannel->ringbuffer_gpadlhandle;
> + gpadl.buffer = page_address(newchannel->ringbuffer_page);
> + gpadl.size = (send_pages + recv_pages) << PAGE_SHIFT;
> + vmbus_teardown_gpadl(newchannel, );
>   newchannel->ringbuffer_gpadlhandle = 0;
>  error_clean_ring:
>   hv_ringbuffer_cleanup(>outbound);
> @@ -806,7 +823,7 @@ EXPORT_SYMBOL_GPL(vmbus_open);
>  /*
>   * vmbus_teardown_gpadl -Teardown the specified GPADL handle
>   */
> -int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
> +int vmbus_teardown_gpadl(struct vmbus_channel *channel, struct vmbus_gpadl 
> *gpadl)
>  {
>   struct vmbus_channel_gpadl_teardown *msg;
>   struct vmbus_channel_msginfo *info;
> @@ -825,7 +842,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, 
> u32 gpadl_handle)
> 
>   msg->header.msgtype = CHANNELMSG_GPADL_TEARDOWN;
>   msg->child_relid = channel->offermsg.child_relid;
> - msg->gpadl = gpadl_handle;
> + msg->gpadl = gpadl->gpadl_handle;
> 
>   spin_lock_irqsave(_connection.channelmsg_lock, flags);
>   list_add_tail(>msglistentry,
> @@ -859,6 +876,12 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, 
> u32 gpadl_handle)
>   spin_unlock_irqrestore(_connection.channelmsg_lock, flags);
> 
>   kfree(info);
> +
> + ret = set_memory_encrypted((unsigned long)gpadl->buffer,
> +HVPFN_UP(gpadl->size));
> + if (ret)
> + pr_warn("Fail to set mem host visibility in GPADL teardown 
> %d.\n", ret);
> +
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
> @@ -896,6 +919,7 @@ void vmbus_reset_channel_cb(struct vmbus_channel *channel)
>  static int vmbus_close_internal(struct vmbus_channel *channel)
>  {
>   struct vmbus_channel_close_channel *msg;
> + struct vmbus_gpadl gpadl;
>   int ret;
> 
>   vmbus_reset_channel_cb(channel);
> @@ -934,8 +958,10 @@ static int vmbus_close_internal(struct vmbus_channel 
> *channel)
> 
>   /* Tear down the gpadl for the channel's ring buffer */
>   else if (channel->ringbuffer_gpadlhandle) {
> - ret = vmbus_teardown_gpadl(channel,
> -channel->ringbuffer_gpadlhandle);
> + gpadl.gpadl_handle = channel->ringbuffer_gpadlhandle;
> + gpadl.buffer = page_address(channel->ringbuffer_page);
> + gpadl.size = channel->ringbuffer_pagecount;
> + ret = vmbus_teardown_gpadl(channel, );
>   if (ret) {
>   pr_err("Close failed: teardown gpadl return %d\n", ret);
>   /*
> diff --git 

RE: [PATCH V4 03/13] x86/hyperv: Add new hvcall guest address host visibility support

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> Add new hvcall guest address host visibility support to mark
> memory visible to host. Call it inside set_memory_decrypted
> /encrypted(). Add HYPERVISOR feature check in the
> hv_is_isolation_supported() to optimize in non-virtualization
> environment.
> 
> Acked-by: Dave Hansen 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * Fix error code handle in the __hv_set_mem_host_visibility().
>   * Move HvCallModifySparseGpaPageHostVisibility near to enum
> hv_mem_host_visibility.
> 
> Change since v2:
>* Rework __set_memory_enc_dec() and call Hyper-V and AMD function
>  according to platform check.
> 
> Change since v1:
>* Use new staic call x86_set_memory_enc to avoid add Hyper-V
>  specific check in the set_memory code.
> ---
>  arch/x86/hyperv/Makefile   |   2 +-
>  arch/x86/hyperv/hv_init.c  |   6 ++
>  arch/x86/hyperv/ivm.c  | 113 +
>  arch/x86/include/asm/hyperv-tlfs.h |  17 +
>  arch/x86/include/asm/mshyperv.h|   4 +-
>  arch/x86/mm/pat/set_memory.c   |  19 +++--
>  include/asm-generic/hyperv-tlfs.h  |   1 +
>  include/asm-generic/mshyperv.h |   1 +
>  8 files changed, 156 insertions(+), 7 deletions(-)
>  create mode 100644 arch/x86/hyperv/ivm.c
> 
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index 48e2c51464e8..5d2de10809ae 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -obj-y:= hv_init.o mmu.o nested.o irqdomain.o
> +obj-y:= hv_init.o mmu.o nested.o irqdomain.o ivm.o
>  obj-$(CONFIG_X86_64) += hv_apic.o hv_proc.o
> 
>  ifdef CONFIG_X86_64
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index eba10ed4f73e..b1aa42f60faa 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -603,6 +603,12 @@ EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> 
>  bool hv_is_isolation_supported(void)
>  {
> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> + return 0;

Use "return false" per previous comment from Wei Liu.

> +
> + if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> + return 0;

Use "return false".

> +
>   return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
>  }
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> new file mode 100644
> index ..a069c788ce3c
> --- /dev/null
> +++ b/arch/x86/hyperv/ivm.c
> @@ -0,0 +1,113 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Hyper-V Isolation VM interface with paravisor and hypervisor
> + *
> + * Author:
> + *  Tianyu Lan 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
> + *
> + * In Isolation VM, all guest memory is encripted from host and guest

s/encripted/encrypted/

> + * needs to set memory visible to host via hvcall before sharing memory
> + * with host.
> + */
> +int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
> +enum hv_mem_host_visibility visibility)
> +{
> + struct hv_gpa_range_for_visibility **input_pcpu, *input;
> + u16 pages_processed;
> + u64 hv_status;
> + unsigned long flags;
> +
> + /* no-op if partition isolation is not enabled */
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
> + pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
> + HV_MAX_MODIFY_GPA_REP_COUNT);
> + return -EINVAL;
> + }
> +
> + local_irq_save(flags);
> + input_pcpu = (struct hv_gpa_range_for_visibility **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> + input = *input_pcpu;
> + if (unlikely(!input)) {
> + local_irq_restore(flags);
> + return -EINVAL;
> + }
> +
> + input->partition_id = HV_PARTITION_ID_SELF;
> + input->host_visibility = visibility;
> + input->reserved0 = 0;
> + input->reserved1 = 0;
> + memcpy((void *)input->gpa_page_list, pfn, count * sizeof(*pfn));
> + hv_status = hv_do_rep_hypercall(
> + HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
> + 0, input, _processed);
> + local_irq_restore(flags);
> +
> + if (hv_result_success(hv_status))
> + return 0;
> + else
> + return -EFAULT;
> +}
> +EXPORT_SYMBOL(hv_mark_gpa_visibility);

In later comments on Patch 7 of this series, I have suggested that
code in that patch should not call hv_mark_gpa_visibility() directly,
but instead should call set_memory_encrypted() and
set_memory_decrypted().  I'm thinking that those functions should
be the standard way to change the visibility of pages in the 

RE: [PATCH V4 02/13] x86/hyperv: Initialize shared memory boundary in the Isolation VM.

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> Hyper-V exposes shared memory boundary via cpuid
> HYPERV_CPUID_ISOLATION_CONFIG and store it in the
> shared_gpa_boundary of ms_hyperv struct. This prepares
> to share memory with host for SNP guest.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v3:
>   * user BIT_ULL to get shared_gpa_boundary
>   * Rename field Reserved* to reserved
> ---
>  arch/x86/kernel/cpu/mshyperv.c |  2 ++
>  include/asm-generic/mshyperv.h | 12 +++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 20557a9d6e25..8bb001198316 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -313,6 +313,8 @@ static void __init ms_hyperv_init_platform(void)
>   if (ms_hyperv.priv_high & HV_ISOLATION) {
>   ms_hyperv.isolation_config_a = 
> cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
>   ms_hyperv.isolation_config_b = 
> cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
> + ms_hyperv.shared_gpa_boundary =
> + BIT_ULL(ms_hyperv.shared_gpa_boundary_bits);
> 
>   pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 
> 0x%x\n",
>   ms_hyperv.isolation_config_a, 
> ms_hyperv.isolation_config_b);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 0924bbd8458e..7537ae1db828 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -35,7 +35,17 @@ struct ms_hyperv_info {
>   u32 max_vp_index;
>   u32 max_lp_index;
>   u32 isolation_config_a;
> - u32 isolation_config_b;
> + union {
> + u32 isolation_config_b;
> + struct {
> + u32 cvm_type : 4;
> + u32 reserved11 : 1;
> + u32 shared_gpa_boundary_active : 1;
> + u32 shared_gpa_boundary_bits : 6;
> + u32 reserved12 : 20;

I'm still curious about the "11" and "12" in the reserved
field names.  Why not just "reserved1" and "reserved2"?
Having the "11" and "12" isn't wrong, but it makes one
wonder why since it's not usual. :-)

> + };
> + };
> + u64 shared_gpa_boundary;
>  };
>  extern struct ms_hyperv_info ms_hyperv;
> 
> --
> 2.25.1




RE: [PATCH V4 01/13] x86/hyperv: Initialize GHCB page in Isolation VM

2021-09-01 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 27, 2021 10:21 AM
> 
> Hyperv exposes GHCB page via SEV ES GHCB MSR for SNP guest
> to communicate with hypervisor. Map GHCB page for all
> cpus to read/write MSR register and submit hvcall request
> via ghcb page.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Chagne since v3:
> * Rename ghcb_base to hv_ghcb_pg and move it out of
> struct ms_hyperv_info.
>   * Allocate hv_ghcb_pg before cpuhp_setup_state() and leverage
> hv_cpu_init() to initialize ghcb page.
> ---
>  arch/x86/hyperv/hv_init.c   | 68 +
>  arch/x86/include/asm/mshyperv.h |  4 ++
>  arch/x86/kernel/cpu/mshyperv.c  |  3 ++
>  include/asm-generic/mshyperv.h  |  1 +
>  4 files changed, 69 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 708a2712a516..eba10ed4f73e 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -36,12 +37,42 @@ EXPORT_SYMBOL_GPL(hv_current_partition_id);
>  void *hv_hypercall_pg;
>  EXPORT_SYMBOL_GPL(hv_hypercall_pg);
> 
> +void __percpu **hv_ghcb_pg;
> +
>  /* Storage to save the hypercall page temporarily for hibernation */
>  static void *hv_hypercall_pg_saved;
> 
>  struct hv_vp_assist_page **hv_vp_assist_page;
>  EXPORT_SYMBOL_GPL(hv_vp_assist_page);
> 
> +static int hyperv_init_ghcb(void)
> +{
> + u64 ghcb_gpa;
> + void *ghcb_va;
> + void **ghcb_base;
> +
> + if (!hv_isolation_type_snp())
> + return 0;
> +
> + if (!hv_ghcb_pg)
> + return -EINVAL;
> +
> + /*
> +  * GHCB page is allocated by paravisor. The address
> +  * returned by MSR_AMD64_SEV_ES_GHCB is above shared
> +  * ghcb boundary and map it here.

I'm not sure what the "shared ghcb boundary" is.  Did you
mean "shared_gpa_boundary"?

> +  */
> + rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
> + ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> + if (!ghcb_va)
> + return -ENOMEM;
> +
> + ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
> + *ghcb_base = ghcb_va;
> +
> + return 0;
> +}
> +
>  static int hv_cpu_init(unsigned int cpu)
>  {
>   union hv_vp_assist_msr_contents msr = { 0 };
> @@ -85,7 +116,7 @@ static int hv_cpu_init(unsigned int cpu)
>   }
>   }
> 
> - return 0;
> + return hyperv_init_ghcb();
>  }
> 
>  static void (*hv_reenlightenment_cb)(void);
> @@ -177,6 +208,14 @@ static int hv_cpu_die(unsigned int cpu)
>  {
>   struct hv_reenlightenment_control re_ctrl;
>   unsigned int new_cpu;
> + void **ghcb_va;
> +
> + if (hv_ghcb_pg) {
> + ghcb_va = (void **)this_cpu_ptr(hv_ghcb_pg);
> + if (*ghcb_va)
> + memunmap(*ghcb_va);
> + *ghcb_va = NULL;
> + }
> 
>   hv_common_cpu_die(cpu);
> 
> @@ -366,10 +405,16 @@ void __init hyperv_init(void)
>   goto common_free;
>   }
> 
> + if (hv_isolation_type_snp()) {
> + hv_ghcb_pg = alloc_percpu(void *);
> + if (!hv_ghcb_pg)
> + goto free_vp_assist_page;
> + }
> +
>   cpuhp = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
> hv_cpu_init, hv_cpu_die);
>   if (cpuhp < 0)
> - goto free_vp_assist_page;
> + goto free_ghcb_page;
> 
>   /*
>* Setup the hypercall page and enable hypercalls.
> @@ -383,10 +428,8 @@ void __init hyperv_init(void)
>   VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
>   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
>   __builtin_return_address(0));
> - if (hv_hypercall_pg == NULL) {
> - wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> - goto remove_cpuhp_state;
> - }
> + if (hv_hypercall_pg == NULL)
> + goto clean_guest_os_id;
> 
>   rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
>   hypercall_msr.enable = 1;
> @@ -456,8 +499,11 @@ void __init hyperv_init(void)
>   hv_query_ext_cap(0);
>   return;
> 
> -remove_cpuhp_state:
> +clean_guest_os_id:
> + wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>   cpuhp_remove_state(cpuhp);
> +free_ghcb_page:
> + free_percpu(hv_ghcb_pg);
>  free_vp_assist_page:
>   kfree(hv_vp_assist_page);
>   hv_vp_assist_page = NULL;
> @@ -559,3 +605,11 @@ bool hv_is_isolation_supported(void)
>  {
>   return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
>  }
> +
> +DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
> +
> +bool hv_isolation_type_snp(void)
> +{
> + return static_branch_unlikely(_type_snp);
> +}
> +EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index adccbc209169..37739a277ac6 100644
> --- 

RE: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support

2021-08-31 Thread Michael Kelley
From: Christoph Hellwig  Sent: Monday, August 30, 2021 5:01 AM
> 
> Sorry for the delayed answer, but I look at the vmap_pfn usage in the
> previous version and tried to come up with a better version.  This
> mostly untested branch:
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap
> 
> get us there for swiotlb and the channel infrastructure  I've started
> looking at the network driver and didn't get anywhere due to other work.
> 
> As far as I can tell the network driver does gigantic multi-megabyte
> vmalloc allocation for the send and receive buffers, which are then
> passed to the hardware, but always copied to/from when interacting
> with the networking stack.  Did I see that right?  Are these big
> buffers actually required unlike the normal buffer management schemes
> in other Linux network drivers?
> 
> If so I suspect the best way to allocate them is by not using vmalloc
> but just discontiguous pages, and then use kmap_local_pfn where the
> PFN includes the share_gpa offset when actually copying from/to the
> skbs.

As a quick overview, I think there are four places where the
shared_gpa_boundary must be applied to adjust the guest physical
address that is used.  Each requires mapping a corresponding
virtual address range.  Here are the four places:

1)  The so-called "monitor pages" that are a core communication
mechanism between the guest and Hyper-V.  These are two single
pages, and the mapping is handled by calling memremap() for
each of the two pages.  See Patch 7 of Tianyu's series.

2)  The VMbus channel ring buffers.  You have proposed using
your new  vmap_phys_range() helper, but I don't think that works
here.  More details below.

3)  The network driver send and receive buffers.  vmap_phys_range()
should work here.

4) The swiotlb memory used for bounce buffers.  vmap_phys_range()
should work here as well.

Case #2 above does unusual mapping.  The ring buffer consists of a ring
buffer header page, followed by one or more pages that are the actual
ring buffer.  The pages making up the actual ring buffer are mapped
twice in succession.  For example, if the ring buffer has 4 pages
(one header page and three ring buffer pages), the contiguous
virtual mapping must cover these seven pages:  0, 1, 2, 3, 1, 2, 3.
The duplicate contiguous mapping allows the code that is reading
or writing the actual ring buffer to not be concerned about wrap-around
because writing off the end of the ring buffer is automatically
wrapped-around by the mapping.  The amount of data read or
written in one batch never exceeds the size of the ring buffer, and
after a batch is read or written, the read or write indices are adjusted
to put them back into the range of the first mapping of the actual
ring buffer pages.  So there's method to the madness, and the
technique works pretty well.  But this kind of mapping is not
amenable to using vmap_phys_range().
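
In code terms, the wrapped mapping array for a ring with page_cnt pages
(one header page plus page_cnt - 1 data pages) is built roughly like
this, mirroring what hv_ringbuffer_init() already does:

	pages_wraparound[0] = pages;		/* header page, mapped once */
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];

so the data pages appear twice, back to back, in the virtual mapping.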

Michael





RE: [PATCH V3 13/13] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-08-20 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 20, 2021 11:04 AM
> 
> On 8/21/2021 12:08 AM, Michael Kelley wrote:
> >>>>  }
> >>> The whole approach here is to do dma remapping on each individual page
> >>> of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map
> >>> each scatterlist entry as a unit?  Each scatterlist entry describes a 
> >>> range of
> >>> physically contiguous memory.  After dma_map_sg(), the resulting dma
> >>> address must also refer to a physically contiguous range in the swiotlb
> >>> bounce buffer memory.   So at the top of the "for" loop over the 
> >>> scatterlist
> >>> entries, do dma_map_sg() if we're in an isolated VM.  Then compute the
> >>> hvpfn value based on the dma address instead of sg_page().  But everything
> >>> else is the same, and the inner loop for populating the pfn_arry is 
> >>> unmodified.
> >>> Furthermore, the dma_range array that you've added is not needed, since
> >>> scatterlist entries already have a dma_address field for saving the mapped
> >>> address, and dma_unmap_sg() uses that field.
> >> I don't use dma_map_sg() here in order to avoid introducing one more
> >> loop(e,g dma_map_sg()). We already have a loop to populate
> >> cmd_request->dma_range[] and so do the dma map in the same loop.
> >>
> > I'm not seeing where the additional loop comes from.  Storvsc
> > already has a loop through the sgl entries.  Retain that loop and call
> > dma_map_sg() with nents set to 1.  Then the sequence is
> > dma_map_sg() --> dma_map_sg_attrs() --> dma_direct_map_sg() ->
> > dma_direct_map_page().  The latter function will call swiotlb_map()
> > to map all pages of the sgl entry as a single operation.
> 
> After dma_map_sg(), we still need to go through scatter list again to
> populate payload->rrange.pfn_array. We may just go through the scatter
> list just once if dma_map_sg() accepts a callback and run it during go
> through scatter list.

Here's some code for what I'm suggesting (not even compile tested).
The only change is what's in the "if" clause of the SNP test.  dma_map_sg()
is called with the nents parameter set to one so that it only
processes one sgl entry each time it is called, and doesn't walk the
entire sgl.  Arguably, we don't even need the SNP test and the else
clause -- just always do what's in the if clause.

The corresponding code in storvsc_on_channel_callback would also
have to be changed.   And we still have to set the min_align_mask
so swiotlb will preserve any offset.  Storvsc already has things set up
so that higher levels ensure there are no holes between sgl entries,
and that needs to stay true.

if (sg_count) {
unsigned int hvpgoff, hvpfns_to_add;
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
u64 hvpfn;
int nents;

if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {

payload_sz = (hvpg_count * sizeof(u64) +
  sizeof(struct vmbus_packet_mpb_array));
payload = kzalloc(payload_sz, GFP_ATOMIC);
if (!payload)
return SCSI_MLQUEUE_DEVICE_BUSY;
}

payload->range.len = length;
payload->range.offset = offset_in_hvpg;


for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
/*
 * Init values for the current sgl entry. hvpgoff
 * and hvpfns_to_add are in units of Hyper-V size
 * pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE
 * case also handles values of sgl->offset that are
 * larger than PAGE_SIZE. Such offsets are handled
 * even on other than the first sgl entry, provided
 * they are a multiple of PAGE_SIZE.
 */
hvpgoff = HVPFN_DOWN(sgl->offset);

if (hv_isolation_type_snp()) {
nents = dma_map_sg(dev->device, sgl, 1, 
scmnd->sc_data_direction);
if (nents != 1)

hvpfn = HVPFN_DOWN(sg_dma_address(sgl)) + 
hvpgoff;
} else {
hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
}

hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->

RE: [PATCH V3 13/13] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-08-20 Thread Michael Kelley
From: Tianyu Lan  Sent: Friday, August 20, 2021 8:20 AM
> 
> On 8/20/2021 2:17 AM, Michael Kelley wrote:
> > From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> >
> > I'm not clear on why payload->range.offset needs to be set again.
> > Even after the dma mapping is done, doesn't the offset in the first
> > page have to be the same?  If it wasn't the same, Hyper-V wouldn't
> > be able to process the PFN list correctly.  In fact, couldn't the above
> > code just always set offset_in_hvpg = 0?
> 
> The offset will be changed. The swiotlb bounce buffer is allocated with
> IO_TLB_SIZE(2K) as unit. So the offset here may be changed.
> 

We need to prevent the offset from changing.  The storvsc driver passes
just a PFN list to Hyper-V, plus an overall starting offset and length.  Unlike
the netvsc driver, each entry in the PFN list does *not* have its own offset
and length.  Hyper-V assumes that the list is "dense" and that there are
no holes (i.e., unused memory areas).

For example, consider an original buffer passed into storvsc_queuecommand()
of 8 Kbytes, laid out with 1 Kbyte at the end of the first page, then
4 Kbytes in the second page, and 3 Kbytes at the beginning of the third page.
The offset of that first 1 Kbyte has to remain 3 Kbytes.  If bounce buffering
moves it to a different offset, there's no way to tell Hyper-V to ignore the
remaining bytes in the first page (at least not without using a different
method to communicate with Hyper-V).   In such a case, the wrong
data will get transferred.  Presumably the easier solution is to set the
min_align_mask field as Christoph suggested.

> 
> >
> >>}
> >
> > The whole approach here is to do dma remapping on each individual page
> > of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map
> > each scatterlist entry as a unit?  Each scatterlist entry describes a range 
> > of
> > physically contiguous memory.  After dma_map_sg(), the resulting dma
> > address must also refer to a physically contiguous range in the swiotlb
> > bounce buffer memory.   So at the top of the "for" loop over the scatterlist
> > entries, do dma_map_sg() if we're in an isolated VM.  Then compute the
> > hvpfn value based on the dma address instead of sg_page().  But everything
> > else is the same, and the inner loop for populating the pfn_arry is 
> > unmodified.
> > Furthermore, the dma_range array that you've added is not needed, since
> > scatterlist entries already have a dma_address field for saving the mapped
> > address, and dma_unmap_sg() uses that field.
> 
> I don't use dma_map_sg() here in order to avoid introducing one more
> loop(e,g dma_map_sg()). We already have a loop to populate
> cmd_request->dma_range[] and so do the dma map in the same loop.
> 

I'm not seeing where the additional loop comes from.  Storvsc
already has a loop through the sgl entries.  Retain that loop and call
dma_map_sg() with nents set to 1.  Then the sequence is
dma_map_sg() --> dma_map_sg_attrs() --> dma_direct_map_sg() ->
dma_direct_map_page().  The latter function will call swiotlb_map()
to map all pages of the sgl entry as a single operation.

Michael




RE: [PATCH V3 13/13] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-08-20 Thread Michael Kelley
From: h...@lst.de  Sent: Thursday, August 19, 2021 9:33 PM
> 
> On Thu, Aug 19, 2021 at 06:17:40PM +, Michael Kelley wrote:
> > >
> > > @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host 
> > > *host, struct scsi_cmnd *scmnd)
> > >   payload->range.len = length;
> > >   payload->range.offset = offset_in_hvpg;
> > >
> > > + cmd_request->dma_range = kcalloc(hvpg_count,
> > > +  sizeof(*cmd_request->dma_range),
> > > +  GFP_ATOMIC);
> >
> > With this patch, it appears that storvsc_queuecommand() is always
> > doing bounce buffering, even when running in a non-isolated VM.
> > The dma_range is always allocated, and the inner loop below does
> > the dma mapping for every I/O page.  The corresponding code in
> > storvsc_on_channel_callback() that does the dma unmap allows for
> > the dma_range to be NULL, but that never happens.
> 
> Maybe I'm missing something in the hyperv code, but I don't think
> dma_map_page would bounce buffer for the non-isolated case.  It
> will just return the physical address.

OK, right.  In the isolated VM case, the swiotlb is in force mode
and will do bounce buffering.  In the non-isolated case,
dma_map_page_attrs() -> dma_direct_map_page() does a lot of
checking but eventually just returns the physical address.  As this
patch is currently coded, it adds a fair amount of overhead
here in storvsc_queuecommand(), plus the overhead of the dma
mapping function deciding to use the identity mapping.  But if
dma_map_sg() is used and the code is simplified a bit, the overhead
will be less in general and will be per sgl entry instead of per page.

> 
> > > + if (offset_in_hvpg) {
> > > + payload->range.offset = dma & 
> > > ~HV_HYP_PAGE_MASK;
> > > + offset_in_hvpg = 0;
> > > + }
> >
> > I'm not clear on why payload->range.offset needs to be set again.
> > Even after the dma mapping is done, doesn't the offset in the first
> > page have to be the same?  If it wasn't the same, Hyper-V wouldn't
> > be able to process the PFN list correctly.  In fact, couldn't the above
> > code just always set offset_in_hvpg = 0?
> 
> Careful.  DMA mapping is supposed to keep the offset in the page, but
> for that the DMA mapping code needs to know what the device considers a
> "page".  For that the driver needs to set the min_align_mask field in
> struct device_dma_parameters.
> 

I see that the swiotlb code gets and uses the min_align_mask field.  But
the NVME driver is the only driver that ever sets it, so the value is zero
in all other cases.  Does swiotlb just use PAGE_SIZE in that case?  I
couldn't tell from a quick glance at the swiotlb code.

> >
> > The whole approach here is to do dma remapping on each individual page
> > of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map
> > each scatterlist entry as a unit?  Each scatterlist entry describes a range 
> > of
> > physically contiguous memory.  After dma_map_sg(), the resulting dma
> > address must also refer to a physically contiguous range in the swiotlb
> > bounce buffer memory.   So at the top of the "for" loop over the scatterlist
> > entries, do dma_map_sg() if we're in an isolated VM.  Then compute the
> > hvpfn value based on the dma address instead of sg_page().  But everything
> > else is the same, and the inner loop for populating the pfn_array is
> > unmodified.
> > Furthermore, the dma_range array that you've added is not needed, since
> > scatterlist entries already have a dma_address field for saving the mapped
> > address, and dma_unmap_sg() uses that field.
> 
> Yes, I think dma_map_sg is the right thing to use here, probably even
> for the non-isolated case so that we can get the hv drivers out of their
> little corner and into being more like a normal kernel driver.  That
> is, use the scsi_dma_map/scsi_dma_unmap helpers, and then iterate over
> the dma addresses one page at a time using for_each_sg_dma_page.
> 

Doing some broader revisions to the Hyper-V storvsc driver is up next on
my to-do list.  Rather than significantly modifying the non-isolated case in
this patch set, I'd suggest factoring it into my broader revisions.

Michael



RE: [PATCH V3 13/13] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-08-19 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 

Subject line tag should be "scsi: storvsc:"

> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> mpb_desc() still need to handle. Use DMA API to map/umap these

s/need to handle/needs to be handled/

> memory during sending/receiving packet and Hyper-V DMA ops callback
> will use swiotlb function to allocate bounce buffer and copy data
> from/to bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/scsi/storvsc_drv.c | 68 +++---
>  1 file changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 328bb961c281..78320719bdd8 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -21,6 +21,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -427,6 +429,8 @@ struct storvsc_cmd_request {
>   u32 payload_sz;
> 
>   struct vstor_packet vstor_packet;
> + u32 hvpg_count;

This count is really the number of entries in the dma_range
array, right?  If so, perhaps "dma_range_count" would be
a better name so that it is more tightly associated with the
dma_range array.

> + struct hv_dma_range *dma_range;
>  };
> 
> 
> @@ -509,6 +513,14 @@ struct storvsc_scan_work {
>   u8 tgt_id;
>  };
> 
> +#define storvsc_dma_map(dev, page, offset, size, dir) \
> + dma_map_page(dev, page, offset, size, dir)
> +
> +#define storvsc_dma_unmap(dev, dma_range, dir)   \
> + dma_unmap_page(dev, dma_range.dma,  \
> +dma_range.mapping_size,  \
> +dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)
> +

Each of these macros is used only once.  IMHO, they don't
add a lot of value.  Just coding dma_map/unmap_page()
inline would be fine and eliminate these lines of code.
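
That is, a hedged sketch of the open-coded calls (field names taken from
the patch; the device variable names are assumed):

	/* map side, in storvsc_queuecommand() */
	dma = dma_map_page(&dev->device, page, offset_in_hvpg, size,
			   scmnd->sc_data_direction);

	/* unmap side, in storvsc_on_channel_callback() */
	dma_unmap_page(&device->device, request->dma_range[i].dma,
		       request->dma_range[i].mapping_size,
		       request->vstor_packet.vm_srb.data_in == READ_TYPE ?
				DMA_FROM_DEVICE : DMA_TO_DEVICE);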

>  static void storvsc_device_scan(struct work_struct *work)
>  {
>   struct storvsc_scan_work *wrk;
> @@ -1260,6 +1272,7 @@ static void storvsc_on_channel_callback(void *context)
>   struct hv_device *device;
>   struct storvsc_device *stor_device;
>   struct Scsi_Host *shost;
> + int i;
> 
>   if (channel->primary_channel != NULL)
>   device = channel->primary_channel->device_obj;
> @@ -1314,6 +1327,15 @@ static void storvsc_on_channel_callback(void *context)
>   request = (struct storvsc_cmd_request 
> *)scsi_cmd_priv(scmnd);
>   }
> 
> + if (request->dma_range) {
> + for (i = 0; i < request->hvpg_count; i++)
> + storvsc_dma_unmap(&device->device,
> + request->dma_range[i],
> + request->vstor_packet.vm_srb.data_in == READ_TYPE);

I think you can directly get the DMA direction as 
request->cmd->sc_data_direction.

> +
> + kfree(request->dma_range);
> + }
> +
>   storvsc_on_receive(stor_device, packet, request);
>   continue;
>   }
> @@ -1810,7 +1832,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, 
> struct scsi_cmnd *scmnd)
>   unsigned int hvpgoff, hvpfns_to_add;
>   unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
>   unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
> + dma_addr_t dma;
>   u64 hvpfn;
> + u32 size;
> 
>   if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
> 
> @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host 
> *host, struct scsi_cmnd *scmnd)
>   payload->range.len = length;
>   payload->range.offset = offset_in_hvpg;
> 
> + cmd_request->dma_range = kcalloc(hvpg_count,
> +  sizeof(*cmd_request->dma_range),
> +  GFP_ATOMIC);

With this patch, it appears that storvsc_queuecommand() is always
doing bounce buffering, even when running in a non-isolated VM.
The dma_range is always allocated, and the inner loop below does
the dma mapping for every I/O page.  The corresponding code in
storvsc_on_channel_callback() that does the dma unmap allows for
the dma_range to be NULL, but that never happens.

> + if (!cmd_request->dma_range) {
> + ret = -ENOMEM;

The other memory allocation failure in this function returns
SCSI_MLQUEUE_DEVICE_BUSY.  It may be debatable whether
that's the best approach, but that's a topic for a different patch.  I
would suggest being consistent and using the same return code
here.
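
In other words, something along these lines (a hedged sketch, assuming
free_payload unwinds correctly and ret is what the function returns):

	if (!cmd_request->dma_range) {
		ret = SCSI_MLQUEUE_DEVICE_BUSY;
		goto free_payload;
	}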

> + goto free_payload;
> + }
> 
>   

RE: [PATCH V3 12/13] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-08-19 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 

The Subject line tag should be "hv_netvsc:".

> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() still need to handle. Use DMA API to map/umap these
> memory during sending/receiving packet and Hyper-V DMA ops callback
> will use swiotlb function to allocate bounce buffer and copy data
> from/to bounce buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/net/hyperv/hyperv_net.h   |   6 ++
>  drivers/net/hyperv/netvsc.c   | 144 +-
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/linux/hyperv.h|   5 ++
>  4 files changed, 154 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index bc48855dff10..862419912bfb 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
> 
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + void *recv_original_buf;
>   u32 recv_buf_size; /* allocated bytes */
>   u32 recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1084,8 @@ struct netvsc_device {
> 
>   /* Send buffer allocated by us */
>   void *send_buf;
> + void *send_original_buf;
> + u32 send_buf_size;
>   u32 send_buf_gpadl_handle;
>   u32 send_section_cnt;
>   u32 send_section_size;
> @@ -1730,4 +1734,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX2000/* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 7bd935412853..fc312e5db4d5 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
>   int i;
> 
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_original_buf) {
> + vunmap(nvdev->recv_buf);
> + vfree(nvdev->recv_original_buf);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_original_buf) {
> + vunmap(nvdev->send_buf);
> + vfree(nvdev->send_original_buf);
> + } else {
> + vfree(nvdev->send_buf);
> + }
> +
>   kfree(nvdev->send_section_map);
> 
>   for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -330,6 +343,27 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device 
> *net_device, u32 q_idx)
>   return nvchan->mrc.slots ? 0 : -ENOMEM;
>  }
> 
> +static void *netvsc_remap_buf(void *buf, unsigned long size)
> +{
> + unsigned long *pfns;
> + void *vaddr;
> + int i;
> +
> + pfns = kcalloc(size / HV_HYP_PAGE_SIZE, sizeof(unsigned long),
> +GFP_KERNEL);

This assumes that the "size" argument is a multiple of PAGE_SIZE.  I think
that's true in all the use cases, but it would be safer to check.
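
A minimal sanity check would be enough, e.g. (hedged sketch, using the
same HV_HYP_PAGE_SIZE unit that the loop divides by):

	if (WARN_ON(!IS_ALIGNED(size, HV_HYP_PAGE_SIZE)))
		return NULL;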

> + if (!pfns)
> + return NULL;
> +
> + for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(buf + i * HV_HYP_PAGE_SIZE)
> + + (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
> + kfree(pfns);
> +
> + return vaddr;
> +}

This function appears to be a duplicate of hv_map_memory() in Patch 11 of this
series.  Is it possible to structure things so there is only one 
implementation?  In
any case, see the comment in hv_map_memory() about PAGE_SIZE vs
HV_HYP_PAGE_SIZE and similar.

> +
>  static int netvsc_init_buf(struct hv_device *device,
>  struct netvsc_device *net_device,
>  const struct netvsc_device_info *device_info)
> @@ -340,6 +374,7 @@ static int netvsc_init_buf(struct hv_device *device,
>   unsigned int buf_size;
>   size_t map_words;
>   int i, ret = 0;
> + void *vaddr;
> 
>   /* Get receive buffer area. */
>   buf_size = device_info->recv_sections * device_info->recv_section_size;
> @@ -375,6 +410,15 @@ static int netvsc_init_buf(struct hv_device *device,
>   goto cleanup;
>   }
> 
> + if (hv_isolation_type_snp()) {
> + vaddr = netvsc_remap_buf(net_device->recv_buf, buf_size);
> + if (!vaddr)
> + goto cleanup;
> +
> + 

RE: [PATCH V3 11/13] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-08-19 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 
> Hyper-V Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory and so enable swiotlb force
> mode to use swiotlb bounce buffer for DMA transaction.
> 
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Swiotlb bounce buffer code calls dma_map_decrypted()
> to mark bounce buffer visible to host and map it in extra
> address space. Populate dma memory decrypted ops with hv
> map/unmap function.
> 
> Hyper-V initalizes swiotlb bounce buffer and default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions which depends on the pci_xen_swiotlb_
> init(). Make pci_xen_swiotlb_init() depends on the hyperv_swiotlb
> _detect() to keep the order.
> 
> The map function vmap_pfn() can't work in the early place
> hyperv_iommu_swiotlb_init() and so initialize swiotlb bounce
> buffer in the hyperv_iommu_swiotlb_later_init().
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/hyperv/ivm.c   | 28 ++
>  arch/x86/include/asm/mshyperv.h |  2 +
>  arch/x86/xen/pci-swiotlb-xen.c  |  3 +-
>  drivers/hv/vmbus_drv.c  |  3 ++
>  drivers/iommu/hyperv-iommu.c| 65 +
>  include/linux/hyperv.h  |  1 +
>  6 files changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index c13ec5560d73..0f05e4d6fc62 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -265,3 +265,31 @@ int hv_set_mem_host_visibility(unsigned long addr, int 
> numpages, bool visible)
> 
>   return __hv_set_mem_host_visibility((void *)addr, numpages, visibility);
>  }
> +
> +/*
> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
> + */
> +void *hv_map_memory(void *addr, unsigned long size)
> +{
> + unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
> +   sizeof(unsigned long), GFP_KERNEL);
> + void *vaddr;
> + int i;
> +
> + if (!pfns)
> + return NULL;
> +
> + for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(addr + i * HV_HYP_PAGE_SIZE) +
> + (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
> + kfree(pfns);
> +
> + return vaddr;
> +}

This function is manipulating page tables in the guest VM.  It is not involved
in communicating with Hyper-V, or passing PFNs to Hyper-V.  The pfn array
contains guest PFNs, not Hyper-V PFNs.  So it should use PAGE_SIZE
instead of HV_HYP_PAGE_SIZE, and similarly PAGE_SHIFT and virt_to_pfn().
If this code were ever to run on ARM64 in the future with PAGE_SIZE other
than 4 Kbytes, the use of PAGE_SIZE is the correct choice.
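
A hedged sketch of what the function would look like with that change
(an illustration of the suggestion, not the final code):

void *hv_map_memory(void *addr, unsigned long size)
{
	unsigned long *pfns = kcalloc(size / PAGE_SIZE,
				      sizeof(unsigned long), GFP_KERNEL);
	void *vaddr;
	int i;

	if (!pfns)
		return NULL;

	for (i = 0; i < size / PAGE_SIZE; i++)
		pfns[i] = virt_to_pfn(addr + i * PAGE_SIZE) +
			  (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);

	vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
	kfree(pfns);

	return vaddr;
}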

> +
> +void hv_unmap_memory(void *addr)
> +{
> + vunmap(addr);
> +}
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index a30c60f189a3..b247739f57ac 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -250,6 +250,8 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct 
> hv_interrupt_entry *entry);
>  int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  enum hv_mem_host_visibility visibility);
>  int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool 
> visible);
> +void *hv_map_memory(void *addr, unsigned long size);
> +void hv_unmap_memory(void *addr);
>  void hv_sint_wrmsrl_ghcb(u64 msr, u64 value);
>  void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
>  void hv_signal_eom_ghcb(void);
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 54f9aa7e8457..43bd031aa332 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include 
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
> 
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 57bbbaa4e8f7..f068e22a5636 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -23,6 +23,7 @@
>  

RE: [PATCH V3 08/13] HV/Vmbus: Initialize VMbus ring buffer for Isolation VM

2021-08-16 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 
> VMbus ring buffer are shared with host and it's need to

s/it's need/it needs/

> be accessed via extra address space of Isolation VM with
> SNP support. This patch is to map the ring buffer
> address in extra address space via ioremap(). HV host

It's actually using vmap_pfn(), not ioremap().

> visibility hvcall smears data in the ring buffer and
> so reset the ring buffer memory to zero after calling
> visibility hvcall.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/hv/Kconfig|  1 +
>  drivers/hv/channel.c  | 10 +
>  drivers/hv/hyperv_vmbus.h |  2 +
>  drivers/hv/ring_buffer.c  | 84 ++-
>  4 files changed, 79 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index d1123ceb38f3..dd12af20e467 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -8,6 +8,7 @@ config HYPERV
>   || (ARM64 && !CPU_BIG_ENDIAN))
>   select PARAVIRT
>   select X86_HV_CALLBACK_VECTOR if X86
> + select VMAP_PFN
>   help
> Select this option to run Linux as a Hyper-V client operating
> system.
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index 4c4717c26240..60ef881a700c 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -712,6 +712,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
>   if (err)
>   goto error_clean_ring;
> 
> + err = hv_ringbuffer_post_init(&newchannel->outbound,
> +   page, send_pages);
> + if (err)
> + goto error_free_gpadl;
> +
> + err = hv_ringbuffer_post_init(&newchannel->inbound,
> +   &page[send_pages], recv_pages);
> + if (err)
> + goto error_free_gpadl;
> +
>   /* Create and init the channel open message */
>   open_info = kzalloc(sizeof(*open_info) +
>  sizeof(struct vmbus_channel_open_channel),
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 40bc0eff6665..15cd23a561f3 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -172,6 +172,8 @@ extern int hv_synic_cleanup(unsigned int cpu);
>  /* Interface */
> 
>  void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
> +int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
> + struct page *pages, u32 page_cnt);
> 
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  struct page *pages, u32 pagecnt, u32 max_pkt_size);
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 2aee356840a2..d4f93fca1108 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -17,6 +17,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "hyperv_vmbus.h"
> 
> @@ -179,43 +181,89 @@ void hv_ringbuffer_pre_init(struct vmbus_channel 
> *channel)
>   mutex_init(>outbound.ring_buffer_mutex);
>  }
> 
> -/* Initialize the ring buffer. */
> -int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
> -struct page *pages, u32 page_cnt, u32 max_pkt_size)
> +int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
> +struct page *pages, u32 page_cnt)
>  {
> + u64 physic_addr = page_to_pfn(pages) << PAGE_SHIFT;
> + unsigned long *pfns_wraparound;
> + void *vaddr;
>   int i;
> - struct page **pages_wraparound;
> 
> - BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
> + if (!hv_isolation_type_snp())
> + return 0;
> +
> + physic_addr += ms_hyperv.shared_gpa_boundary;
> 
>   /*
>* First page holds struct hv_ring_buffer, do wraparound mapping for
>* the rest.
>*/
> - pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
> + pfns_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(unsigned long),
>  GFP_KERNEL);
> - if (!pages_wraparound)
> + if (!pfns_wraparound)
>   return -ENOMEM;
> 
> - pages_wraparound[0] = pages;
> + pfns_wraparound[0] = physic_addr >> PAGE_SHIFT;
>   for (i = 0; i < 2 * (page_cnt - 1); i++)
> - pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];
> -
> - ring_info->ring_buffer = (struct hv_ring_buffer *)
> - vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);
> -
> - kfree(pages_wraparound);
> + pfns_wraparound[i + 1] = (physic_addr >> PAGE_SHIFT) +
> + i % (page_cnt - 1) + 1;
> 
> -
> - if (!ring_info->ring_buffer)
> + vaddr = vmap_pfn(pfns_wraparound, page_cnt * 2 - 1, PAGE_KERNEL_IO);
> + kfree(pfns_wraparound);
> + if (!vaddr)
>   return -ENOMEM;
> 
> - ring_info->ring_buffer->read_index =
> - ring_info->ring_buffer->write_index = 0;
> + /* Clean memory after setting 

RE: [PATCH V3 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support

2021-08-16 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 
> Hyper-V provides two kinds of Isolation VMs. VBS(Virtualization-based
> security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
> is to add support for these Isolation VM support in Linux.
> 

A general comment about this series:  I have not seen any statements
made about whether either type of Isolated VM is supported for 32-bit
Linux guests.   arch/x86/Kconfig has CONFIG_AMD_MEM_ENCRYPT as
64-bit only, so evidently SEV-SNP Isolated VMs would be 64-bit only.
But I don't know if VBS VMs are any different.

I didn't track down what happens if a 32-bit Linux is booted in
a VM that supports SEV-SNP.  Presumably some kind of message
is output that no encryption is being done.  But at a slightly
higher level, the Hyper-V initialization path should probably
also check for 32-bit and output a clear message that no isolation
is being provided.  At that point, I don't know if it is possible to
continue in non-isolated mode or whether the only choice is to
panic.  Continuing in non-isolated mode might be a bad idea
anyway since presumably the user has explicitly requested an
Isolated VM.

Related, I noticed usage of "unsigned long" for holding physical
addresses, which works when running 64-bit, but not when running
32-bit.  But even if Isolated VMs are always 64-bit, it would still be
better to clean this up and use phys_addr_t instead.  Unfortunately,
more generic functions like set_memory_encrypted() and
set_memory_decrypted() have physical address arguments that
are of type unsigned long.

Michael



RE: [PATCH V3 07/13] HV/Vmbus: Add SNP support for VMbus channel initiate message

2021-08-13 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 
> The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
> with host in Isolation VM and so it's necessary to use hvcall to set
> them visible to host. In Isolation VM with AMD SEV SNP, the access
> address should be in the extra space which is above shared gpa
> boundary. So remap these pages into the extra address(pa +
> shared_gpa_boundary). Introduce monitor_pages_va to store
> the remap address and unmap these va when disconnect vmbus.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v1:
> * Not remap monitor pages in the non-SNP isolation VM.
> ---
>  drivers/hv/connection.c   | 65 +++
>  drivers/hv/hyperv_vmbus.h |  1 +
>  2 files changed, 66 insertions(+)
> 
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 6d315c1465e0..bf0ac3167bd2 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "hyperv_vmbus.h"
> @@ -104,6 +105,12 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
> 
>   msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
>   msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> +
> + if (hv_isolation_type_snp()) {
> + msg->monitor_page1 += ms_hyperv.shared_gpa_boundary;
> + msg->monitor_page2 += ms_hyperv.shared_gpa_boundary;
> + }
> +
>   msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
> 
>   /*
> @@ -148,6 +155,31 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
> *msginfo, u32 version)
>   return -ECONNREFUSED;
>   }
> 
> + if (hv_isolation_type_snp()) {
> + vmbus_connection.monitor_pages_va[0]
> + = vmbus_connection.monitor_pages[0];
> + vmbus_connection.monitor_pages[0]
> + = memremap(msg->monitor_page1, HV_HYP_PAGE_SIZE,
> +MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[0])
> + return -ENOMEM;

This error case causes vmbus_negotiate_version() to return with
vmbus_connection.con_state set to CONNECTED.  But the caller never checks the
returned error code except for ETIMEDOUT.  So the caller will think that
vmbus_negotiate_version() succeeded when it didn't.  There may be some
existing bugs in that error handling code. :-(

> +
> + vmbus_connection.monitor_pages_va[1]
> + = vmbus_connection.monitor_pages[1];
> + vmbus_connection.monitor_pages[1]
> + = memremap(msg->monitor_page2, HV_HYP_PAGE_SIZE,
> +MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[1]) {
> + memunmap(vmbus_connection.monitor_pages[0]);
> + return -ENOMEM;
> + }
> +
> + memset(vmbus_connection.monitor_pages[0], 0x00,
> +HV_HYP_PAGE_SIZE);
> + memset(vmbus_connection.monitor_pages[1], 0x00,
> +HV_HYP_PAGE_SIZE);
> + }
> +

I don't think the memset() calls are needed.  The memory was originally
allocated with hv_alloc_hyperv_zeroed_page(), so it should already be zeroed.

>   return ret;
>  }
> 
> @@ -159,6 +191,7 @@ int vmbus_connect(void)
>   struct vmbus_channel_msginfo *msginfo = NULL;
>   int i, ret = 0;
>   __u32 version;
> + u64 pfn[2];
> 
>   /* Initialize the vmbus connection */
>   vmbus_connection.conn_state = CONNECTING;
> @@ -216,6 +249,16 @@ int vmbus_connect(void)
>   goto cleanup;
>   }
> 
> + if (hv_is_isolation_supported()) {
> + pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
> + pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
> + if (hv_mark_gpa_visibility(2, pfn,
> + VMBUS_PAGE_VISIBLE_READ_WRITE)) {

Note that hv_mark_gpa_visibility() will need an appropriate no-op stub so
that this architecture independent code will compile for ARM64.
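
Something like this in the generic header would do (a hedged sketch; the
exact location and return value are judgment calls):

static inline int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
					 enum hv_mem_host_visibility visibility)
{
	return -EOPNOTSUPP;	/* no isolation support on this arch */
}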

> + ret = -EFAULT;
> + goto cleanup;
> + }
> + }
> +
>   msginfo = kzalloc(sizeof(*msginfo) +
> sizeof(struct vmbus_channel_initiate_contact),
> GFP_KERNEL);
> @@ -284,6 +327,8 @@ int vmbus_connect(void)
> 
>  void vmbus_disconnect(void)
>  {
> + u64 pfn[2];
> +
>   /*
>* First send the unload request to the host.
>*/
> @@ -303,6 +348,26 @@ void vmbus_disconnect(void)
>   vmbus_connection.int_page = NULL;
>   }
> 
> + if (hv_is_isolation_supported()) {
> + if (vmbus_connection.monitor_pages_va[0]) {
> + memunmap(vmbus_connection.monitor_pages[0]);
> + 

RE: [PATCH V3 06/13] HV: Add ghcb hvcall support for SNP VM

2021-08-13 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> 
> Hyper-V provides ghcb hvcall to handle VMBus
> HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE
> msg in SNP Isolation VM. Add such support.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/hyperv/ivm.c   | 43 +
>  arch/x86/include/asm/mshyperv.h |  1 +
>  drivers/hv/connection.c |  6 -
>  drivers/hv/hv.c |  8 +-
>  include/asm-generic/mshyperv.h  | 29 ++
>  5 files changed, 85 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index ec0e5c259740..c13ec5560d73 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -15,6 +15,49 @@
>  #include 
>  #include 
> 
> +#define GHCB_USAGE_HYPERV_CALL   1
> +
> +u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!ms_hyperv.ghcb_base)
> + return -EFAULT;
> +
> + WARN_ON(in_nmi());
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return -EFAULT;
> + }
> +
> + hv_ghcb->ghcb.protocol_version = GHCB_PROTOCOL_MAX;
> + hv_ghcb->ghcb.ghcb_usage = GHCB_USAGE_HYPERV_CALL;
> +
> + hv_ghcb->hypercall.outputgpa = (u64)output;
> + hv_ghcb->hypercall.hypercallinput.asuint64 = 0;
> + hv_ghcb->hypercall.hypercallinput.callcode = control;
> +
> + if (input_size)
> + memcpy(hv_ghcb->hypercall.hypercalldata, input, input_size);
> +
> + VMGEXIT();
> +
> + hv_ghcb->ghcb.ghcb_usage = 0x;
> + memset(hv_ghcb->ghcb.save.valid_bitmap, 0,
> +sizeof(hv_ghcb->ghcb.save.valid_bitmap));
> +
> + local_irq_restore(flags);
> +
> + return hv_ghcb->hypercall.hypercalloutput.callstatus;
> +}
> +EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);

This function is called from architecture independent code, so it needs a
default no-op stub to enable the code to compile on ARM64.  The stub should
always return failure.
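
For example, a hedged sketch of such a stub (the specific status code is
a judgment call, as long as it reads as failure):

static inline u64 hv_ghcb_hypercall(u64 control, void *input, void *output,
				    u32 input_size)
{
	return HV_STATUS_INVALID_HYPERCALL_CODE;	/* always fail */
}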

> +
>  void hv_ghcb_msr_write(u64 msr, u64 value)
>  {
>   union hv_ghcb *hv_ghcb;
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 730985676ea3..a30c60f189a3 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -255,6 +255,7 @@ void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
>  void hv_signal_eom_ghcb(void);
>  void hv_ghcb_msr_write(u64 msr, u64 value);
>  void hv_ghcb_msr_read(u64 msr, u64 *value);
> +u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 
> input_size);
> 
>  #define hv_get_synint_state_ghcb(int_num, val)   \
>   hv_sint_rdmsrl_ghcb(HV_X64_MSR_SINT0 + int_num, val)
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 5e479d54918c..6d315c1465e0 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -447,6 +447,10 @@ void vmbus_set_event(struct vmbus_channel *channel)
> 
>   ++channel->sig_events;
> 
> - hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
> + if (hv_isolation_type_snp())
> + hv_ghcb_hypercall(HVCALL_SIGNAL_EVENT, &channel->sig_event,
> + NULL, sizeof(u64));
> + else
> + hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
>  }
>  EXPORT_SYMBOL_GPL(vmbus_set_event);
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 59f7173c4d9f..e5c9fc467893 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -98,7 +98,13 @@ int hv_post_message(union hv_connection_id connection_id,
>   aligned_msg->payload_size = payload_size;
>   memcpy((void *)aligned_msg->payload, payload, payload_size);
> 
> - status = hv_do_hypercall(HVCALL_POST_MESSAGE, aligned_msg, NULL);
> + if (hv_isolation_type_snp())
> + status = hv_ghcb_hypercall(HVCALL_POST_MESSAGE,
> + (void *)aligned_msg, NULL,
> + sizeof(struct hv_input_post_message));
> + else
> + status = hv_do_hypercall(HVCALL_POST_MESSAGE,
> + aligned_msg, NULL);
> 
>   /* Preemption must remain disabled until after the hypercall
>* so some other thread can't get scheduled onto this cpu and
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 90dac369a2dc..400181b855c1 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -31,6 +31,35 @@
> 
>  union hv_ghcb {
>   struct ghcb ghcb;
> + struct {
> + u64 hypercalldata[509];
> + u64 outputgpa;
> + union {
> + union {
> + struct {
> +

RE: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page

2021-08-13 Thread Michael Kelley
From: Michael Kelley  Sent: Friday, August 13, 2021 
12:31 PM
> To: Tianyu Lan ; KY Srinivasan ; 
> Haiyang Zhang ;
> Stephen Hemminger ; wei@kernel.org; Dexuan Cui 
> ;
> t...@linutronix.de; mi...@redhat.com; b...@alien8.de; x...@kernel.org; 
> h...@zytor.com; dave.han...@linux.intel.com;
> l...@kernel.org; pet...@infradead.org; konrad.w...@oracle.com; 
> boris.ostrov...@oracle.com; jgr...@suse.com;
> sstabell...@kernel.org; j...@8bytes.org; w...@kernel.org; 
> da...@davemloft.net; k...@kernel.org; j...@linux.ibm.com;
> martin.peter...@oracle.com; a...@arndb.de; h...@lst.de; 
> m.szyprow...@samsung.com; robin.mur...@arm.com;
> thomas.lenda...@amd.com; brijesh.si...@amd.com; a...@kernel.org; Tianyu Lan 
> ;
> pgo...@google.com; martin.b.ra...@gmail.com; a...@linux-foundation.org; 
> kirill.shute...@linux.intel.com;
> r...@kernel.org; s...@canb.auug.org.au; saravan...@fb.com; 
> krish.sadhuk...@oracle.com;
> aneesh.ku...@linux.ibm.com; xen-devel@lists.xenproject.org; 
> rient...@google.com; han...@cmpxchg.org;
> t...@kernel.org
> Cc: io...@lists.linux-foundation.org; linux-a...@vger.kernel.org; 
> linux-hyp...@vger.kernel.org; linux-
> ker...@vger.kernel.org; linux-s...@vger.kernel.org; net...@vger.kernel.org; 
> vkuznets ;
> parri.and...@gmail.com; dave.han...@intel.com
> Subject: RE: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page
> 
> From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> > Subject: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page
> 
> See previous comments about tag in the Subject line.
> 
> > Hyper-V provides GHCB protocol to write Synthetic Interrupt
> > Controller MSR registers in Isolation VM with AMD SEV SNP
> > and these registers are emulated by hypervisor directly.
> > Hyper-V requires to write SINTx MSR registers twice. First
> > writes MSR via GHCB page to communicate with hypervisor
> > and then writes wrmsr instruction to talk with paravisor
> > which runs in VMPL0. Guest OS ID MSR also needs to be set
> > via GHCB.
> >
> > Signed-off-by: Tianyu Lan 
> > ---
> > Change since v1:
> >  * Introduce sev_es_ghcb_hv_call_simple() and share code
> >between SEV and Hyper-V code.
> > ---
> >  arch/x86/hyperv/hv_init.c   |  33 ++---
> >  arch/x86/hyperv/ivm.c   | 110 +
> >  arch/x86/include/asm/mshyperv.h |  78 +++-
> >  arch/x86/include/asm/sev.h  |   3 +
> >  arch/x86/kernel/cpu/mshyperv.c  |   3 +
> >  arch/x86/kernel/sev-shared.c|  63 ++---
> >  drivers/hv/hv.c | 121 ++--
> >  include/asm-generic/mshyperv.h  |  12 +++-
> >  8 files changed, 329 insertions(+), 94 deletions(-)
> >
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index b3683083208a..ab0b33f621e7 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -423,7 +423,7 @@ void __init hyperv_init(void)
> > goto clean_guest_os_id;
> >
> > if (hv_isolation_type_snp()) {
> > -   ms_hyperv.ghcb_base = alloc_percpu(void *);
> > +   ms_hyperv.ghcb_base = alloc_percpu(union hv_ghcb __percpu *);
> 
> union hv_ghcb isn't defined.  It is not added until patch 6 of the series.
> 

Ignore this comment.  My mistake.

Michael



RE: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page

2021-08-13 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> Subject: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page

See previous comments about tag in the Subject line.

> Hyper-V provides GHCB protocol to write Synthetic Interrupt
> Controller MSR registers in Isolation VM with AMD SEV SNP
> and these registers are emulated by hypervisor directly.
> Hyper-V requires to write SINTx MSR registers twice. First
> writes MSR via GHCB page to communicate with hypervisor
> and then writes wrmsr instruction to talk with paravisor
> which runs in VMPL0. Guest OS ID MSR also needs to be set
> via GHCB.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v1:
>  * Introduce sev_es_ghcb_hv_call_simple() and share code
>between SEV and Hyper-V code.
> ---
>  arch/x86/hyperv/hv_init.c   |  33 ++---
>  arch/x86/hyperv/ivm.c   | 110 +
>  arch/x86/include/asm/mshyperv.h |  78 +++-
>  arch/x86/include/asm/sev.h  |   3 +
>  arch/x86/kernel/cpu/mshyperv.c  |   3 +
>  arch/x86/kernel/sev-shared.c|  63 ++---
>  drivers/hv/hv.c | 121 ++--
>  include/asm-generic/mshyperv.h  |  12 +++-
>  8 files changed, 329 insertions(+), 94 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index b3683083208a..ab0b33f621e7 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -423,7 +423,7 @@ void __init hyperv_init(void)
>   goto clean_guest_os_id;
> 
>   if (hv_isolation_type_snp()) {
> - ms_hyperv.ghcb_base = alloc_percpu(void *);
> + ms_hyperv.ghcb_base = alloc_percpu(union hv_ghcb __percpu *);

union hv_ghcb isn't defined.  It is not added until patch 6 of the series.

>   if (!ms_hyperv.ghcb_base)
>   goto clean_guest_os_id;
> 
> @@ -432,6 +432,9 @@ void __init hyperv_init(void)
>   ms_hyperv.ghcb_base = NULL;
>   goto clean_guest_os_id;
>   }
> +
> + /* Hyper-V requires to write guest os id via ghcb in SNP IVM. */
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
>   }
> 
>   rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> @@ -523,6 +526,7 @@ void hyperv_cleanup(void)
> 
>   /* Reset our OS id */
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
> 
>   /*
>* Reset hypercall page reference before reset the page,
> @@ -596,30 +600,3 @@ bool hv_is_hyperv_initialized(void)
>   return hypercall_msr.enable;
>  }
>  EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
> -
> -enum hv_isolation_type hv_get_isolation_type(void)
> -{
> - if (!(ms_hyperv.priv_high & HV_ISOLATION))
> - return HV_ISOLATION_TYPE_NONE;
> - return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
> -}
> -EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> -
> -bool hv_is_isolation_supported(void)
> -{
> - if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> - return 0;
> -
> - if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> - return 0;
> -
> - return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
> -}
> -
> -DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
> -
> -bool hv_isolation_type_snp(void)
> -{
> - return static_branch_unlikely(&isolation_type_snp);
> -}
> -EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 8c905ffdba7f..ec0e5c259740 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -6,6 +6,8 @@
>   *  Tianyu Lan 
>   */
> 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -13,6 +15,114 @@
>  #include 
>  #include 
> 
> +void hv_ghcb_msr_write(u64 msr, u64 value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!ms_hyperv.ghcb_base)
> + return;
> +
> + WARN_ON(in_nmi());
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> + ghcb_set_rcx(&hv_ghcb->ghcb, msr);
> + ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
> + ghcb_set_rdx(&hv_ghcb->ghcb, value >> 32);

Having used lower_32_bits() in the previous line, perhaps use
upper_32_bits() here?
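
That is:

	ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
	ghcb_set_rdx(&hv_ghcb->ghcb, upper_32_bits(value));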

> +
> + if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 1, 0))
> + pr_warn("Fail to write msr via ghcb %llx.\n", msr);
> +
> + local_irq_restore(flags);
> +}
> +
> +void hv_ghcb_msr_read(u64 msr, u64 *value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!ms_hyperv.ghcb_base)
> + return;
> +
> + WARN_ON(in_nmi());
> +
> + local_irq_save(flags);
> + 

RE: [PATCH V3 04/13] HV: Mark vmbus ring buffer visible to host in Isolation VM

2021-08-12 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> Subject: [PATCH V3 04/13] HV: Mark vmbus ring buffer visible to host in 
> Isolation VM
> 

Use tag "Drivers: hv: vmbus:" in the Subject line.

> Mark vmbus ring buffer visible with set_memory_decrypted() when
> establish gpadl handle.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/hv/channel.c   | 44 --
>  include/linux/hyperv.h | 11 +++
>  2 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index f3761c73b074..4c4717c26240 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> 
> @@ -465,7 +466,14 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   struct list_head *curr;
>   u32 next_gpadl_handle;
>   unsigned long flags;
> - int ret = 0;
> + int ret = 0, index;
> +
> + index = atomic_inc_return(&channel->gpadl_index) - 1;
> +
> + if (index > VMBUS_GPADL_RANGE_COUNT - 1) {
> + pr_err("Gpadl handle position(%d) has been occupied.\n", index);
> + return -ENOSPC;
> + }
> 
>   next_gpadl_handle =
>   (atomic_inc_return(_connection.next_gpadl_handle) - 1);
> @@ -474,6 +482,13 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   if (ret)
>   return ret;
> 
> + ret = set_memory_decrypted((unsigned long)kbuffer,
> +HVPFN_UP(size));
> + if (ret) {
> + pr_warn("Failed to set host visibility.\n");

Enhance this message a bit.  "Failed to set host visibility for new GPADL\n"
and also output the value of ret.
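
For example (a hedged sketch):

	pr_warn("Failed to set host visibility for new GPADL: %d\n", ret);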

> + return ret;
> + }
> +
>   init_completion(>waitevent);
>   msginfo->waiting_channel = channel;
> 
> @@ -539,6 +554,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   /* At this point, we received the gpadl created msg */
>   *gpadl_handle = gpadlmsg->gpadl;
> 
> + channel->gpadl_array[index].size = size;
> + channel->gpadl_array[index].buffer = kbuffer;
> + channel->gpadl_array[index].gpadlhandle = *gpadl_handle;
> +

I can see the merits of transparently stashing the memory address and size
that will be needed by vmbus_teardown_gpadl(), so that the callers of
__vmbus_establish_gpadl() don't have to worry about it.  But doing the
stashing transparently is somewhat messy.

Given that the callers already have memory allocated to save the
GPADL handle, a little refactoring would make for a much cleaner solution.
Instead of having memory allocated for the 32-bit GPADL handle, callers
should allocate the slightly larger struct vmbus_gpadl that you've
defined below.  The calling interfaces can be updated to take a pointer
to this structure instead of a pointer to the 32-bit GPADL handle, and
you can save the memory address and size right along with the GPADL
handle.  This approach touches a few more files, but I think there are
only two callers outside of the channel management code -- netvsc
and hv_uio -- so it's not a big change.
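
A hedged sketch of the suggested interface (the names are illustrative,
not final):

struct vmbus_gpadl {
	u32 gpadl_handle;
	u32 size;
	void *buffer;
};

int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
			  u32 size, struct vmbus_gpadl *gpadl);

int vmbus_teardown_gpadl(struct vmbus_channel *channel,
			 struct vmbus_gpadl *gpadl);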

>  cleanup:
>   spin_lock_irqsave(_connection.channelmsg_lock, flags);
>   list_del(>msglistentry);
> @@ -549,6 +568,13 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
> *channel,
>   }
> 
>   kfree(msginfo);
> +
> + if (ret) {
> + set_memory_encrypted((unsigned long)kbuffer,
> +  HVPFN_UP(size));
> + atomic_dec(&channel->gpadl_index);
> + }
> +
>   return ret;
>  }
> 
> @@ -676,6 +702,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
> 
>   /* Establish the gpadl for the ring buffer */
>   newchannel->ringbuffer_gpadlhandle = 0;
> + atomic_set(&newchannel->gpadl_index, 0);
> 
>   err = __vmbus_establish_gpadl(newchannel, HV_GPADL_RING,
> page_address(newchannel->ringbuffer_page),
> @@ -811,7 +838,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, 
> u32 gpadl_handle)
>   struct vmbus_channel_gpadl_teardown *msg;
>   struct vmbus_channel_msginfo *info;
>   unsigned long flags;
> - int ret;
> + int ret, i;
> 
>   info = kzalloc(sizeof(*info) +
>  sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
> @@ -859,6 +886,19 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, 
> u32 gpadl_handle)
>   spin_unlock_irqrestore(_connection.channelmsg_lock, flags);
> 
>   kfree(info);
> +
> + /* Find gpadl buffer virtual address and size. */
> + for (i = 0; i < VMBUS_GPADL_RANGE_COUNT; i++)
> + if (channel->gpadl_array[i].gpadlhandle == gpadl_handle)
> + break;
> +
> + if (set_memory_encrypted((unsigned long)channel->gpadl_array[i].buffer,
> + HVPFN_UP(channel->gpadl_array[i].size)))
> + 

RE: [PATCH V3 03/13] x86/HV: Add new hvcall guest address host visibility support

2021-08-12 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM

[snip]

> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index ad8a5c586a35..1e4a0882820a 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -29,6 +29,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "../mm_internal.h"
> 
> @@ -1980,15 +1982,11 @@ int set_memory_global(unsigned long addr, int 
> numpages)
>   __pgprot(_PAGE_GLOBAL), 0);
>  }
> 
> -static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> +static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool 
> enc)
>  {
>   struct cpa_data cpa;
>   int ret;
> 
> - /* Nothing to do if memory encryption is not active */
> - if (!mem_encrypt_active())
> - return 0;
> -
>   /* Should not be working on unaligned addresses */
>   if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>   addr &= PAGE_MASK;
> @@ -2023,6 +2021,17 @@ static int __set_memory_enc_dec(unsigned long addr, 
> int numpages, bool enc)
>   return ret;
>  }
> 
> +static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> +{
> + if (hv_is_isolation_supported())
> + return hv_set_mem_host_visibility(addr, numpages, !enc);
> +
> + if (mem_encrypt_active())
> + return __set_memory_enc_pgtable(addr, numpages, enc);
> +
> + return 0;
> +}
> +

FYI, this not-yet-accepted patch
https://lore.kernel.org/lkml/ab5a7a983a943e7ca0a7ad28275a2d094c62c371.1623421410.git.ashish.ka...@amd.com/
looks to be providing a generic hook to notify the hypervisor when the
encryption status of a memory range changes.

Michael



RE: [PATCH V3 03/13] x86/HV: Add new hvcall guest address host visibility support

2021-08-12 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> Subject: [PATCH V3 03/13] x86/HV: Add new hvcall guest address host 
> visibility support

Use "x86/hyperv:" tag in the Subject line.

> 
> From: Tianyu Lan 
> 
> Add new hvcall guest address host visibility support to mark
> memory visible to host. Call it inside set_memory_decrypted
> /encrypted(). Add HYPERVISOR feature check in the
> hv_is_isolation_supported() to optimize in non-virtualization
> environment.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v2:
>* Rework __set_memory_enc_dec() and call Hyper-V and AMD function
>  according to platform check.
> 
> Change since v1:
>* Use new staic call x86_set_memory_enc to avoid add Hyper-V
>  specific check in the set_memory code.
> ---
>  arch/x86/hyperv/Makefile   |   2 +-
>  arch/x86/hyperv/hv_init.c  |   6 ++
>  arch/x86/hyperv/ivm.c  | 114 +
>  arch/x86/include/asm/hyperv-tlfs.h |  20 +
>  arch/x86/include/asm/mshyperv.h|   4 +-
>  arch/x86/mm/pat/set_memory.c   |  19 +++--
>  include/asm-generic/hyperv-tlfs.h  |   1 +
>  include/asm-generic/mshyperv.h |   1 +
>  8 files changed, 160 insertions(+), 7 deletions(-)
>  create mode 100644 arch/x86/hyperv/ivm.c
> 
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index 48e2c51464e8..5d2de10809ae 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -obj-y:= hv_init.o mmu.o nested.o irqdomain.o
> +obj-y:= hv_init.o mmu.o nested.o irqdomain.o ivm.o
>  obj-$(CONFIG_X86_64) += hv_apic.o hv_proc.o
> 
>  ifdef CONFIG_X86_64
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 0bb4d9ca7a55..b3683083208a 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -607,6 +607,12 @@ EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> 
>  bool hv_is_isolation_supported(void)
>  {
> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> + return 0;
> +
> + if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> + return 0;
> +
>   return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;

Could all of the tests in this function be run at initialization time, and
a single Boolean value pre-computed for this function to return?  I don't
think any of tests would change during the lifetime of the Linux instance,
so running the tests every time is slower than it needs to be.
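
A hedged sketch of that approach (the init hook name is made up purely
for illustration):

static bool hv_isolation;	/* computed once at boot */

void __init hv_isolation_init(void)	/* hypothetical one-time setup */
{
	hv_isolation = cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
		       hypervisor_is_type(X86_HYPER_MS_HYPERV) &&
		       hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
}

bool hv_is_isolation_supported(void)
{
	return hv_isolation;
}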

>  }
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> new file mode 100644
> index ..8c905ffdba7f
> --- /dev/null
> +++ b/arch/x86/hyperv/ivm.c
> @@ -0,0 +1,114 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Hyper-V Isolation VM interface with paravisor and hypervisor
> + *
> + * Author:
> + *  Tianyu Lan 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
> + *
> + * In Isolation VM, all guest memory is encripted from host and guest
> + * needs to set memory visible to host via hvcall before sharing memory
> + * with host.
> + */
> +int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
> +enum hv_mem_host_visibility visibility)
> +{
> + struct hv_gpa_range_for_visibility **input_pcpu, *input;
> + u16 pages_processed;
> + u64 hv_status;
> + unsigned long flags;
> +
> + /* no-op if partition isolation is not enabled */
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
> + pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
> + HV_MAX_MODIFY_GPA_REP_COUNT);
> + return -EINVAL;
> + }
> +
> + local_irq_save(flags);
> + input_pcpu = (struct hv_gpa_range_for_visibility **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> + input = *input_pcpu;
> + if (unlikely(!input)) {
> + local_irq_restore(flags);
> + return -EINVAL;
> + }
> +
> + input->partition_id = HV_PARTITION_ID_SELF;
> + input->host_visibility = visibility;
> + input->reserved0 = 0;
> + input->reserved1 = 0;
> + memcpy((void *)input->gpa_page_list, pfn, count * sizeof(*pfn));
> + hv_status = hv_do_rep_hypercall(
> + HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
> + 0, input, &pages_processed);
> + local_irq_restore(flags);
> +
> + if (!(hv_status & HV_HYPERCALL_RESULT_MASK))
> + return 0;

pages_processed should also be checked to ensure that it equals count.
If not, something has gone wrong in the hypercall.
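
In other words, something like this (a hedged sketch of the extra check):

	if (hv_status & HV_HYPERCALL_RESULT_MASK)
		return hv_status & HV_HYPERCALL_RESULT_MASK;

	if (pages_processed != count)
		return -EIO;	/* hypercall succeeded but skipped pages */

	return 0;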

> +
> + return hv_status & HV_HYPERCALL_RESULT_MASK;
> +}
> +EXPORT_SYMBOL(hv_mark_gpa_visibility);
> +
> +static int 

RE: [PATCH V3 02/13] x86/HV: Initialize shared memory boundary in the Isolation VM.

2021-08-12 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> Subject: [PATCH V3 02/13] x86/HV: Initialize shared memory boundary in the 
> Isolation VM.

As with Patch 1, use the "x86/hyperv:" tag in the Subject line.

> 
> From: Tianyu Lan 
> 
> Hyper-V exposes shared memory boundary via cpuid
> HYPERV_CPUID_ISOLATION_CONFIG and store it in the
> shared_gpa_boundary of ms_hyperv struct. This prepares
> to share memory with host for SNP guest.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/kernel/cpu/mshyperv.c |  2 ++
>  include/asm-generic/mshyperv.h | 12 +++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 6b5835a087a3..2b7f396ef1a5 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -313,6 +313,8 @@ static void __init ms_hyperv_init_platform(void)
>   if (ms_hyperv.priv_high & HV_ISOLATION) {
>   ms_hyperv.isolation_config_a = 
> cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
>   ms_hyperv.isolation_config_b = 
> cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
> + ms_hyperv.shared_gpa_boundary =
> + (u64)1 << ms_hyperv.shared_gpa_boundary_bits;

You could use BIT_ULL() here, but it's kind of a shrug.

> 
>   pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 
> 0x%x\n",
>   ms_hyperv.isolation_config_a, 
> ms_hyperv.isolation_config_b);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 4269f3174e58..aa26d24a5ca9 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -35,8 +35,18 @@ struct ms_hyperv_info {
>   u32 max_vp_index;
>   u32 max_lp_index;
>   u32 isolation_config_a;
> - u32 isolation_config_b;
> + union {
> + u32 isolation_config_b;
> + struct {
> + u32 cvm_type : 4;
> + u32 Reserved11 : 1;
> + u32 shared_gpa_boundary_active : 1;
> + u32 shared_gpa_boundary_bits : 6;
> + u32 Reserved12 : 20;

Any reason to name the reserved fields as "11" and "12"?  It
just looks a bit unusual.  And I'd suggest lowercase "r".

> + };
> + };
>   void  __percpu **ghcb_base;
> + u64 shared_gpa_boundary;
>  };
>  extern struct ms_hyperv_info ms_hyperv;
> 
> --
> 2.25.1




RE: [PATCH V3 01/13] x86/HV: Initialize GHCB page in Isolation VM

2021-08-12 Thread Michael Kelley
From: Tianyu Lan  Sent: Monday, August 9, 2021 10:56 AM
> Subject: [PATCH V3 01/13] x86/HV: Initialize GHCB page in Isolation VM

The subject line tag on patches under arch/x86/hyperv is generally 
"x86/hyperv:".
There's some variation in the spelling of "hyperv", but let's go with the all
lowercase "hyperv".

> 
> Hyper-V exposes GHCB page via SEV ES GHCB MSR for SNP guest
> to communicate with hypervisor. Map GHCB page for all
> cpus to read/write MSR register and submit hvcall request
> via GHCB.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/hyperv/hv_init.c   | 66 +++--
>  arch/x86/include/asm/mshyperv.h |  2 +
>  include/asm-generic/mshyperv.h  |  2 +
>  3 files changed, 66 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 708a2712a516..0bb4d9ca7a55 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -42,6 +43,31 @@ static void *hv_hypercall_pg_saved;
>  struct hv_vp_assist_page **hv_vp_assist_page;
>  EXPORT_SYMBOL_GPL(hv_vp_assist_page);
> 
> +static int hyperv_init_ghcb(void)
> +{
> + u64 ghcb_gpa;
> + void *ghcb_va;
> + void **ghcb_base;
> +
> + if (!ms_hyperv.ghcb_base)
> + return -EINVAL;
> +
> + /*
> +  * GHCB page is allocated by paravisor. The address
> +  * returned by MSR_AMD64_SEV_ES_GHCB is above shared
> +  * ghcb boundary and map it here.
> +  */
> + rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
> + ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> + if (!ghcb_va)
> + return -ENOMEM;
> +
> + ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> + *ghcb_base = ghcb_va;
> +
> + return 0;
> +}
> +
>  static int hv_cpu_init(unsigned int cpu)
>  {
>   union hv_vp_assist_msr_contents msr = { 0 };
> @@ -85,6 +111,8 @@ static int hv_cpu_init(unsigned int cpu)
>   }
>   }
> 
> + hyperv_init_ghcb();
> +
>   return 0;
>  }
> 
> @@ -177,6 +205,14 @@ static int hv_cpu_die(unsigned int cpu)
>  {
>   struct hv_reenlightenment_control re_ctrl;
>   unsigned int new_cpu;
> + void **ghcb_va = NULL;

I'm not seeing any reason why this needs to be initialized.

> +
> + if (ms_hyperv.ghcb_base) {
> + ghcb_va = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> + if (*ghcb_va)
> + memunmap(*ghcb_va);
> + *ghcb_va = NULL;
> + }
> 
>   hv_common_cpu_die(cpu);
> 
> @@ -383,9 +419,19 @@ void __init hyperv_init(void)
>   VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
>   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
>   __builtin_return_address(0));
> - if (hv_hypercall_pg == NULL) {
> - wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> - goto remove_cpuhp_state;
> + if (hv_hypercall_pg == NULL)
> + goto clean_guest_os_id;
> +
> + if (hv_isolation_type_snp()) {
> + ms_hyperv.ghcb_base = alloc_percpu(void *);
> + if (!ms_hyperv.ghcb_base)
> + goto clean_guest_os_id;
> +
> + if (hyperv_init_ghcb()) {
> + free_percpu(ms_hyperv.ghcb_base);
> + ms_hyperv.ghcb_base = NULL;
> + goto clean_guest_os_id;
> + }

Having the GHCB setup code here splits the hypercall page setup into
two parts, which is unexpected.  First the memory is allocated
for the hypercall page, then the GHCB stuff is done, then the hypercall
MSR is set up.  Is there a need to do this split?  Also, if the GHCB stuff
fails and you goto clean_guest_os_id, the memory allocated for the
hypercall page is never freed.

It's also unexpected to have hyperv_init_ghcb() called here and called
in hv_cpu_init().  Wouldn't it be possible to setup ghcb_base *before*
cpu_setup_state() is called, so that hv_cpu_init() would take care of
calling hyperv_init_ghcb() for the boot CPU?  That's the pattern used
by the VP assist page, the percpu input page, etc.
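
A hedged sketch of that ordering in hyperv_init() (simplified, with error
handling elided):

	if (hv_isolation_type_snp()) {
		ms_hyperv.ghcb_base = alloc_percpu(void *);
		if (!ms_hyperv.ghcb_base)
			goto free_vp_assist_page;
	}

	/* hv_cpu_init() then maps the GHCB for the boot CPU as well */
	cpuhp = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
				  hv_cpu_init, hv_cpu_die);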

>   }
> 
>   rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> @@ -456,7 +502,8 @@ void __init hyperv_init(void)
>   hv_query_ext_cap(0);
>   return;
> 
> -remove_cpuhp_state:
> +clean_guest_os_id:
> + wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>   cpuhp_remove_state(cpuhp);
>  free_vp_assist_page:
>   kfree(hv_vp_assist_page);
> @@ -484,6 +531,9 @@ void hyperv_cleanup(void)
>*/
>   hv_hypercall_pg = NULL;
> 
> + if (ms_hyperv.ghcb_base)
> + free_percpu(ms_hyperv.ghcb_base);
> +

I don't think this cleanup is necessary.  The primary purpose of
hyperv_cleanup() is to ensure that things like overlay pages are
properly reset in Hyper-V before doing a kexec(), or before
panic'ing and running the kdump kernel.  There's no need to do
general 

RE: [PATCH v4 07/15] x86/paravirt: switch time pvops functions to use static_call()

2021-01-24 Thread Michael Kelley
From: Juergen Gross  Sent: Wednesday, January 20, 2021 5:56 AM
> 
> The time pvops functions are the only ones left which might be
> used in 32-bit mode and which return a 64-bit value.
> 
> Switch them to use the static_call() mechanism instead of pvops, as
> this allows quite some simplification of the pvops implementation.
> 
> Signed-off-by: Juergen Gross 
> ---
> V4:
> - drop paravirt_time.h again
> - don't move Hyper-V code (Michael Kelley)
> ---
>  arch/x86/Kconfig  |  1 +
>  arch/x86/include/asm/mshyperv.h   |  2 +-
>  arch/x86/include/asm/paravirt.h   | 17 ++---
>  arch/x86/include/asm/paravirt_types.h |  6 --
>  arch/x86/kernel/cpu/vmware.c  |  5 +++--
>  arch/x86/kernel/kvm.c |  2 +-
>  arch/x86/kernel/kvmclock.c|  2 +-
>  arch/x86/kernel/paravirt.c| 16 
>  arch/x86/kernel/tsc.c |  2 +-
>  arch/x86/xen/time.c   | 11 ---
>  drivers/clocksource/hyperv_timer.c|  5 +++--
>  drivers/xen/time.c|  2 +-
>  12 files changed, 42 insertions(+), 29 deletions(-)
> 

[snip]

> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 30f76b966857..b4ee331d29a7 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -63,7 +63,7 @@ typedef int (*hyperv_fill_flush_list_func)(
>  static __always_inline void hv_setup_sched_clock(void *sched_clock)
>  {
>  #ifdef CONFIG_PARAVIRT
> - pv_ops.time.sched_clock = sched_clock;
> + paravirt_set_sched_clock(sched_clock);
>  #endif
>  }
> 

This looks fine.

[snip]

> diff --git a/drivers/clocksource/hyperv_timer.c 
> b/drivers/clocksource/hyperv_timer.c
> index ba04cb381cd3..bf3bf20bc6bd 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -445,7 +446,7 @@ static bool __init hv_init_tsc_clocksource(void)
>   clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
> 
>   hv_sched_clock_offset = hv_read_reference_counter();
> - hv_setup_sched_clock(read_hv_sched_clock_tsc);
> + paravirt_set_sched_clock(read_hv_sched_clock_tsc);
> 
>   return true;
>  }
> @@ -470,6 +471,6 @@ void __init hv_init_clocksource(void)
>   clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
> 
>   hv_sched_clock_offset = hv_read_reference_counter();
> - hv_setup_sched_clock(read_hv_sched_clock_msr);
> + static_call_update(pv_sched_clock, read_hv_sched_clock_msr);
>  }
>  EXPORT_SYMBOL_GPL(hv_init_clocksource);

The changes to hyperv_timer.c aren't needed and shouldn't be
there, so as to preserve hyperv_timer.c as architecture neutral.  With
your update to hv_setup_sched_clock() in mshyperv.h, the original
code works correctly.  While there are two call sites for
hv_setup_sched_clock(), only one is called.  And once the sched clock
function is set, it is never changed or overridden.
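
In other words, hyperv_timer.c keeps its original, arch-neutral call,
and the updated x86 wrapper forwards it to paravirt_set_sched_clock():

    hv_sched_clock_offset = hv_read_reference_counter();
    hv_setup_sched_clock(read_hv_sched_clock_tsc);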

Michael



RE: [PATCH v3 06/15] x86/paravirt: switch time pvops functions to use static_call()

2020-12-17 Thread Michael Kelley
From: Juergen Gross  Sent: Thursday, December 17, 2020 1:31 AM

> The time pvops functions are the only ones left which might be
> used in 32-bit mode and which return a 64-bit value.
> 
> Switch them to use the static_call() mechanism instead of pvops, as
> this allows quite some simplification of the pvops implementation.
> 
> Due to include hell this requires to split out the time interfaces
> into a new header file.
> 
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/Kconfig  |  1 +
>  arch/x86/include/asm/mshyperv.h   | 11 
>  arch/x86/include/asm/paravirt.h   | 14 --
>  arch/x86/include/asm/paravirt_time.h  | 38 +++
>  arch/x86/include/asm/paravirt_types.h |  6 -
>  arch/x86/kernel/cpu/vmware.c  |  5 ++--
>  arch/x86/kernel/kvm.c |  3 ++-
>  arch/x86/kernel/kvmclock.c|  3 ++-
>  arch/x86/kernel/paravirt.c| 16 ---
>  arch/x86/kernel/tsc.c |  3 ++-
>  arch/x86/xen/time.c   | 12 -
>  drivers/clocksource/hyperv_timer.c|  5 ++--
>  drivers/xen/time.c|  3 ++-
>  kernel/sched/sched.h  |  1 +
>  14 files changed, 71 insertions(+), 50 deletions(-)
>  create mode 100644 arch/x86/include/asm/paravirt_time.h
>

[snip]
 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index ffc289992d1b..45942d420626 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -56,17 +56,6 @@ typedef int (*hyperv_fill_flush_list_func)(
>  #define hv_get_raw_timer() rdtsc_ordered()
>  #define hv_get_vector() HYPERVISOR_CALLBACK_VECTOR
> 
> -/*
> - * Reference to pv_ops must be inline so objtool
> - * detection of noinstr violations can work correctly.
> - */
> -static __always_inline void hv_setup_sched_clock(void *sched_clock)
> -{
> -#ifdef CONFIG_PARAVIRT
> - pv_ops.time.sched_clock = sched_clock;
> -#endif
> -}
> -
>  void hyperv_vector_handler(struct pt_regs *regs);
> 
>  static inline void hv_enable_stimer0_percpu_irq(int irq) {}

[snip]

> diff --git a/drivers/clocksource/hyperv_timer.c 
> b/drivers/clocksource/hyperv_timer.c
> index ba04cb381cd3..1ed79993fc50 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  static struct clock_event_device __percpu *hv_clock_event;
>  static u64 hv_sched_clock_offset __ro_after_init;
> @@ -445,7 +446,7 @@ static bool __init hv_init_tsc_clocksource(void)
>   clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
> 
>   hv_sched_clock_offset = hv_read_reference_counter();
> - hv_setup_sched_clock(read_hv_sched_clock_tsc);
> + paravirt_set_sched_clock(read_hv_sched_clock_tsc);
> 
>   return true;
>  }
> @@ -470,6 +471,6 @@ void __init hv_init_clocksource(void)
>   clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
> 
>   hv_sched_clock_offset = hv_read_reference_counter();
> - hv_setup_sched_clock(read_hv_sched_clock_msr);
> + static_call_update(pv_sched_clock, read_hv_sched_clock_msr);
>  }
>  EXPORT_SYMBOL_GPL(hv_init_clocksource);

These Hyper-V changes are problematic as we want to keep hyperv_timer.c
architecture independent.  While only the code for x86/x64 is currently
accepted upstream, code for ARM64 support is in progress.  So we need
to use hv_setup_sched_clock() in hyperv_timer.c, and have the per-arch
implementation in mshyperv.h.

Michael



Re: [Xen-devel] [PATCH v3 3/3] x86/hyperv: L0 assisted TLB flush

2020-02-17 Thread Michael Kelley
From: Wei Liu  On Behalf Of Wei Liu

[snip]

> diff --git a/xen/arch/x86/guest/hyperv/util.c 
> b/xen/arch/x86/guest/hyperv/util.c
> new file mode 100644
> index 00..0abb37b05f
> --- /dev/null
> +++ b/xen/arch/x86/guest/hyperv/util.c
> @@ -0,0 +1,74 @@
> +/**
> 
> + * arch/x86/guest/hyperv/util.c
> + *
> + * Hyper-V utility functions
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; If not, see https://www.gnu.org/licenses/.
> + *
> + * Copyright (c) 2020 Microsoft.
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#include "private.h"
> +
> +int cpumask_to_vpset(struct hv_vpset *vpset,
> + const cpumask_t *mask)
> +{
> +int nr = 1;
> +unsigned int cpu, vcpu_bank, vcpu_offset;
> +unsigned int max_banks = ms_hyperv.max_vp_index / 64;
> +
> +/* Up to 64 banks can be represented by valid_bank_mask */
> +if ( max_banks > 64 )
> +return -E2BIG;
> +
> +/* Clear all banks to avoid flushing unwanted CPUs */
> +for ( vcpu_bank = 0; vcpu_bank < max_banks; vcpu_bank++ )
> +vpset->bank_contents[vcpu_bank] = 0;
> +
> +vpset->valid_bank_mask = 0;
> +vpset->format = HV_GENERIC_SET_SPARSE_4K;
> +
> +for_each_cpu ( cpu, mask )
> +{
> +unsigned int vcpu = hv_vp_index(cpu);
> +
> +vcpu_bank = vcpu / 64;
> +vcpu_offset = vcpu % 64;
> +
> +__set_bit(vcpu_offset, &vpset->bank_contents[vcpu_bank]);
> +__set_bit(vcpu_bank, &vpset->valid_bank_mask);

This approach to setting the bits in the valid_bank_mask causes a bug.
If an entire 64-bit word in the bank_contents array is zero because there
are no CPUs in that range, the corresponding bit in valid_bank_mask still
must be set to tell Hyper-V that the 64-bit word is present in the array
and should be processed, even though the content is zero.  A zero bit
in valid_bank_mask indicates that the corresponding 64-bit word in the
array is not present, and every 64-bit word above it has been shifted down.
That's why the similar Linux function sets valid_bank_mask the way that
it does.
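
Something along these lines is what Hyper-V expects (a sketch only,
keeping your names and the rest of the function as posted; the point is
that valid_bank_mask covers every bank below 'nr'):

    for_each_cpu ( cpu, mask )
    {
        unsigned int vcpu = hv_vp_index(cpu);

        vcpu_bank = vcpu / 64;
        vcpu_offset = vcpu % 64;

        __set_bit(vcpu_offset, &vpset->bank_contents[vcpu_bank]);

        if ( vcpu_bank >= nr )
            nr = vcpu_bank + 1;
    }

    /* All banks below 'nr' are present, even those whose content is 0 */
    vpset->valid_bank_mask = (nr < 64) ? (1ULL << nr) - 1 : ~0ULL;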

Michael

> +
> +if ( vcpu_bank >= nr )
> +nr = vcpu_bank + 1;
> +}
> +
> +return nr;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> --
> 2.20.1



Re: [Xen-devel] [PATCH v2 3/3] x86/hyperv: L0 assisted TLB flush

2020-02-14 Thread Michael Kelley
From: Wei Liu  On Behalf Of Wei Liu Sent: Friday, 
February 14, 2020 4:35 AM
> 
> Implement L0 assisted TLB flush for Xen on Hyper-V. It takes advantage
> of several hypercalls:
> 
>  * HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST
>  * HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
>  * HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE
>  * HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX
> 
> Pick the most efficient hypercalls available.
> 
> Signed-off-by: Wei Liu 
> ---
> v2:
> 1. Address Roger and Jan's comments re types etc.
> 2. Fix pointer arithmetic.
> 3. Misc improvement to code.
> ---
>  xen/arch/x86/guest/hyperv/Makefile  |   1 +
>  xen/arch/x86/guest/hyperv/private.h |   9 ++
>  xen/arch/x86/guest/hyperv/tlb.c | 172 +++-
>  xen/arch/x86/guest/hyperv/util.c|  74 
>  4 files changed, 255 insertions(+), 1 deletion(-)
>  create mode 100644 xen/arch/x86/guest/hyperv/util.c
> 
> diff --git a/xen/arch/x86/guest/hyperv/Makefile 
> b/xen/arch/x86/guest/hyperv/Makefile
> index 18902c33e9..0e39410968 100644
> --- a/xen/arch/x86/guest/hyperv/Makefile
> +++ b/xen/arch/x86/guest/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  obj-y += hyperv.o
>  obj-y += tlb.o
> +obj-y += util.o
> diff --git a/xen/arch/x86/guest/hyperv/private.h 
> b/xen/arch/x86/guest/hyperv/private.h
> index 509bedaafa..79a77930a0 100644
> --- a/xen/arch/x86/guest/hyperv/private.h
> +++ b/xen/arch/x86/guest/hyperv/private.h
> @@ -24,12 +24,21 @@
> 
>  #include 
>  #include 
> +#include 
> 
>  DECLARE_PER_CPU(void *, hv_input_page);
>  DECLARE_PER_CPU(void *, hv_vp_assist);
>  DECLARE_PER_CPU(unsigned int, hv_vp_index);
> 
> +static inline unsigned int hv_vp_index(unsigned int cpu)
> +{
> +return per_cpu(hv_vp_index, cpu);
> +}
> +
>  int hyperv_flush_tlb(const cpumask_t *mask, const void *va,
>   unsigned int flags);
> 
> +/* Returns number of banks, -ev if error */
> +int cpumask_to_vpset(struct hv_vpset *vpset, const cpumask_t *mask);
> +
>  #endif /* __XEN_HYPERV_PRIVIATE_H__  */
> diff --git a/xen/arch/x86/guest/hyperv/tlb.c b/xen/arch/x86/guest/hyperv/tlb.c
> index 48f527229e..f68e14f151 100644
> --- a/xen/arch/x86/guest/hyperv/tlb.c
> +++ b/xen/arch/x86/guest/hyperv/tlb.c
> @@ -19,15 +19,185 @@
>   * Copyright (c) 2020 Microsoft.
>   */
> 
> +#include 
>  #include 
>  #include 
> 
> +#include 
> +#include 
> +#include 
> +
>  #include "private.h"
> 
> +/*
> + * It is possible to encode up to 4096 pages using the lower 12 bits
> + * in an element of gva_list
> + */
> +#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
> +
> +static unsigned int fill_gva_list(uint64_t *gva_list, const void *va,
> +  unsigned int order)
> +{
> +unsigned long start = (unsigned long)va;
> +unsigned long end = start + (PAGE_SIZE << order) - 1;
> +unsigned int n = 0;
> +
> +do {
> +unsigned long remain = end - start;

The calculated value here isn't actually the remaining bytes in the
range to flush -- it's one less than the remaining bytes in the range
to flush because of the -1 in the calculation of 'end'.   That difference
will mess up the comparison below against HV_TLB_FLUSH_UNIT
in the case that there are exactly 4096 pages remaining to be
flushed.  It should take the "=" case, but won't.  Also, the
'-1' in 'remain - 1' in the else clause becomes unneeded, and
the 'start = end' assignment then propagates the error.

In the parallel code in Linux, if you follow the call sequence to get to
fill_gva_list(), the 'end' argument is really the address of the first byte
of the first page that isn't in the flush range (i.e., one beyond the true
'end') and so is a bit misnamed.

I think the calculation of 'end' should drop the -1, and perhaps 'end'
should be renamed.
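
Roughly this (untested sketch):

    unsigned long start = (unsigned long)va;
    unsigned long end = start + (PAGE_SIZE << order);  /* one past the range */
    unsigned int n = 0;

    do {
        unsigned long remain = end - start;  /* true bytes remaining */

        gva_list[n] = start & PAGE_MASK;

        if ( remain >= HV_TLB_FLUSH_UNIT )
        {
            gva_list[n] |= ~PAGE_MASK;
            start += HV_TLB_FLUSH_UNIT;
        }
        else
        {
            /* remain is a whole number of pages; encode "additional pages" */
            gva_list[n] |= (remain - 1) >> PAGE_SHIFT;
            start = end;
        }

        n++;
    } while ( start < end );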

Michael

> +
> +gva_list[n] = start & PAGE_MASK;
> +
> +/*
> + * Use lower 12 bits to encode the number of additional pages
> + * to flush
> + */
> +if ( remain >= HV_TLB_FLUSH_UNIT )
> +{
> +gva_list[n] |= ~PAGE_MASK;
> +start += HV_TLB_FLUSH_UNIT;
> +}
> +else if ( remain )
> +{
> +gva_list[n] |= (remain - 1) >> PAGE_SHIFT;
> +start = end;
> +}
> +
> +n++;
> +} while ( start < end );
> +
> +return n;
> +}
> +



Re: [Xen-devel] [PATCH v4 2/7] x86/hyperv: setup hypercall page

2020-01-23 Thread Michael Kelley
From: Jan Beulich  Sent: Thursday, January 23, 2020 3:19 AM
> 
> On 22.01.2020 21:23, Wei Liu wrote:
> > --- a/xen/arch/x86/e820.c
> > +++ b/xen/arch/x86/e820.c
> > @@ -36,6 +36,22 @@ boolean_param("e820-verbose", e820_verbose);
> >  struct e820map e820;
> >  struct e820map __initdata e820_raw;
> >
> > +static unsigned int find_phys_addr_bits(void)
> > +{
> > +uint32_t eax;
> > +unsigned int phys_bits = 36;
> > +
> > +eax = cpuid_eax(0x80000000);
> > +if ( (eax >> 16) == 0x8000 && eax >= 0x80000008 )
> > +{
> > +phys_bits = (uint8_t)cpuid_eax(0x80000008);
> > +if ( phys_bits > PADDR_BITS )
> > +phys_bits = PADDR_BITS;
> > +}
> > +
> > +return phys_bits;
> > +}
> 
> Instead of this, how about pulling further ahead the call to
> early_cpu_init() in setup.c? (Otherwise the function wants to
> be __init at least.)
> 
> > @@ -357,6 +373,21 @@ static unsigned long __init find_max_pfn(void)
> >  max_pfn = end;
> >  }
> >
> > +#ifdef CONFIG_HYPERV_GUEST
> > +{
> > +   /*
> > +* We reserve the top-most page for hypercall page. Adjust
> > +* max_pfn if necessary.
> > +*/
> > +unsigned int phys_bits = find_phys_addr_bits();
> > +unsigned long hcall_pfn =
> > +  ((1ull << phys_bits) - 1) >> PAGE_SHIFT;
> > +
> > +if ( max_pfn >= hcall_pfn )
> > +  max_pfn = hcall_pfn - 1;
> > +}
> > +#endif
> 
> This wants abstracting away: There shouldn't be Hyper-V specific
> code in here if at all possible, and the adjustment also shouldn't
> be made unless actually running on Hyper-V.
> 
> > --- a/xen/arch/x86/guest/hyperv/hyperv.c
> > +++ b/xen/arch/x86/guest/hyperv/hyperv.c
> > @@ -18,17 +18,27 @@
> >   *
> >   * Copyright (c) 2019 Microsoft.
> >   */
> > +#include 
> >  #include 
> 
> Please sort alphabetically.
> 
> > +#include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  struct ms_hyperv_info __read_mostly ms_hyperv;
> >
> > -static const struct hypervisor_ops ops = {
> > -.name = "Hyper-V",
> > -};
> > +static uint64_t generate_guest_id(void)
> > +{
> > +uint64_t id = 0;
> 
> Pointless initializer.
> 
> > +id = (uint64_t)HV_XEN_VENDOR_ID << 48;
> > +id |= (xen_major_version() << 16) | xen_minor_version();
> > +
> > +return id;
> > +}
> >
> > +static const struct hypervisor_ops ops;
> >  const struct hypervisor_ops *__init hyperv_probe(void)
> 
> Blank line between these two please (if you can't get away without
> this declaration in the first place).
> 
> > @@ -72,6 +82,43 @@ const struct hypervisor_ops *__init hyperv_probe(void)
> >  return &ops;
> >  }
> >
> > +static void __init setup_hypercall_page(void)
> > +{
> > +union hv_x64_msr_hypercall_contents hypercall_msr;
> > +union hv_guest_os_id guest_id;
> > +unsigned long mfn;
> > +
> > +rdmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id.raw);
> > +if ( !guest_id.raw )
> > +{
> > +guest_id.raw = generate_guest_id();
> > +wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id.raw);
> > +}
> > +
> > +rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> > +if ( !hypercall_msr.enable )
> > +{
> > +mfn = ((1ull << paddr_bits) - 1) >> HV_HYP_PAGE_SHIFT;
> 
> Along the lines of the abstracting-away request above: How is
> anyone to notice what else needs changing if it is decided
> that this page gets moved elsewhere?
> 
> > +hypercall_msr.enable = 1;
> > +hypercall_msr.guest_physical_address = mfn;
> > +wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> 
> So on Hyper-V the hypervisor magically populates this page as
> a side effect of the MSR write?
> 

Yes, that's exactly what happens. :-)  Hyper-V takes the guest
physical address and remaps it to a different physical page that
Hyper-V has set up with whatever is needed to execute the hypercall
sequence. The original physical page corresponding to the guest physical
address is not modified, but it remains unused and inaccessible to the
guest.  When/if the MSR is written again with the enable flag set to 0,
the mapping is flipped back, and the original physical page, with its
original contents, reappears. There are a few other pages with this
same behavior.  The Hyper-V TLFS refers to these "GPA Overlay
Pages".  See Section 5.2.1 of the TLFS.

Michael

Re: [Xen-devel] [PATCH v4 2/7] x86/hyperv: setup hypercall page

2020-01-22 Thread Michael Kelley
From: Wei Liu  On Behalf Of Wei Liu  Sent: Wednesday, 
January 22, 2020 12:24 PM
> 
> Use the top-most addressable page for that purpose. Adjust e820 code
> accordingly.
> 
> We also need to register Xen's guest OS ID to Hyper-V. Use 0x300 as the
> OS type.
> 
> Signed-off-by: Wei Liu 
> ---
> XXX the decision on Xen's vendor ID is pending.
> 
> v4:
> 1. Use fixmap
> 2. Follow routines listed in TLFS
> ---
>  xen/arch/x86/e820.c | 41 +++
>  xen/arch/x86/guest/hyperv/hyperv.c  | 53 +++--
>  xen/include/asm-x86/guest/hyperv-tlfs.h |  5 ++-
>  3 files changed, 86 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
> index 082f9928a1..5a4ef27a0b 100644
> --- a/xen/arch/x86/e820.c
> +++ b/xen/arch/x86/e820.c
> @@ -36,6 +36,22 @@ boolean_param("e820-verbose", e820_verbose);
>  struct e820map e820;
>  struct e820map __initdata e820_raw;
> 
> +static unsigned int find_phys_addr_bits(void)
> +{
> +uint32_t eax;
> +unsigned int phys_bits = 36;
> +
> +eax = cpuid_eax(0x80000000);
> +if ( (eax >> 16) == 0x8000 && eax >= 0x80000008 )
> +{
> +phys_bits = (uint8_t)cpuid_eax(0x80000008);
> +if ( phys_bits > PADDR_BITS )
> +phys_bits = PADDR_BITS;
> +}
> +
> +return phys_bits;
> +}
> +
>  /*
>   * This function checks if the entire range  is mapped with type.
>   *
> @@ -357,6 +373,21 @@ static unsigned long __init find_max_pfn(void)
>  max_pfn = end;
>  }
> 
> +#ifdef CONFIG_HYPERV_GUEST
> +{
> + /*
> +  * We reserve the top-most page for hypercall page. Adjust
> +  * max_pfn if necessary.
> +  */
> +unsigned int phys_bits = find_phys_addr_bits();
> +unsigned long hcall_pfn =
> +  ((1ull << phys_bits) - 1) >> PAGE_SHIFT;
> +
> +if ( max_pfn >= hcall_pfn )
> +  max_pfn = hcall_pfn - 1;
> +}
> +#endif
> +
>  return max_pfn;
>  }
> 
> @@ -420,7 +451,7 @@ static uint64_t __init mtrr_top_of_ram(void)
>  {
>  uint32_t eax, ebx, ecx, edx;
>  uint64_t mtrr_cap, mtrr_def, addr_mask, base, mask, top;
> -unsigned int i, phys_bits = 36;
> +unsigned int i, phys_bits;
> 
>  /* By default we check only Intel systems. */
>  if ( e820_mtrr_clip == -1 )
> @@ -446,13 +477,7 @@ static uint64_t __init mtrr_top_of_ram(void)
>   return 0;
> 
>  /* Find the physical address size for this CPU. */
> -eax = cpuid_eax(0x80000000);
> -if ( (eax >> 16) == 0x8000 && eax >= 0x80000008 )
> -{
> -phys_bits = (uint8_t)cpuid_eax(0x80000008);
> -if ( phys_bits > PADDR_BITS )
> -phys_bits = PADDR_BITS;
> -}
> +phys_bits = find_phys_addr_bits();
>  addr_mask = ((1ull << phys_bits) - 1) & ~((1ull << 12) - 1);
> 
>  rdmsrl(MSR_MTRRcap, mtrr_cap);
> diff --git a/xen/arch/x86/guest/hyperv/hyperv.c 
> b/xen/arch/x86/guest/hyperv/hyperv.c
> index 8d38313d7a..f986c1a805 100644
> --- a/xen/arch/x86/guest/hyperv/hyperv.c
> +++ b/xen/arch/x86/guest/hyperv/hyperv.c
> @@ -18,17 +18,27 @@
>   *
>   * Copyright (c) 2019 Microsoft.
>   */
> +#include 
>  #include 
> 
> +#include 
>  #include 
>  #include 
> +#include 
> 
>  struct ms_hyperv_info __read_mostly ms_hyperv;
> 
> -static const struct hypervisor_ops ops = {
> -.name = "Hyper-V",
> -};
> +static uint64_t generate_guest_id(void)
> +{
> +uint64_t id = 0;
> +
> +id = (uint64_t)HV_XEN_VENDOR_ID << 48;
> +id |= (xen_major_version() << 16) | xen_minor_version();
> +
> +return id;
> +}
> 
> +static const struct hypervisor_ops ops;
>  const struct hypervisor_ops *__init hyperv_probe(void)
>  {
>  uint32_t eax, ebx, ecx, edx;
> @@ -72,6 +82,43 @@ const struct hypervisor_ops *__init hyperv_probe(void)
>  return &ops;
>  }
> 
> +static void __init setup_hypercall_page(void)
> +{
> +union hv_x64_msr_hypercall_contents hypercall_msr;
> +union hv_guest_os_id guest_id;
> +unsigned long mfn;
> +
> +rdmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id.raw);
> +if ( !guest_id.raw )
> +{
> +guest_id.raw = generate_guest_id();
> +wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id.raw);
> +}
> +
> +rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> +if ( !hypercall_msr.enable )
> +{
> +mfn = ((1ull << paddr_bits) - 1) >> HV_HYP_PAGE_SHIFT;
> +hypercall_msr.enable = 1;
> +hypercall_msr.guest_physical_address = mfn;
> +wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> +} else {
> +mfn = hypercall_msr.guest_physical_address;
> +}
> +
> +set_fixmap_x(FIX_X_HYPERV_HCALL, mfn << PAGE_SHIFT);
> +}
> +
> +static void __init setup(void)
> +{
> +setup_hypercall_page();
> +}
> +
> +static const struct hypervisor_ops ops = {
> +.name = "Hyper-V",
> +.setup = setup,
> +};
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/asm-x86/guest/hyperv-tlfs.h b/xen/include/asm-
> 

Re: [Xen-devel] [PATCH v3 3/5] x86/hyperv: provide percpu hypercall input page

2020-01-07 Thread Michael Kelley
From: Wei Liu  Sent: Tuesday, January 7, 2020 8:34 AM
> 
> On Mon, Jan 06, 2020 at 11:27:18AM +0100, Jan Beulich wrote:
> > On 05.01.2020 17:47, Wei Liu wrote:
> > > Hyper-V's input / output argument must be 8-byte aligned and not cross a
> > > page boundary. The easiest way to satisfy those requirements is to use
> > > percpu page.
> >
> > I'm not sure "easiest" is really true here. Others could consider adding
> > __aligned() attributes as easy or even easier (by being even more
> > transparent to use sites). Could we settle on "One way ..."?
> 
> Do you mean something like
> 
>struct foo __aligned(8);
> 
>hv_do_hypercall(OP, virt_to_maddr(&foo), ...);
> 
> ?
> 
> I don't think this is transparent to user sites. Plus, foo is on stack
> which is 1) difficult to get its maddr, 2) may cross page boundary.
> 
> If I misunderstood what you meant, please give me an example here.
> 
> >
> > > @@ -83,14 +84,33 @@ static void __init setup_hypercall_page(void)
> > >  wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> > >  }
> > >
> > > +static void setup_hypercall_pcpu_arg(void)
> > > +{
> > > +void *mapping;
> > > +
> > > +mapping = alloc_xenheap_page();
> > > +if ( !mapping )
> > > +panic("Failed to allocate hypercall input page for %u\n",
> >
> > "... for CPU%u\n" please.
> >
> > > +  smp_processor_id());
> > > +
> > > +this_cpu(hv_pcpu_input_arg) = mapping;
> >
> > When offlining and then re-onlining a CPU, the prior page will be
> > leaked.
> 
> Right. Thanks for catching this one.
> 
> >
> > > --- a/xen/include/asm-x86/guest/hyperv.h
> > > +++ b/xen/include/asm-x86/guest/hyperv.h
> > > @@ -51,6 +51,8 @@ static inline uint64_t hv_scale_tsc(uint64_t tsc, 
> > > uint64_t scale,
> > >
> > >  #ifdef CONFIG_HYPERV_GUEST
> > >
> > > +#include 
> > > +
> > >  #include 
> > >
> > >  struct ms_hyperv_info {
> > > @@ -63,6 +65,8 @@ struct ms_hyperv_info {
> > >  };
> > >  extern struct ms_hyperv_info ms_hyperv;
> > >
> > > +DECLARE_PER_CPU(void *, hv_pcpu_input_arg);
> >
> > Will this really be needed outside of the file that defines it?
> >
> 
> This can live in a private header for the time being.
> 
> > Also, while looking at this I notice that - despite my earlier
> > comment when giving the respective, sort-of-conditional ack -
> > there are (still) many apparently pointless __packed attributes
> > in hyperv-tlfs.h. Care to comment on this?
> 
> Again, that's a straight import from Linux. I tried not to deviate too
> much. A commit in Linux (ec084491727b0) claims "compiler can add
> alignment padding to structures or reorder struct members for
> randomization and optimization".
> 
> I just checked all the packed structures. They seem to have all the
> required manual paddings already. I can only assume they tried to erred
> on the safe side.

Correct.  The __packed attribute was added only about a year ago
after somebody on LKML noticed that the structures were not packed.
Some discussion ensued, but the consensus was to add __packed due
to general paranoia about what the compiler might do even though
individual fields are aligned to their natural boundary.
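
For example, hv_vpset is (roughly) defined like this in the imported
header -- every field is already naturally aligned and the flexible
array adds no padding, so __packed is just insurance:

    struct hv_vpset {
        u64 format;
        u64 valid_bank_mask;
        u64 bank_contents[];
    } __packed;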

Michael

> 
> Wei.
> 
> >
> > Jan


Re: [Xen-devel] [PATCH 8/8] x86/hyperv: setup VP assist page

2019-12-29 Thread Michael Kelley
From: Wei Liu  On Behalf Of Wei Liu  Sent: Sunday, 
December 29, 2019 10:34 AM
> 
> VP assist page is rather important as we need to toggle some bits in
> that page such that L1 guest can make hypercalls directly to L0 Hyper-V.
> 
> Preemptively split out set_vp_assist page which will be used in the resume
> path.
> 
> Signed-off-by: Wei Liu 
> ---
>  xen/arch/x86/guest/hyperv/hyperv.c | 35 ++
>  xen/include/asm-x86/guest/hyperv.h |  1 +
>  2 files changed, 36 insertions(+)
> 
> diff --git a/xen/arch/x86/guest/hyperv/hyperv.c 
> b/xen/arch/x86/guest/hyperv/hyperv.c
> index da3a8cd85d..a88b9ae6d9 100644
> --- a/xen/arch/x86/guest/hyperv/hyperv.c
> +++ b/xen/arch/x86/guest/hyperv/hyperv.c
> @@ -30,6 +30,7 @@ void *hv_hypercall;
>  static struct page_info *hv_hypercall_page;
>  DEFINE_PER_CPU_READ_MOSTLY(struct hyperv_pcpu_page, hv_pcpu_input_arg);
>  DEFINE_PER_CPU_READ_MOSTLY(unsigned int, hv_vp_index);
> +DEFINE_PER_CPU_READ_MOSTLY(struct hyperv_pcpu_page, hv_vp_assist);
> 
>  static const struct hypervisor_ops ops;
>  const struct hypervisor_ops *__init hyperv_probe(void)
> @@ -125,17 +126,51 @@ static void setup_vp_index(void)
>  this_cpu(hv_vp_index) = vp_index_msr;
>  }
> 
> +static void set_vp_assist(void)
> +{
> +uint64_t val = paddr_to_pfn(this_cpu(hv_vp_assist).maddr);
> +
> +val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) | 

I'd recommend using HV_HYP_PAGE_SHIFT instead of
HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT.  On the Linux side,
I'm planning to remove that #define and the similar
*_PAGE_ADDRESS_MASK in favor of the newer HV_HYP_PAGE_* values.
There's nothing special about the VP assist page, so using the generic
#defines based on the Hyper-V page size is reasonable.
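
I.e., something like:

    val = (val << HV_HYP_PAGE_SHIFT) | HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;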

Michael

> +HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
> +
> +wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
> +}
> +
> +static void setup_vp_assist(void)
> +{
> +struct page_info *pg;
> +void *mapping;
> +unsigned int cpu = smp_processor_id();
> +
> +pg = alloc_domheap_page(NULL, 0);
> +if ( !pg )
> +panic("Failed to allocate vp_assist page for %u\n", cpu);
> +
> +mapping = __map_domain_page_global(pg);
> +if ( !mapping )
> +panic("Failed to map vp_assist page for %u\n", cpu);
> +
> +clear_page(mapping);
> +
> +this_cpu(hv_vp_assist).maddr = page_to_maddr(pg);
> +this_cpu(hv_vp_assist).mapping = mapping;
> +
> +set_vp_assist();
> +}
> +
>  static void __init setup(void)
>  {
>  setup_hypercall_page();
>  setup_hypercall_pcpu_arg();
>  setup_vp_index();
> +setup_vp_assist();
>  }
> 
>  static void ap_setup(void)
>  {
>  setup_hypercall_pcpu_arg();
>  setup_vp_index();
> +setup_vp_assist();
>  }
> 
>  static const struct hypervisor_ops ops = {
> diff --git a/xen/include/asm-x86/guest/hyperv.h 
> b/xen/include/asm-x86/guest/hyperv.h
> index 4b635829f3..917f4e02c2 100644
> --- a/xen/include/asm-x86/guest/hyperv.h
> +++ b/xen/include/asm-x86/guest/hyperv.h
> @@ -71,6 +71,7 @@ struct hyperv_pcpu_page {
>  };
>  DECLARE_PER_CPU(struct hyperv_pcpu_page, hv_pcpu_input_arg);
>  DECLARE_PER_CPU(unsigned int, hv_vp_index);
> +DECLARE_PER_CPU(struct hyperv_pcpu_page, hv_vp_assist);
> 
>  const struct hypervisor_ops *hyperv_probe(void);
> 
> --
> 2.20.1



Re: [Xen-devel] [PATCH 4/8] x86/hyperv: setup hypercall page

2019-12-29 Thread Michael Kelley
From: Wei Liu  On Behalf Of Wei Liu  Sent: Sunday, 
December 29, 2019 10:34 AM
> 
> Signed-off-by: Wei Liu 
> ---
>  xen/arch/x86/guest/hyperv/hyperv.c | 41 +++---
>  1 file changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/x86/guest/hyperv/hyperv.c 
> b/xen/arch/x86/guest/hyperv/hyperv.c
> index c6a26c5453..438910c8cb 100644
> --- a/xen/arch/x86/guest/hyperv/hyperv.c
> +++ b/xen/arch/x86/guest/hyperv/hyperv.c
> @@ -19,16 +19,17 @@
>   * Copyright (c) 2019 Microsoft.
>   */
>  #include 
> +#include 
> 
>  #include 
>  #include 
> 
>  struct ms_hyperv_info __read_mostly ms_hyperv;
> 
> -static const struct hypervisor_ops ops = {
> -.name = "Hyper-V",
> -};
> +void *hv_hypercall;
> +static struct page_info *hv_hypercall_page;
> 
> +static const struct hypervisor_ops ops;
>  const struct hypervisor_ops *__init hyperv_probe(void)
>  {
>  uint32_t eax, ebx, ecx, edx;
> @@ -71,6 +72,40 @@ const struct hypervisor_ops *__init hyperv_probe(void)
>  return &ops;
>  }
> 
> +static void __init setup_hypercall_page(void)
> +{
> +union hv_x64_msr_hypercall_contents hypercall_msr;
> +
> +/* Unfortunately there isn't a really good way to unwind Xen to
> + * not use Hyper-V hooks, so panic if anything goes wrong.
> + *
> + * In practice if page allocation fails this early on it is
> + * unlikely we can get a working system later.
> + */
> +hv_hypercall_page = alloc_domheap_page(NULL, 0);
> +if ( !hv_hypercall_page )
> +panic("Failed to allocate Hyper-V hypercall page\n");
> +
> +hv_hypercall = __map_domain_page_global(hv_hypercall_page);
> +if ( !hv_hypercall )
> +panic("Failed to map Hyper-V hypercall page\n");
> +
> +rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> +hypercall_msr.enable = 1;
> +hypercall_msr.guest_physical_address = page_to_maddr(hv_hypercall_page);

The "guest_physical_address" field is actually the guest physical page number.
So the physical address needs to be right shifted 12 bits before being stored
here.  I'd recommend using HV_HYP_PAGE_SHIFT from hyperv-tlfs.h as
the shift value; it was introduced to deal with the possibility that the page
size used and expected by the Hyper-V interface is different from the page
size used by the guest VM (which can happen on ARM64, though not on x86).
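
I.e., something like (sketch):

    hypercall_msr.guest_physical_address =
        page_to_maddr(hv_hypercall_page) >> HV_HYP_PAGE_SHIFT;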

Michael

> +wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> +}
> +
> +static void __init setup(void)
> +{
> +setup_hypercall_page();
> +}
> +
> +static const struct hypervisor_ops ops = {
> +.name = "Hyper-V",
> +.setup = setup,
> +};
> +
>  /*
>   * Local variables:
>   * mode: C
> --
> 2.20.1



Re: [Xen-devel] [PATCH v2 6/6] x86: implement Hyper-V clock source

2019-12-18 Thread Michael Kelley
From: Durrant, Paul  Sent: Wednesday, December 18, 2019 
7:24 AM

> > From: Wei Liu  On Behalf Of Wei Liu
> > Sent: 18 December 2019 14:43

[snip]

> > +
> > +static inline uint64_t read_hyperv_timer(void)
> > +{
> > +uint64_t scale, offset, ret, tsc;
> > +uint32_t seq;
> > +const struct ms_hyperv_tsc_page *tsc_page = hyperv_tsc;
> > +
> > +do {
> > +seq = tsc_page->tsc_sequence;
> > +
> > +/* Seq 0 is special. It means the TSC enlightenment is not
> > + * available at the moment. The reference time can only be
> > + * obtained from the Reference Counter MSR.
> > + */
> > +if ( seq == 0 )
> 
> Older versions of the spec used to use 0xFFFFFFFF I think, although when I
> look again they seem to have been retro-actively fixed. In any case I think
> you should treat both 0xFFFFFFFF and 0 as invalid.

FWIW, the 0xFFFFFFFF was just a bug in the spec.  Hyper-V implementations only
set the value to 0 to indicate invalid.  The equivalent Linux code checks only 
for 0.

Michael


Re: [Xen-devel] [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-30 Thread Michael Kelley
From: Nadav Amit  Sent: Thursday, July 18, 2019 5:59 PM
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires to check mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should not impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Sasha Levin 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: Juergen Gross 
> Cc: Paolo Bonzini 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Boris Ostrovsky 
> Cc: linux-hyp...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org
> Cc: k...@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit 
> ---
>  arch/x86/hyperv/mmu.c | 10 +++---
>  arch/x86/include/asm/paravirt.h   |  6 ++--
>  arch/x86/include/asm/paravirt_types.h |  4 +--
>  arch/x86/include/asm/tlbflush.h   |  8 ++---
>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>  arch/x86/kernel/kvm.c | 11 +--
>  arch/x86/kernel/paravirt.c|  2 +-
>  arch/x86/mm/tlb.c | 47 ++-
>  arch/x86/xen/mmu_pv.c | 11 +++
>  include/trace/events/xen.h|  2 +-
>  10 files changed, 62 insertions(+), 41 deletions(-)
> 

For the Hyper-V parts --
Reviewed-by: Michael Kelley 

