Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On 06/27/2018 11:58 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2018 at 11:00:05AM +0800, Wei Wang wrote:
> > On 06/27/2018 10:41 AM, Michael S. Tsirkin wrote:
> > > On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:
> > > > On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:
> > > > > On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> > > > > > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > > > > > > > +	if (!arrays)
> > > > > > > > +		return NULL;
> > > > > > > > +
> > > > > > > > +	for (i = 0; i < max_array_num; i++) {
> > > > > > > So we are getting a ton of memory here just to free it up a bit
> > > > > > > later. Why doesn't get_from_free_page_list get the pages from the
> > > > > > > free list for us? We could also avoid the 1st allocation then -
> > > > > > > just build a list of these.
> > > > > > That wouldn't be a good choice for us. If we check how the regular
> > > > > > allocation works, there are many things we need to consider when
> > > > > > pages are allocated to users. For example, we need to take care of
> > > > > > the nr_free counter, we need to check the watermark and perform the
> > > > > > related actions. Also the folks working on arch_alloc_page to
> > > > > > monitor page allocation activities would get a surprise if page
> > > > > > allocation were allowed to work in this way.
> > > > > mm/ code is well positioned to handle all this correctly.
> > > > > > I'm afraid that would be a re-implementation of the alloc functions,
> > > > > A re-factoring - you can share code. The main difference is locking.
> > > > > > and that would be much more complex than what we have. I think your
> > > > > > idea of passing a list of pages is better.
> > > > > >
> > > > > > Best,
> > > > > > Wei
> > > > > How much memory is this allocating anyway?
> > > > For every 2TB of memory that the guest has, we allocate 4MB.
> > > Hmm I guess I'm missing something, I don't see it:
> > >
> > > +	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
> > > +	entries_per_page = PAGE_SIZE / sizeof(__le64);
> > > +	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
> > > +	max_array_num = max_entries / entries_per_array +
> > > +			!!(max_entries % entries_per_array);
> > >
> > > Looks like you always allocate the max number?
> > Yes. We allocate the max number and then free what's not used.
> > For example, for a 16TB guest we allocate four 4MB buffers and pass them
> > to get_from_free_page_list. If it uses 3, the remaining 4MB buffer ends
> > up being freed.
> >
> > For today's guests, max_array_num is usually 1.
> >
> > Best,
> > Wei
> I see, it's based on total ram pages. It's reasonable but might get out
> of sync if memory is onlined quickly. So you want to detect that there's
> more free memory than can fit and retry the reporting.

- AFAIK, memory hotplug isn't expected to happen during live migration
  today. Hypervisors (e.g. QEMU) explicitly forbid this.

- Allocating buffers based on total ram pages already gives some headroom
  for newly plugged memory, if that could happen in any case. Also, we can
  think about why people plug in more memory - usually because the existing
  memory isn't enough, which implies that the free page list is very likely
  to be close to empty.

- This method could easily be scaled if people really need more headroom
  for hot-plugged memory. For example, the calculation could be based on
  "X * total_ram_pages", where X is a number passed from the hypervisor.

- This is an optimization feature, and reporting less free memory in that
  rare case doesn't hurt anything.

So I think it is good to start from a fundamental implementation, which
doesn't confuse people, and complexities can be added when there is a real
need in the future.

Best,
Wei

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On Wed, Jun 27, 2018 at 11:00:05AM +0800, Wei Wang wrote:
> On 06/27/2018 10:41 AM, Michael S. Tsirkin wrote:
> > On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:
> > > On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> > > > > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > > > > > > +	if (!arrays)
> > > > > > > +		return NULL;
> > > > > > > +
> > > > > > > +	for (i = 0; i < max_array_num; i++) {
> > > > > > So we are getting a ton of memory here just to free it up a bit
> > > > > > later. Why doesn't get_from_free_page_list get the pages from the
> > > > > > free list for us? We could also avoid the 1st allocation then -
> > > > > > just build a list of these.
> > > > > That wouldn't be a good choice for us. If we check how the regular
> > > > > allocation works, there are many things we need to consider when
> > > > > pages are allocated to users. For example, we need to take care of
> > > > > the nr_free counter, we need to check the watermark and perform the
> > > > > related actions. Also the folks working on arch_alloc_page to
> > > > > monitor page allocation activities would get a surprise if page
> > > > > allocation were allowed to work in this way.
> > > > mm/ code is well positioned to handle all this correctly.
> > > > > I'm afraid that would be a re-implementation of the alloc functions,
> > > > A re-factoring - you can share code. The main difference is locking.
> > > > > and that would be much more complex than what we have. I think your
> > > > > idea of passing a list of pages is better.
> > > > >
> > > > > Best,
> > > > > Wei
> > > > How much memory is this allocating anyway?
> > > For every 2TB of memory that the guest has, we allocate 4MB.
> > Hmm I guess I'm missing something, I don't see it:
> >
> > +	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
> > +	entries_per_page = PAGE_SIZE / sizeof(__le64);
> > +	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
> > +	max_array_num = max_entries / entries_per_array +
> > +			!!(max_entries % entries_per_array);
> >
> > Looks like you always allocate the max number?
> Yes. We allocate the max number and then free what's not used.
> For example, for a 16TB guest we allocate four 4MB buffers and pass them
> to get_from_free_page_list. If it uses 3, the remaining 4MB buffer ends
> up being freed.
>
> For today's guests, max_array_num is usually 1.
>
> Best,
> Wei

I see, it's based on total ram pages. It's reasonable but might get out
of sync if memory is onlined quickly. So you want to detect that there's
more free memory than can fit and retry the reporting.
Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On 06/27/2018 10:41 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:
> > On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> > > > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > > > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > > > > > +	if (!arrays)
> > > > > > +		return NULL;
> > > > > > +
> > > > > > +	for (i = 0; i < max_array_num; i++) {
> > > > > So we are getting a ton of memory here just to free it up a bit
> > > > > later. Why doesn't get_from_free_page_list get the pages from the
> > > > > free list for us? We could also avoid the 1st allocation then -
> > > > > just build a list of these.
> > > > That wouldn't be a good choice for us. If we check how the regular
> > > > allocation works, there are many things we need to consider when
> > > > pages are allocated to users. For example, we need to take care of
> > > > the nr_free counter, we need to check the watermark and perform the
> > > > related actions. Also the folks working on arch_alloc_page to
> > > > monitor page allocation activities would get a surprise if page
> > > > allocation were allowed to work in this way.
> > > mm/ code is well positioned to handle all this correctly.
> > > > I'm afraid that would be a re-implementation of the alloc functions,
> > > A re-factoring - you can share code. The main difference is locking.
> > > > and that would be much more complex than what we have. I think your
> > > > idea of passing a list of pages is better.
> > > >
> > > > Best,
> > > > Wei
> > > How much memory is this allocating anyway?
> > For every 2TB of memory that the guest has, we allocate 4MB.
> Hmm I guess I'm missing something, I don't see it:
>
> +	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
> +	entries_per_page = PAGE_SIZE / sizeof(__le64);
> +	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
> +	max_array_num = max_entries / entries_per_array +
> +			!!(max_entries % entries_per_array);
>
> Looks like you always allocate the max number?

Yes. We allocate the max number and then free what's not used.
For example, for a 16TB guest we allocate four 4MB buffers and pass them
to get_from_free_page_list. If it uses 3, the remaining 4MB buffer ends up
being freed.

For today's guests, max_array_num is usually 1.

Best,
Wei
Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:
> On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> > > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > > > > +	if (!arrays)
> > > > > +		return NULL;
> > > > > +
> > > > > +	for (i = 0; i < max_array_num; i++) {
> > > > So we are getting a ton of memory here just to free it up a bit
> > > > later. Why doesn't get_from_free_page_list get the pages from the
> > > > free list for us? We could also avoid the 1st allocation then -
> > > > just build a list of these.
> > > That wouldn't be a good choice for us. If we check how the regular
> > > allocation works, there are many things we need to consider when
> > > pages are allocated to users. For example, we need to take care of
> > > the nr_free counter, we need to check the watermark and perform the
> > > related actions. Also the folks working on arch_alloc_page to
> > > monitor page allocation activities would get a surprise if page
> > > allocation were allowed to work in this way.
> > mm/ code is well positioned to handle all this correctly.
> > > I'm afraid that would be a re-implementation of the alloc functions,
> > A re-factoring - you can share code. The main difference is locking.
> > > and that would be much more complex than what we have. I think your
> > > idea of passing a list of pages is better.
> > >
> > > Best,
> > > Wei
> > How much memory is this allocating anyway?
> For every 2TB of memory that the guest has, we allocate 4MB.

Hmm I guess I'm missing something, I don't see it:

+	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
+	entries_per_page = PAGE_SIZE / sizeof(__le64);
+	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
+	max_array_num = max_entries / entries_per_array +
+			!!(max_entries % entries_per_array);

Looks like you always allocate the max number?

> This is the same for both cases.
> For today's guests, usually there will be only one 4MB buffer allocated
> and passed to get_from_free_page_list.
>
> Best,
> Wei
Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > > > +	if (!arrays)
> > > > +		return NULL;
> > > > +
> > > > +	for (i = 0; i < max_array_num; i++) {
> > > So we are getting a ton of memory here just to free it up a bit
> > > later. Why doesn't get_from_free_page_list get the pages from the
> > > free list for us? We could also avoid the 1st allocation then -
> > > just build a list of these.
> > That wouldn't be a good choice for us. If we check how the regular
> > allocation works, there are many things we need to consider when pages
> > are allocated to users. For example, we need to take care of the
> > nr_free counter, we need to check the watermark and perform the
> > related actions. Also the folks working on arch_alloc_page to monitor
> > page allocation activities would get a surprise if page allocation
> > were allowed to work in this way.
> mm/ code is well positioned to handle all this correctly.
> > I'm afraid that would be a re-implementation of the alloc functions,
> A re-factoring - you can share code. The main difference is locking.
> > and that would be much more complex than what we have. I think your
> > idea of passing a list of pages is better.
> >
> > Best,
> > Wei
> How much memory is this allocating anyway?

For every 2TB of memory that the guest has, we allocate 4MB. This is the
same for both cases.
For today's guests, usually there will be only one 4MB buffer allocated
and passed to get_from_free_page_list.

Best,
Wei
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > Might not necessarily be something wrong, but it's very limited
> > > > > > to prohibit the MAC of the VF from changing when enslaved by
> > > > > > failover.
> > > > > You mean the guest changing the MAC? I'm not sure why we prohibit
> > > > > that.
> > > > I think Sridhar and Jiri might be better people to answer it. My
> > > > impression was that sync'ing the MAC address change between all 3
> > > > devices is challenging, as the failover driver uses the MAC address
> > > > to match net_devices internally.
> > >
> > > Yes. The MAC address is assigned by the hypervisor and it needs to
> > > manage the movement of the MAC between the PF and VF. Allowing the
> > > guest to change the MAC will require synchronization between the
> > > hypervisor and the PF/VF drivers. Most of the VF drivers don't allow
> > > changing the guest MAC unless it is a trusted VF.
> >
> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > For example I can see the host just failing VIRTIO_NET_CTRL_MAC_ADDR_SET
> > if it wants to block it. I'm not sure why VIRTIO_NET_F_STANDBY has to
> > block it in the guest.
>
> That's why I think pairing using the MAC is fragile IMHO. When the VF's
> MAC gets changed before virtio attempts to match and pair the device, it
> ends up with no pairing found at all.

The guest seems to match on the hardware MAC and ignore whatever is set by
the user. Makes sense to me and should not be fragile.

> UUID is better.
>
> -Siwei
>
> > --
> > MST
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > Might not necessarily be something wrong, but it's very limited
> > > > > to prohibit the MAC of the VF from changing when enslaved by
> > > > > failover.
> > > > You mean the guest changing the MAC? I'm not sure why we prohibit
> > > > that.
> > > I think Sridhar and Jiri might be better people to answer it. My
> > > impression was that sync'ing the MAC address change between all 3
> > > devices is challenging, as the failover driver uses the MAC address
> > > to match net_devices internally.
> >
> > Yes. The MAC address is assigned by the hypervisor and it needs to
> > manage the movement of the MAC between the PF and VF. Allowing the
> > guest to change the MAC will require synchronization between the
> > hypervisor and the PF/VF drivers. Most of the VF drivers don't allow
> > changing the guest MAC unless it is a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see the host just failing VIRTIO_NET_CTRL_MAC_ADDR_SET
> if it wants to block it. I'm not sure why VIRTIO_NET_F_STANDBY has to
> block it in the guest.

That's why I think pairing using the MAC is fragile IMHO. When the VF's
MAC gets changed before virtio attempts to match and pair the device, it
ends up with no pairing found at all.

UUID is better.

-Siwei

> --
> MST
Re: [PATCH v2 0/5] Add virtio-iommu driver
On Thu, Jun 21, 2018 at 08:06:50PM +0100, Jean-Philippe Brucker wrote:
> Implement the base virtio-iommu driver, following version 0.7 of the
> specification [1].
>
> Changes since last version [2]:
> * Address comments, thanks again for the review.
> * As suggested, add a DT binding description in patch 1.
> * Depend on VIRTIO_MMIO=y to fix a build failure¹
> * Switch to v0.7 of the spec, which changes resv_mem parameters and adds
>   an MMIO flag. These are trivial but not backward compatible. Once
>   device or driver is upstream, updates to the spec will rely on feature
>   bits to stay compatible with this code.
> * Implement the new tlb_sync interface, by splitting add_req() and
>   sync_req(). I noticed a small improvement on netperf stream because
>   the synchronous iommu_unmap() also benefits from this. Other
>   experiments, such as using kmem_cache for requests instead of kmalloc,
>   didn't show any improvement.
>
> Driver is available on branch virtio-iommu/v0.7 [3], and the x86+ACPI
> prototype is on branch virtio-iommu/devel. That x86 code hasn't changed,
> it still requires the DMA ifdeffery and I lack the expertise to tidy it
> up. The kvmtool example device has been cleaned up and is available on
> branch virtio-iommu/v0.7 [4].
>
> Feedback welcome!
>
> Thanks,
> Jean

So as I pointed out, a new virtio 0 device isn't really welcome ;)

No one bothered implementing virtio 1 in MMIO for all the work that was
put into defining it. The fact that the MMIO part of the spec doesn't seem
to allow for transitional devices might or might not have something to do
with this.

So why make it an MMIO device at all? A system with an IOMMU will have a
PCI bus, won't it? And it's pretty common to have the IOMMU be a PCI
device on the root bus. That will also remove the need to bother with DT
bindings etc.

> [1] virtio-iommu specification v0.7
>     https://www.spinics.net/lists/linux-virtualization/msg34127.html
> [2] virtio-iommu driver v1
>     https://www.spinics.net/lists/kvm/msg164322.html
> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.7
>     http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/virtio-iommu/v0.7
> [4] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.7
>
> ---
> ¹ A word on the module story. Because of complex dependencies IOMMU
> drivers cannot yet be .ko modules. Enabling it is outside the scope of
> this series but I have a small prototype on branch virtio-iommu/
> module-devel. It seems desirable since some distros currently ship the
> transport code as a module and are unlikely to change this on our
> account. Note that this series works fine with arm64 defconfig, which
> selects VIRTIO_MMIO=y.
>
> I could use some help to clean this up. Currently my solution is to
> split virtio-iommu into a module and a 3-line built-in stub, which isn't
> graceful but could have been worse.
>
> Keeping the whole virtio-iommu driver as builtin would require accessing
> any virtio utility through get_symbol. So far we only have seven of
> those and could keep a list of pointer ops, but I find this solution
> terrible. If virtio or virtio_mmio is a module, virtio-iommu also needs
> to be one.
>
> The minimal set of changes to make a module out of an OF-based IOMMU
> driver seems to be:
> * Export IOMMU symbols used by drivers
> * Don't give up deferring probe in of_iommu
> * Move IOMMU OF tables to .rodata
> * Create a static virtio-iommu stub that declares the virtio-iommu OF
>   table entry. The build system doesn't pick up IOMMU_OF_DECLARE when
>   done from an object destined to be built as a module :(
> * Create a device_link between endpoint and IOMMU, to ensure that
>   removing the IOMMU driver also unbinds the endpoint driver. Putting
>   this in the IOMMU core seems like a better approach, since even when
>   not built as a module, unbinding an IOMMU device via sysfs will cause
>   crashes.
>
> With this, as long as virtio-mmio isn't loaded, the OF code defers probe
> of endpoints that depend on virtio-iommu. Once virtio-mmio is loaded,
> the probe is still deferred until virtio-iommu registers itself to the
> virtio bus. Once virtio-iommu is loaded, probe of endpoints managed by
> the IOMMU follows.
>
> I'll investigate ACPI IORT when I find some time, though I don't expect
> much complication and suggest we tackle one problem at a time. Since it
> is a luxury that requires changes to the IOMMU core, module support is
> left as a future improvement.
> ---
>
> Jean-Philippe Brucker (5):
>   dt-bindings: virtio: Specify #iommu-cells value for a virtio-iommu
>   iommu: Add virtio-iommu driver
>   iommu/virtio: Add probe request
>   iommu/virtio: Add event queue
>   vfio: Allow type-1 IOMMU instantiation for ARM
>
>  .../devicetree/bindings/virtio/mmio.txt |  8 +
>  MAINTAINERS                             |  6 +
>  drivers/iommu/Kconfig                   | 11 +
Re: [virtio-dev] [PATCH v2 3/5] iommu/virtio: Add probe request
On 21/06/18 20:06, Jean-Philippe Brucker wrote:
> +static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
> +			       struct virtio_iommu_probe_resv_mem *mem,
> +			       size_t len)
> +{
> +	struct iommu_resv_region *region = NULL;
> +	unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
> +
> +	u64 start = le64_to_cpu(mem->start);
> +	u64 end = le64_to_cpu(mem->end);
> +	size_t size = end - start + 1;
> +
> +	if (len < sizeof(*mem))
> +		return -EINVAL;
> +
> +	switch (mem->subtype) {
> +	default:
> +		if (mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_RESERVED &&
> +		    mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_MSI)

Hm, I messed it up while rebasing. I'll remove this condition.

Thanks,
Jean

> +			dev_warn(vdev->dev, "unknown resv mem subtype 0x%x\n",
> +				 mem->subtype);
> +		/* Fall-through */
> +	case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
> +		region = iommu_alloc_resv_region(start, size, 0,
> +						 IOMMU_RESV_RESERVED);
> +		break;
> +	case VIRTIO_IOMMU_RESV_MEM_T_MSI:
> +		region = iommu_alloc_resv_region(start, size, prot,
> +						 IOMMU_RESV_MSI);
> +		break;
> +	}
> +
> +	list_add(&vdev->resv_regions, &region->list);
> +	return 0;
> +}
Re: [PATCH v2 1/5] dt-bindings: virtio: Specify #iommu-cells value for a virtio-iommu
On 25/06/18 20:27, Rob Herring wrote:
> On Thu, Jun 21, 2018 at 08:06:51PM +0100, Jean-Philippe Brucker wrote:
> > A virtio-mmio node may represent a virtio-iommu device. This is
> > discovered by the virtio driver at probe time, but the DMA topology
> > isn't discoverable and must be described by firmware. For DT the
> > standard IOMMU description is used, as specified in
> > bindings/iommu/iommu.txt and bindings/pci/pci-iommu.txt. Like many
> > other IOMMUs, virtio-iommu distinguishes masters by their endpoint
> > IDs, which requires one IOMMU cell in the "iommus" property.
> >
> > Signed-off-by: Jean-Philippe Brucker
> > ---
> >  Documentation/devicetree/bindings/virtio/mmio.txt | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt
> > b/Documentation/devicetree/bindings/virtio/mmio.txt
> > index 5069c1b8e193..337da0e3a87f 100644
> > --- a/Documentation/devicetree/bindings/virtio/mmio.txt
> > +++ b/Documentation/devicetree/bindings/virtio/mmio.txt
> > @@ -8,6 +8,14 @@ Required properties:
> >  - reg:		control registers base address and size including
> >  		configuration space
> >  - interrupts:	interrupt generated by the device
> >
> > +Required properties for virtio-iommu:
> > +
> > +- #iommu-cells:	When the node describes a virtio-iommu device, it is
> > +		linked to DMA masters using the "iommus" property as
> > +		described in devicetree/bindings/iommu/iommu.txt. For
> > +		virtio-iommu #iommu-cells must be 1, each cell describing
> > +		a single endpoint ID.
>
> The iommus property should also be documented for the client side.

Isn't section "IOMMU master node" of iommu.txt sufficient? Since the
iommus property applies to any DMA master, not only virtio-mmio devices,
the canonical description in iommu.txt seems the best place for it, and
I'm not sure what to add in this file. Maybe a short example below the
virtio_block one?

Thanks,
Jean
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, Jun 26, 2018 at 05:08:13PM +0200, Cornelia Huck wrote:
> On Fri, 22 Jun 2018 17:05:04 -0700
> Siwei Liu wrote:
>
> > On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin wrote:
> > > I suspect the divergence will be lost on most users though simply
> > > because they don't even care about vfio. They just want things to
> > > go fast.
> >
> > Like Jason said, the VF isn't faster than virtio-net in all cases. It
> > depends on the workload and performance metrics: throughput, latency,
> > or packets per second.
>
> So, will it be guest/admin-controllable then where the traffic flows
> through? Just because we do have a vf available after negotiation of
> the feature bit, it does not necessarily mean we want to use it? Do we
> (the guest) even want to make it visible in that case?

I think these ideas belong to what Alex Duyck wanted to do: some kind of
advanced device that isn't tied to any network interfaces and allows
workload- and performance-specific tuning. Way out of scope for a simple
failover, and more importantly, no one is looking at even enumerating the
problems involved, much less solving them.

-- 
MST
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, Jun 26, 2018 at 06:03:16PM +0200, Cornelia Huck wrote:
> Ok, that makes me conclude that we definitely need to involve the
> libvirt folks before we proceed further with defining QEMU interfaces.

That's always a wise thing to do.

-- 
MST
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, 26 Jun 2018 18:38:51 +0300
"Michael S. Tsirkin" wrote:

> On Tue, Jun 26, 2018 at 05:17:32PM +0200, Cornelia Huck wrote:
> > On Tue, 26 Jun 2018 04:50:25 +0300
> > "Michael S. Tsirkin" wrote:
> >
> > > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > > Might not necessarily be something wrong, but it's very
> > > > > > > limited to prohibit the MAC of the VF from changing when
> > > > > > > enslaved by failover.
> > > > > > You mean the guest changing the MAC? I'm not sure why we
> > > > > > prohibit that.
> > > > > I think Sridhar and Jiri might be better people to answer it. My
> > > > > impression was that sync'ing the MAC address change between all
> > > > > 3 devices is challenging, as the failover driver uses the MAC
> > > > > address to match net_devices internally.
> > > >
> > > > Yes. The MAC address is assigned by the hypervisor and it needs to
> > > > manage the movement of the MAC between the PF and VF. Allowing the
> > > > guest to change the MAC will require synchronization between the
> > > > hypervisor and the PF/VF drivers. Most of the VF drivers don't
> > > > allow changing the guest MAC unless it is a trusted VF.
> > >
> > > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > > For example I can see the host just failing
> > > VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it. I'm not sure
> > > why VIRTIO_NET_F_STANDBY has to block it in the guest.
> >
> > So, what I get from this is that QEMU needs to be able to control all
> > of standby, uuid, and mac to accommodate the different setups
> > (respectively have libvirt/management software set it up). Is the host
> > able to find out respectively define whether a VF is trusted?
>
> You do it with ip link I think, but QEMU doesn't normally do this; it
> relies on libvirt to poke at the host kernel and supply the info.

Ok, that makes me conclude that we definitely need to involve the libvirt
folks before we proceed further with defining QEMU interfaces.
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, Jun 26, 2018 at 05:17:32PM +0200, Cornelia Huck wrote:
> On Tue, 26 Jun 2018 04:50:25 +0300
> "Michael S. Tsirkin" wrote:
>
> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > Might not necessarily be something wrong, but it's very limited
> > > > > > to prohibit the MAC of the VF from changing when enslaved by
> > > > > > failover.
> > > > > You mean the guest changing the MAC? I'm not sure why we prohibit
> > > > > that.
> > > > I think Sridhar and Jiri might be better people to answer it. My
> > > > impression was that sync'ing the MAC address change between all 3
> > > > devices is challenging, as the failover driver uses the MAC address
> > > > to match net_devices internally.
> > >
> > > Yes. The MAC address is assigned by the hypervisor and it needs to
> > > manage the movement of the MAC between the PF and VF. Allowing the
> > > guest to change the MAC will require synchronization between the
> > > hypervisor and the PF/VF drivers. Most of the VF drivers don't allow
> > > changing the guest MAC unless it is a trusted VF.
> >
> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > For example I can see the host just failing VIRTIO_NET_CTRL_MAC_ADDR_SET
> > if it wants to block it. I'm not sure why VIRTIO_NET_F_STANDBY has to
> > block it in the guest.
>
> So, what I get from this is that QEMU needs to be able to control all
> of standby, uuid, and mac to accommodate the different setups
> (respectively have libvirt/management software set it up). Is the host
> able to find out respectively define whether a VF is trusted?

You do it with ip link I think, but QEMU doesn't normally do this; it
relies on libvirt to poke at the host kernel and supply the info.

-- 
MST
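For concreteness, the "ip link" knob referred to here is iproute2's per-VF trust flag, set on the PF from the host. A host-side configuration sketch (the PF name and MAC are placeholders, and whether a guest-initiated MAC change is then honoured still depends on the VF driver):

```shell
# Host side: mark VF 0 on PF enp1s0f0 as trusted, so privileged
# operations such as a guest-initiated MAC change are allowed.
ip link set dev enp1s0f0 vf 0 trust on

# The host can also pin the VF MAC itself (administratively assigned):
ip link set dev enp1s0f0 vf 0 mac 52:54:00:12:34:56

# Inspect per-VF state (mac, trust, spoof checking) on the PF:
ip link show dev enp1s0f0
```

As the thread notes, management software such as libvirt would normally issue these on QEMU's behalf rather than QEMU doing it directly.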
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, 26 Jun 2018 04:50:25 +0300
"Michael S. Tsirkin" wrote:

> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > Might not necessarily be something wrong, but it's very limited
> > > > > to prohibit the MAC of the VF from changing when enslaved by
> > > > > failover.
> > > > You mean the guest changing the MAC? I'm not sure why we prohibit
> > > > that.
> > > I think Sridhar and Jiri might be better people to answer it. My
> > > impression was that sync'ing the MAC address change between all 3
> > > devices is challenging, as the failover driver uses the MAC address
> > > to match net_devices internally.
> >
> > Yes. The MAC address is assigned by the hypervisor and it needs to
> > manage the movement of the MAC between the PF and VF. Allowing the
> > guest to change the MAC will require synchronization between the
> > hypervisor and the PF/VF drivers. Most of the VF drivers don't allow
> > changing the guest MAC unless it is a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see the host just failing VIRTIO_NET_CTRL_MAC_ADDR_SET
> if it wants to block it. I'm not sure why VIRTIO_NET_F_STANDBY has to
> block it in the guest.

So, what I get from this is that QEMU needs to be able to control all of
standby, uuid, and mac to accommodate the different setups (respectively
have libvirt/management software set it up). Is the host able to find out
respectively define whether a VF is trusted?
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Fri, 22 Jun 2018 17:05:04 -0700 Siwei Liu wrote: > On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin wrote: > > I suspect the divergence will be lost on most users though > > simply because they don't even care about vfio. They just > > want things to go fast. > > Like Jason said, VF isn't faster than virtio-net in all cases. It > depends on the workload and performance metrics: throughput, latency, > or packet per second. So, will it be guest/admin-controllable then where the traffic flows through? Just because we do have a vf available after negotiation of the feature bit, it does not necessarily mean we want to use it? Do we (the guest) even want to make it visible in that case?
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, Jun 26, 2018 at 01:55:09PM +0200, Cornelia Huck wrote: > On Tue, 26 Jun 2018 04:46:03 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote: > > > On Fri, 22 Jun 2018 22:05:50 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote: > > > > > On Thu, 21 Jun 2018 21:20:13 +0300 > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote: > > > > > > > OK, so what about the following: > > > > > > > > > > > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that > > > > > > > indicates > > > > > > > that we have a new uuid field in the virtio-net config space > > > > > > > - in QEMU, add a property for virtio-net that allows to specify a > > > > > > > uuid, > > > > > > > offer VIRTIO_NET_F_STANDBY_UUID if set > > > > > > > - when configuring, set the property to the group UUID of the > > > > > > > vfio-pci > > > > > > > device > > > > > > > - in the guest, use the uuid from the virtio-net device's config > > > > > > > space > > > > > > > if applicable; else, fall back to matching by MAC as done today > > > > > > > > > > > > > > That should work for all virtio transports. > > > > > > > > > > > > True. I'm a bit unhappy that it's virtio net specific though > > > > > > since down the road I expect we'll have a very similar feature > > > > > > for scsi (and maybe others). > > > > > > > > > > > > But we do not have a way to have fields that are portable > > > > > > both across devices and transports, and I think it would > > > > > > be a useful addition. How would this work though? Any idea? > > > > > > > > > > Can we introduce some kind of device-independent config space area? > > > > > Pushing back the device-specific config space by a certain value if > > > > > the > > > > > appropriate feature is negotiated and use that for things like the > > > > > uuid? 
> > > > > > > > So config moves back and forth? > > > > Reminds me of the msi vector mess we had with pci. > > > > > > Yes, that would be a bit unfortunate. > > > > > > > I'd rather have every transport add a new config. > > > > > > You mean via different mechanisms? > > > > I guess so. > > Is there an alternate mechanism for pci to use? (Not so familiar with > it.) We have a device and transport config capability. We could add a generic config capability too. > For ccw, this needs more thought. We already introduced two commands > for reading/writing the config space (a concept that does not really > exist on s390). There's the generic read configuration data command, > but the data returned by it is not really generic enough. So we would > need one new command (or two, if we need to write as well). I'm not > sure about that yet. > > > > > > > > > > > > But regardless of that, I'm not sure whether extending this approach > > > > > to > > > > > other device types is the way to go. Tying together two different > > > > > devices is creating complicated situations at least in the hypervisor > > > > > (even if it's fairly straightforward in the guest). [I have not come > > > > > around again to look at the "how to handle visibility in QEMU" > > > > > questions due to lack of cycles, sorry about that.] > > > > > > > > > > So, what's the goal of this approach? Only to allow migration with > > > > > vfio-pci, or also to plug in a faster device and use it instead of an > > > > > already attached paravirtualized device? > > > > > > > > These are two sides of the same coin, I think the second approach > > > > is closer to what we are doing here. > > > > > > Thinking about it, do we need any knob to keep the vfio device > > > invisible if the virtio device is not present? IOW, how does the > > > hypervisor know that the vfio device is supposed to be paired with a > > > virtio device? It seems we need an explicit tie-in. 
> > If we are going the way of the bridge, both bridge and > > virtio would have some kind of id. > > So the presence of the id would indicate "this is one part of a pair"? I guess so, yes. > > > > When pairing using mac, I'm less sure. Pass vfio device mac to qemu > > as a property? > > That feels a bit odd. "This is the vfio device's mac, use this instead > of your usual mac property"? As we have not designed the QEMU interface > yet, just go with the id in any case? The guest can still match by mac. OK > > > > > What about migration of vfio devices that are not easily replaced by a > > > > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and > > > > > currently only) supported device is dasd (disks) -- which can do a lot > > > > > of specialized things that virtio-blk does not support (and should not > > > > > or even cannot support). > > > > But maybe virtio-scsi can? > > > > > > I don't think so. Dasds have some channel commands
Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote: > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote: > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote: > > > > > > > > > > > > > > > + if (!arrays) > > > > > + return NULL; > > > > > + > > > > > + for (i = 0; i < max_array_num; i++) { > > > > So we are getting a ton of memory here just to free it up a bit later. > > > > Why doesn't get_from_free_page_list get the pages from free list for us? > > > > We could also avoid the 1st allocation then - just build a list > > > > of these. > > > That wouldn't be a good choice for us. If we check how the regular > > > allocation works, there are many many things we need to consider when > > > pages > > > are allocated to users. > > > For example, we need to take care of the nr_free > > > counter, we need to check the watermark and perform the related actions. > > > Also the folks working on arch_alloc_page to monitor page allocation > > > activities would get a surprise... if page allocation is allowed to work in > > > this way. > > > > > mm/ code is well positioned to handle all this correctly. > > I'm afraid that would be a re-implementation of the alloc functions, A re-factoring - you can share code. The main difference is locking. > and > that would be much more complex than what we have. I think your idea of > passing a list of pages is better. > > Best, > Wei How much memory is this allocating anyway? -- MST
Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote: On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote: + if (!arrays) + return NULL; + + for (i = 0; i < max_array_num; i++) { So we are getting a ton of memory here just to free it up a bit later. Why doesn't get_from_free_page_list get the pages from free list for us? We could also avoid the 1st allocation then - just build a list of these. That wouldn't be a good choice for us. If we check how the regular allocation works, there are many many things we need to consider when pages are allocated to users. For example, we need to take care of the nr_free counter, we need to check the watermark and perform the related actions. Also the folks working on arch_alloc_page to monitor page allocation activities would get a surprise... if page allocation is allowed to work in this way. mm/ code is well positioned to handle all this correctly. I'm afraid that would be a re-implementation of the alloc functions, and that would be much more complex than what we have. I think your idea of passing a list of pages is better. Best, Wei
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Tue, 26 Jun 2018 04:46:03 +0300 "Michael S. Tsirkin" wrote: > On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote: > > On Fri, 22 Jun 2018 22:05:50 +0300 > > "Michael S. Tsirkin" wrote: > > > > > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote: > > > > On Thu, 21 Jun 2018 21:20:13 +0300 > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote: > > > > > > OK, so what about the following: > > > > > > > > > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that > > > > > > indicates > > > > > > that we have a new uuid field in the virtio-net config space > > > > > > - in QEMU, add a property for virtio-net that allows to specify a > > > > > > uuid, > > > > > > offer VIRTIO_NET_F_STANDBY_UUID if set > > > > > > - when configuring, set the property to the group UUID of the > > > > > > vfio-pci > > > > > > device > > > > > > - in the guest, use the uuid from the virtio-net device's config > > > > > > space > > > > > > if applicable; else, fall back to matching by MAC as done today > > > > > > > > > > > > That should work for all virtio transports. > > > > > > > > > > True. I'm a bit unhappy that it's virtio net specific though > > > > > since down the road I expect we'll have a very similar feature > > > > > for scsi (and maybe others). > > > > > > > > > > But we do not have a way to have fields that are portable > > > > > both across devices and transports, and I think it would > > > > > be a useful addition. How would this work though? Any idea? > > > > > > > > Can we introduce some kind of device-independent config space area? > > > > Pushing back the device-specific config space by a certain value if the > > > > appropriate feature is negotiated and use that for things like the > > > > uuid? > > > > > > So config moves back and forth? > > > Reminds me of the msi vector mess we had with pci. > > > > Yes, that would be a bit unfortunate. 
> > > > > I'd rather have every transport add a new config. > > > > You mean via different mechanisms? > > I guess so. Is there an alternate mechanism for pci to use? (Not so familiar with it.) For ccw, this needs more thought. We already introduced two commands for reading/writing the config space (a concept that does not really exist on s390). There's the generic read configuration data command, but the data returned by it is not really generic enough. So we would need one new command (or two, if we need to write as well). I'm not sure about that yet. > > > > > > > > But regardless of that, I'm not sure whether extending this approach to > > > > other device types is the way to go. Tying together two different > > > > devices is creating complicated situations at least in the hypervisor > > > > (even if it's fairly straightforward in the guest). [I have not come > > > > around again to look at the "how to handle visibility in QEMU" > > > > questions due to lack of cycles, sorry about that.] > > > > > > > > So, what's the goal of this approach? Only to allow migration with > > > > vfio-pci, or also to plug in a faster device and use it instead of an > > > > already attached paravirtualized device? > > > > > > These are two sides of the same coin, I think the second approach > > > is closer to what we are doing here. > > > > Thinking about it, do we need any knob to keep the vfio device > > invisible if the virtio device is not present? IOW, how does the > > hypervisor know that the vfio device is supposed to be paired with a > > virtio device? It seems we need an explicit tie-in. > > If we are going the way of the bridge, both bridge and > virtio would have some kind of id. So the presence of the id would indicate "this is one part of a pair"? > > When pairing using mac, I'm less sure. Pass vfio device mac to qemu > as a property? That feels a bit odd. "This is the vfio device's mac, use this instead of your usual mac property"? 
As we have not designed the QEMU interface yet, just go with the id in any case? The guest can still match by mac. > > > > What about migration of vfio devices that are not easily replaced by a > > > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and > > > > currently only) supported device is dasd (disks) -- which can do a lot > > > > of specialized things that virtio-blk does not support (and should not > > > > or even cannot support). > > > But maybe virtio-scsi can? > > > > I don't think so. Dasds have some channel commands that don't map > > easily to scsi commands. > > There's always a choice of adding these to the spec. > E.g. FC extensions were proposed, I don't remember why they > are still stuck. FC extensions are a completely different kind of enhancement, though. For a start, they are not unique to a certain transport. Also, we have a whole list of special dasd issues. Weird disk layout for eckd, low-level disk
Re: [PATCH v6 3/3] x86: paravirt: make native_save_fl extern inline
* Nick Desaulniers wrote:

> On Thu, Jun 21, 2018 at 7:24 PM Ingo Molnar wrote:
> > * Nick Desaulniers wrote:
> > > native_save_fl() is marked static inline, but by using it as
> > > a function pointer in arch/x86/kernel/paravirt.c, it MUST be outlined.
> > > --- a/arch/x86/include/asm/irqflags.h
> > > +++ b/arch/x86/include/asm/irqflags.h
> > > @@ -13,7 +13,7 @@
> > >   * Interrupt control:
> > >   */
> > >
> > > -static inline unsigned long native_save_fl(void)
> > > +extern inline unsigned long native_save_fl(void)
> > > {
> > >  unsigned long flags;
> >
> > What's the code generation effect of this on say a defconfig kernel vmlinux
> > with paravirt enabled?
>
> Starting with this patch set applied:
> $ make CC=gcc-8 -j46
> $ objdump -d vmlinux | grep native_save_fl --context=3
> 81059140 <native_save_fl>:
> 81059140: 9c       pushfq
> 81059141: 58       pop %rax
> 81059142: c3       retq
> $ git checkout HEAD~3
> $ make CC=gcc-8 -j46
> $ objdump -d vmlinux | grep native_save_fl --context=3
> 81079410 <native_save_fl>:
> 81079410: 9c       pushfq
> 81079411: 58       pop %rax
> 81079412: c3       retq
>
> Mainly, this is to prevent the compiler from adding a stack protector
> to the outlined version, as the stack protector clobbers %rcx, but
> paravirt expects %rcx to be preserved. More info can be found:
> https://lkml.org/lkml/2018/5/24/1242

Ok!

Acked-by: Ingo Molnar

What's the planned upstreaming route for these patches/fixes?

Thanks,

	Ingo