Re: [RFC PATCH v1] vsock: add mergeable rx buffer description
On 2021/6/11 2:39 AM, Jiang Wang wrote:

Mergeable rx buffer is already supported by virtio-net, and it can save memory for big packets. It will also be beneficial for the vsock devices, so add it to the spec.

A lot of duplication with the virtio-net mergeable rx buffer description. I wonder whether we can have a generic feature like VIRTIO_F_MRG_BUFFER instead.

---
V0 -> V1: I sent a similar patch with vsock dgram before and already got some comments. This version fixed those, such as using present tense for the feature bit, etc. Also, the feature bit value is 3, because we expect to have some other feature bits defined soon.

 virtio-vsock.tex | 78 +++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/virtio-vsock.tex b/virtio-vsock.tex
index da7e641..d529291 100644
--- a/virtio-vsock.tex
+++ b/virtio-vsock.tex
@@ -16,7 +16,9 @@ \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
 \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
-There are currently no feature bits defined for this device.
+\begin{description}
+\item[VIRTIO_VSOCK_F_MRG_RXBUF (3)] Driver can merge receive buffers.
+\end{description}
 \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
@@ -64,6 +66,8 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 Packets transmitted or received contain a header before the payload:
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following header.
+
 \begin{lstlisting}
 struct virtio_vsock_hdr {
 	le64 src_cid;
@@ -79,6 +83,15 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 };
 \end{lstlisting}
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, use the following header.
+\begin{lstlisting}
+struct virtio_vsock_hdr_mrg_rxbuf {
+	struct virtio_vsock_hdr hdr;
+	le16 num_buffers;
+};
+\end{lstlisting}
+
+
 The upper 32 bits of src_cid and dst_cid are reserved and zeroed.
Most packets simply transfer data but control packets are also used for
@@ -170,6 +183,23 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
 previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
 communicating updates any time a change in buffer space occurs.
+\drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+\begin{itemize}
+\item If VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, each buffer MUST be at
+least the size of the struct virtio_vsock_hdr_mrg_rxbuf.
+\end{itemize}
+
+\begin{note}
+Obviously each buffer can be split across multiple descriptor elements.
+\end{note}
+
+\devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+The device MUST set \field{num_buffers} to the number of descriptors used when
+transmitting the packet.

Interesting, does this mean it works for tx? Virtio-net had: "The driver MUST set num_buffers to zero."

+
+The device MUST use only a single descriptor if VIRTIO_VSOCK_F_MRG_RXBUF
+is not negotiated.

Though virtio-net uses something similar, I think we need to clarify: what we really need is "a single buffer", not "a single descriptor".

+
 \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
 VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
 sufficient free buffer space for the payload.
@@ -188,6 +218,7 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 The driver queues outgoing packets on the tx virtqueue and incoming
 packet receive buffers on the rx virtqueue. Packets are of the following form:
+If VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following.
\begin{lstlisting}
struct virtio_vsock_packet {
	struct virtio_vsock_hdr hdr;
@@ -195,9 +226,41 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 };
 \end{lstlisting}
+Otherwise, use the following form:
+\begin{lstlisting}
+struct virtio_vsock_packet_mrg_rxbuf {
+	struct virtio_vsock_hdr_mrg_rxbuf hdr;
+	u8 data[];
+};
+\end{lstlisting}
+
 Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers
 for incoming packets are write-only.
+When transmitting packets to the device, \field{num_buffers} is not used.
+
+\begin{enumerate}
+\item \field{num_buffers} indicates how many descriptors

Similarly, I think what we want here is "buffers" instead of "descriptors". Thanks.

+	this packet is spread over (including this one).
+	This is valid only if VIRTIO_VSOCK_F_MRG_RXBUF is negotiated.
+	This allows receipt of large packets
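The reassembly rule being discussed (only the first buffer of a packet carries the header; the payload spans `num_buffers` used buffers) can be modelled in a few lines of self-contained C. This is a sketch with hypothetical helper names, not spec or driver code; the 46-byte header size is derived from the `struct virtio_vsock_hdr` fields quoted above plus the `le16 num_buffers` field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of VIRTIO_VSOCK_F_MRG_RXBUF reassembly: only the
 * first buffer begins with struct virtio_vsock_hdr_mrg_rxbuf, and one
 * packet's payload spans num_buffers used buffers. */
#define VSOCK_HDR_MRG_LEN (44u + 2u) /* virtio_vsock_hdr (44 bytes) + le16 num_buffers */

/* Returns the total payload length of one packet, or -1 if fewer used
 * buffers are available than num_buffers claims. */
static long mrg_payload_len(const uint32_t *used_len, size_t nused,
                            uint16_t num_buffers)
{
    long total = 0;

    if (num_buffers == 0 || nused < num_buffers)
        return -1;

    for (uint16_t i = 0; i < num_buffers; i++)
        total += used_len[i];

    /* Only the first buffer carries the header; the rest are raw payload. */
    return total - (long)VSOCK_HDR_MRG_LEN;
}
```

This mirrors why the normative text needs each rx buffer to be at least the merged header size: the subtraction only makes sense if the first buffer can hold the whole header.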
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi Linus,

On 10-06-21, 22:46, Linus Walleij wrote:
> Hi Viresh!
>
> thanks for working on this, it's a really interesting driver.
>
> My first question is conceptual:
>
> We previously have Geert's driver for virtualization:
> drivers/gpio/gpio-aggregator.c
>
> The idea with the aggregator is that a host script sets up a
> unique gpiochip for the virtualized instance using some poking
> in sysfs and pass that to the virtual machine.
> So this is Linux acting as virtualization host by definition.
>
> I think virtio is more abstract and intended for the usecase
> where the hypervisor is not Linux, so this should be mentioned
> in the commit, possibly also in Kconfig so users immediately
> know what usecases the two different drivers are for.

Well, not actually. The host can actually be anything. It can be a Xen based dom0, which runs some proprietary firmware, or Qemu running over Linux. It is left for the host to decide how it wants to club together the GPIO pins from the host and access them; with a Linux host userspace it would be playing with /dev/gpiochipN, while for a raw one it may be accessing registers directly. And so the backend running at the host needs to pass the gpiochip configurations, and only the host understands them.

The way I test it for now is by running this with Qemu over my x86 box, so my host side is indeed playing with Linux sysfs.

> Possibly both could be used: aggregator to pick out the GPIOs
> you want into a synthetic GPIO chip, and the actual talk
> between the hypervisor/host and the guest using virtio, even
> with linux-on-linux.

Not sure if I understand the aggregator thing for now, but we see the backend running at the host (which talks to this Linux driver at the guest) as a userspace thing and not a kernel driver. Not sure if the aggregator can be used like that, but anyway..
> Yet another usecase would be to jit this with remoteproc/rpmsg
> and let a specific signal processor or real-time executive on
> another CPU with a few GPIOs around present these to
> Linux using this mechanism. Well that would certainly interest
> Bjorn and other rpmsg stakeholders, so they should have
> a look so that this provides what they need the day they
> need it. (CCed Bjorn and also Google who may want this for
> their Android emulators.)

I am not very clear on the rpmsg thing; I know a couple of folks at project Stratos were talking about it :)

@Alex, want to chime in here for me ? :)

> On Thu, Jun 10, 2021 at 2:16 PM Viresh Kumar wrote:
>
> > +static const char **parse_gpio_names(struct virtio_device *vdev,
> > +                                     struct virtio_gpio_config *config)
>
> I really like this end-to-end plug-and-play that even provides
> the names over virtio.

The credit goes to Enrico for this :)

> I think my patch to the gpiolib to make it mandatory for names to
> be unique per-chip made it in, but please make that part of the spec
> so that we don't get the problem with non-unique names here.

Oh, that's nice. I will surely do that.

> I suppose the spec can be augmented later to also accept config
> settings like open drain pull up/down etc but no need to specify
> more than the basic for now.

That's the plan.

> But to be able to add more in the future, the client needs some
> kind of query mechanism or version number so the driver can
> adapt and not announce something the underlying virtio device
> cannot do. Do we have this? A bitmask for features, a version
> number that increase monotonically for new features to be
> presented or similar?
>
> Because otherwise we have to bump this:
> +#define VIRTIO_ID_GPIO 41 /* virtio GPIO */
>
> every time we add something new (and we will).

Yes, Virtio presents features for this. Patch 2/3 already uses one for IRQs.
We won't need to bump up the IDs :) -- viresh ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
On 10-06-21, 19:40, Jean-Philippe Brucker wrote:
> On Thu, Jun 10, 2021 at 12:16:46PM +, Viresh Kumar via Stratos-dev wrote:

Fixed everything else you suggested.

> > +struct virtio_gpio_config {
> > + char name[32];
> > + __u16 ngpio;
> > + __u16 padding;
> > + __u32 gpio_names_size;
> > + char gpio_names[0];
>
> A variable-size array here will make it very difficult to append new
> fields to virtio_gpio_config, when adding new features to the device. (New
> fields cannot be inserted in between, since older drivers will expect
> existing fields at a specific offset.)

Yes, I thought about that earlier and thought maybe we would be able to play with that using the virtio-features; I mean, a different layout of the config structure if we really need to add a field in config, based on the feature flag.

> You could replace it with a reference to the string array, for example
> "__u16 gpio_names_offset" declaring the offset between the beginning of
> device-specific config and the string array.

But, I like this idea more and it does make it very flexible. Will adapt to it.

> The 'name' field could also be indirect to avoid setting a fixed
> 32-char size, but that's not as important.

Yeah, 32 bytes is really enough. One won't be able to make any sense out of a bigger name anyway :)

> > +} __packed;
>
> No need for __packed, because the fields are naturally aligned (as
> required by the virtio spec)

Yeah, I know, but I tend to add that for structures which aren't very simple (like the request/response ones), just to avoid human errors and the hours of debugging someone would need to go through. __packed won't harm at least :)

-- 
viresh
Re: [PATCH V3 2/3] gpio: virtio: Add IRQ support
Hi Viresh!

thanks for this interesting patch!

On Thu, Jun 10, 2021 at 2:16 PM Viresh Kumar wrote:
> This patch adds IRQ support for the virtio GPIO driver. Note that this
> uses the irq_bus_lock/unlock() callbacks since the operations over
> virtio can sleep.
>
> Signed-off-by: Viresh Kumar
> drivers/gpio/gpio-virtio.c | 256 ++-
> include/uapi/linux/virtio_gpio.h | 15 ++

You also need to select GPIOLIB_IRQCHIP in the Kconfig entry for the gpio-virtio driver, because you make use of it.

> +static void virtio_gpio_irq_mask(struct irq_data *d)
> +{
> + struct gpio_chip *gc = irq_data_to_gpio_chip(d);
> + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc);
> + struct vgpio_line *line = &vgpio->lines[d->hwirq];
> +
> + line->masked = true;
> + line->masked_pending = true;
> +}
> +
> +static void virtio_gpio_irq_unmask(struct irq_data *d)
> +{
> + struct gpio_chip *gc = irq_data_to_gpio_chip(d);
> + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc);
> + struct vgpio_line *line = &vgpio->lines[d->hwirq];
> +
> + line->masked = false;
> + line->masked_pending = true;
> +}

This looks dangerous in combination with this:

> +static void virtio_gpio_interrupt(struct virtqueue *vq)
> +{
(...)
> + local_irq_disable();
> + ret = generic_handle_irq(irq);
> + local_irq_enable();

Nominally slow IRQs like those being marshalled over virtio should be nested, i.e. handle_nested_irq(irq); but are they? Or are they just quite slow, not super slow?

If a threaded IRQF_ONESHOT was requested, the IRQ core will kick the thread and *MASK* this IRQ, which means it will call back to your .irq_mask() function and expect it to be masked from this point. But the IRQ will not actually be masked until you issue your callbacks in the .irq_bus_sync_unlock() callback, right? So from this point until .irq_bus_sync_unlock() gets called and actually masks the IRQ, it could be fired again? I suppose the IRQ handler is reentrant? This would violate the API.
I would say that from this point and until you sync you need a spinlock or other locking primitive to stop this IRQ from firing again, and a spinlock will imply local_irq_disable(), so this gets really complex.

I would say only use nesting IRQs or guarantee this some other way; one way would be to specify that whatever is at the other side of virtio cannot send another GPIO IRQ message before the last one is handled, so you would need to send a specific (new) VIRTIO_GPIO_REQ_IRQ_ACK after all other messages have been sent in .irq_bus_sync_unlock() so that the next GPIO IRQ can be dispatched after that.

(Is this how message signalled interrupts work? No idea. When in doubt ask the IRQ maintainers.)

Thanks,
Linus Walleij
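The window Linus describes, where .irq_mask() only records the request and the device-visible mask changes when .irq_bus_sync_unlock() flushes it over virtio, can be illustrated with a toy state model. This is purely illustrative C with made-up names, not kernel code, and it models the timing gap rather than any real driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the deferred-mask window: the IRQ core's view of the mask
 * and the device's view diverge until the bus-sync flush happens. */
struct vgpio_line_model {
    bool masked;          /* what the IRQ core believes after .irq_mask() */
    bool masked_pending;  /* change recorded but not yet sent over virtio */
    bool device_masked;   /* what the device actually enforces */
};

static void model_irq_mask(struct vgpio_line_model *l)
{
    l->masked = true;
    l->masked_pending = true;   /* deferred: no virtio message yet */
}

static void model_bus_sync_unlock(struct vgpio_line_model *l)
{
    if (l->masked_pending) {
        l->device_masked = l->masked;  /* the virtio request goes out here */
        l->masked_pending = false;
    }
}

/* The device can still deliver an interrupt while device_masked is false,
 * which is exactly the reentrancy hazard raised in the review. */
static bool model_irq_can_fire(const struct vgpio_line_model *l)
{
    return !l->device_masked;
}
```

Running the three steps in order shows the hazard: after model_irq_mask() the core believes the line is masked, yet model_irq_can_fire() still returns true until the sync runs.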
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi Viresh!

thanks for working on this, it's a really interesting driver.

My first question is conceptual:

We previously have Geert's driver for virtualization: drivers/gpio/gpio-aggregator.c

The idea with the aggregator is that a host script sets up a unique gpiochip for the virtualized instance using some poking in sysfs and passes that to the virtual machine. So this is Linux acting as virtualization host by definition.

I think virtio is more abstract and intended for the usecase where the hypervisor is not Linux, so this should be mentioned in the commit, possibly also in Kconfig, so users immediately know what usecases the two different drivers are for.

Possibly both could be used: aggregator to pick out the GPIOs you want into a synthetic GPIO chip, and the actual talk between the hypervisor/host and the guest using virtio, even with linux-on-linux.

Yet another usecase would be to jit this with remoteproc/rpmsg and let a specific signal processor or real-time executive on another CPU with a few GPIOs around present these to Linux using this mechanism. Well, that would certainly interest Bjorn and other rpmsg stakeholders, so they should have a look so that this provides what they need the day they need it. (CCed Bjorn and also Google who may want this for their Android emulators.)

On Thu, Jun 10, 2021 at 2:16 PM Viresh Kumar wrote:
> +static const char **parse_gpio_names(struct virtio_device *vdev,
> +                                     struct virtio_gpio_config *config)

I really like this end-to-end plug-and-play that even provides the names over virtio. I think my patch to the gpiolib to make it mandatory for names to be unique per-chip made it in, but please make that part of the spec so that we don't get the problem with non-unique names here.

I suppose the spec can be augmented later to also accept config settings like open drain, pull up/down, etc., but no need to specify more than the basics for now.
But to be able to add more in the future, the client needs some kind of query mechanism or version number so the driver can adapt and not announce something the underlying virtio device cannot do. Do we have this? A bitmask for features, a version number that increases monotonically for new features to be presented, or similar?

Because otherwise we have to bump this:
+#define VIRTIO_ID_GPIO 41 /* virtio GPIO */
every time we add something new (and we will).

Yours,
Linus Walleij
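The answer in the thread is that virtio already has exactly this: a feature bitmask negotiated between device and driver. The mechanism can be sketched as a simple bitmask intersection; the feature-bit name below is hypothetical, purely to illustrate why new capabilities do not require bumping the device ID.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of virtio feature negotiation (made-up bit number,
 * not from the virtio-gpio spec): the driver only operates with features
 * that both the device offers and the driver understands. */
#define VIRTIO_GPIO_F_IRQ_MODEL (UINT64_C(1) << 0)   /* hypothetical */

static uint64_t negotiate_features(uint64_t device_features,
                                   uint64_t driver_features)
{
    /* Anything the device offers but the driver doesn't know is dropped,
     * and vice versa, so old drivers keep working with new devices. */
    return device_features & driver_features;
}

static int feature_negotiated(uint64_t negotiated, uint64_t bit)
{
    return (negotiated & bit) == bit;
}
```

This is why adding, say, IRQ support is a new feature bit rather than a new VIRTIO_ID: an old driver simply never acknowledges the bit.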
RE: [PATCH 02/19] asm-generic/hyperv: convert hyperv statuses to strings
> -Original Message- > From: Nuno Das Neves > Sent: Friday, May 28, 2021 3:43 PM > To: linux-hyp...@vger.kernel.org; linux-ker...@vger.kernel.org > Cc: virtualization@lists.linux-foundation.org; Michael Kelley > ; virem...@linux.microsoft.com; Sunil > Muthuswamy ; wei@kernel.org; vkuznets > ; Lillian Grassin-Drake > ; KY Srinivasan > Subject: [PATCH 02/19] asm-generic/hyperv: convert hyperv statuses to strings > > Allow hyperv hypercall failures to be debugged more easily with dmesg. > This will be used in the mshv module. > > Signed-off-by: Nuno Das Neves > --- > arch/x86/hyperv/hv_init.c | 2 +- > arch/x86/hyperv/hv_proc.c | 10 +++--- > include/asm-generic/hyperv-tlfs.h | 52 ++- > include/asm-generic/mshyperv.h| 8 + > 4 files changed, 44 insertions(+), 28 deletions(-) > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c > index bb0ae4b5c00f..722bafdb2225 100644 > --- a/arch/x86/hyperv/hv_init.c > +++ b/arch/x86/hyperv/hv_init.c > @@ -349,7 +349,7 @@ static void __init hv_get_partition_id(void) > status = hv_do_hypercall(HVCALL_GET_PARTITION_ID, NULL, output_page); > if (!hv_result_success(status)) { > /* No point in proceeding if this failed */ > - pr_err("Failed to get partition ID: %lld\n", status); > + pr_err("Failed to get partition ID: %s\n", > hv_status_to_string(status)); > BUG(); > } > hv_current_partition_id = output_page->partition_id; > diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c > index 59cf9a9e0975..30951e778577 100644 > --- a/arch/x86/hyperv/hv_proc.c > +++ b/arch/x86/hyperv/hv_proc.c > @@ -117,7 +117,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 > num_pages) >page_count, 0, input_page, NULL); > local_irq_restore(flags); > if (!hv_result_success(status)) { > - pr_err("Failed to deposit pages: %lld\n", status); > + pr_err("Failed to deposit pages: %s\n", > hv_status_to_string(status)); > ret = hv_status_to_errno(status); > goto err_free_allocations; > } > @@ -172,8 +172,8 @@ int 
hv_call_add_logical_proc(int node, u32 lp_index, u32 > apic_id) > > if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) { > if (!hv_result_success(status)) { > - pr_err("%s: cpu %u apic ID %u, %lld\n", > __func__, > -lp_index, apic_id, status); > + pr_err("%s: cpu %u apic ID %u, %s\n", __func__, > +lp_index, apic_id, > hv_status_to_string(status)); > ret = hv_status_to_errno(status); > } > break; > @@ -222,8 +222,8 @@ int hv_call_create_vp(int node, u64 partition_id, u32 > vp_index, u32 flags) > > if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) { > if (!hv_result_success(status)) { > - pr_err("%s: vcpu %u, lp %u, %lld\n", __func__, > -vp_index, flags, status); > + pr_err("%s: vcpu %u, lp %u, %s\n", __func__, > +vp_index, flags, > hv_status_to_string(status)); > ret = hv_status_to_errno(status); > } > break; > diff --git a/include/asm-generic/hyperv-tlfs.h > b/include/asm-generic/hyperv-tlfs.h > index fe6d41d0b114..40ff7cdd4a2b 100644 > --- a/include/asm-generic/hyperv-tlfs.h > +++ b/include/asm-generic/hyperv-tlfs.h > @@ -189,28 +189,36 @@ enum HV_GENERIC_SET_FORMAT { > #define HV_HYPERCALL_REP_START_MASK GENMASK_ULL(59, 48) > > /* hypercall status code */ > -#define HV_STATUS_SUCCESS0x0 > -#define HV_STATUS_INVALID_HYPERCALL_CODE 0x2 > -#define HV_STATUS_INVALID_HYPERCALL_INPUT0x3 > -#define HV_STATUS_INVALID_ALIGNMENT 0x4 > -#define HV_STATUS_INVALID_PARAMETER 0x5 > -#define HV_STATUS_ACCESS_DENIED 0x6 > -#define HV_STATUS_INVALID_PARTITION_STATE0x7 > -#define HV_STATUS_OPERATION_DENIED 0x8 > -#define HV_STATUS_UNKNOWN_PROPERTY 0x9 > -#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE0xA > -#define HV_STATUS_INSUFFICIENT_MEMORY0xB > -#define HV_STATUS_INVALID_PARTITION_ID 0xD > -#define HV_STATUS_INVALID_VP_INDEX 0xE > -#define HV_STATUS_NOT_FOUND 0x10 > -#define HV_STATUS_INVALID_PORT_ID0x11 > -#define HV_STATUS_INVALID_CONNECTION_ID 0x12 > -#define HV_STATUS_INSUFFICIENT_BUFFERS 0x13 > -#define HV_STATUS_NOT_ACKNOWLEDGED 0x14 > -#define 
HV_STATUS_INVALID_VP_STATE 0x15 > -#define HV_STATUS_NO_RESOURC
[RFC PATCH v1] vsock: add mergeable rx buffer description
Mergeable rx buffer is already supported by virtio-net, and it can save memory for big packets. It will also be beneficial for the vsock devices, so add it to the spec.

---
V0 -> V1: I sent a similar patch with vsock dgram before and already got some comments. This version fixed those, such as using present tense for the feature bit, etc. Also, the feature bit value is 3, because we expect to have some other feature bits defined soon.

 virtio-vsock.tex | 78 +++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/virtio-vsock.tex b/virtio-vsock.tex
index da7e641..d529291 100644
--- a/virtio-vsock.tex
+++ b/virtio-vsock.tex
@@ -16,7 +16,9 @@ \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
 \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
-There are currently no feature bits defined for this device.
+\begin{description}
+\item[VIRTIO_VSOCK_F_MRG_RXBUF (3)] Driver can merge receive buffers.
+\end{description}
 \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
@@ -64,6 +66,8 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 Packets transmitted or received contain a header before the payload:
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following header.
+
 \begin{lstlisting}
 struct virtio_vsock_hdr {
 	le64 src_cid;
@@ -79,6 +83,15 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 };
 \end{lstlisting}
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, use the following header.
+\begin{lstlisting}
+struct virtio_vsock_hdr_mrg_rxbuf {
+	struct virtio_vsock_hdr hdr;
+	le16 num_buffers;
+};
+\end{lstlisting}
+
+
 The upper 32 bits of src_cid and dst_cid are reserved and zeroed.

Most packets simply transfer data but control packets are also used for
@@ -170,6 +183,23 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
 previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
 communicating updates any time a change in buffer space occurs.
+\drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+\begin{itemize}
+\item If VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, each buffer MUST be at
+least the size of the struct virtio_vsock_hdr_mrg_rxbuf.
+\end{itemize}
+
+\begin{note}
+Obviously each buffer can be split across multiple descriptor elements.
+\end{note}
+
+\devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+The device MUST set \field{num_buffers} to the number of descriptors used when
+transmitting the packet.
+
+The device MUST use only a single descriptor if VIRTIO_VSOCK_F_MRG_RXBUF
+is not negotiated.
+
 \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
 VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
 sufficient free buffer space for the payload.
@@ -188,6 +218,7 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 The driver queues outgoing packets on the tx virtqueue and incoming
 packet receive buffers on the rx virtqueue. Packets are of the following form:
+If VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following.

\begin{lstlisting}
struct virtio_vsock_packet {
	struct virtio_vsock_hdr hdr;
@@ -195,9 +226,41 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 };
 \end{lstlisting}
+Otherwise, use the following form:
+\begin{lstlisting}
+struct virtio_vsock_packet_mrg_rxbuf {
+	struct virtio_vsock_hdr_mrg_rxbuf hdr;
+	u8 data[];
+};
+\end{lstlisting}
+
 Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers
 for incoming packets are write-only.
+When transmitting packets to the device, \field{num_buffers} is not used.
+
+\begin{enumerate}
+\item \field{num_buffers} indicates how many descriptors
+	this packet is spread over (including this one).
+	This is valid only if VIRTIO_VSOCK_F_MRG_RXBUF is negotiated.
+	This allows receipt of large packets without having to allocate large
+	buffers: a packet that does not fit in a single buffer can flow
+	over to the next buffer, and so on. In this case, there will be
+	at least \field{num_buffers} used buffers in the virtqueue, and the device
+	chains them together to form a single packet in a way similar to
+	how it would store it in a single buffer spread over multiple
+	descriptors.
+	The other buffers will not begin with a struct virtio_vsock_hdr.
+
+	If VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, then only one
+	descriptor is used.
+
+\item If
+	\field{num_buffers} is one,
RE: [PATCH 01/19] x86/hyperv: convert hyperv statuses to linux error codes
> -Original Message- > From: Nuno Das Neves > Sent: Friday, May 28, 2021 3:43 PM > To: linux-hyp...@vger.kernel.org; linux-ker...@vger.kernel.org > Cc: virtualization@lists.linux-foundation.org; Michael Kelley > ; virem...@linux.microsoft.com; Sunil > Muthuswamy ; wei@kernel.org; vkuznets > ; Lillian Grassin-Drake > ; KY Srinivasan > Subject: [PATCH 01/19] x86/hyperv: convert hyperv statuses to linux error > codes > > Return linux-friendly error codes from hypercall wrapper functions. > This will be needed in the mshv module. > > Signed-off-by: Nuno Das Neves > --- > arch/x86/hyperv/hv_proc.c | 30 ++--- > arch/x86/include/asm/mshyperv.h | 1 + > include/asm-generic/hyperv-tlfs.h | 32 +-- > 3 files changed, 50 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c > index 68a0843d4750..59cf9a9e0975 100644 > --- a/arch/x86/hyperv/hv_proc.c > +++ b/arch/x86/hyperv/hv_proc.c > @@ -14,6 +14,30 @@ > > #include > > +int hv_status_to_errno(u64 hv_status) > +{ > + switch (hv_result(hv_status)) { > + case HV_STATUS_SUCCESS: > + return 0; > + case HV_STATUS_INVALID_PARAMETER: > + case HV_STATUS_UNKNOWN_PROPERTY: > + case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE: > + case HV_STATUS_INVALID_VP_INDEX: > + case HV_STATUS_INVALID_REGISTER_VALUE: > + case HV_STATUS_INVALID_LP_INDEX: > + return -EINVAL; > + case HV_STATUS_ACCESS_DENIED: > + case HV_STATUS_OPERATION_DENIED: > + return -EACCES; > + case HV_STATUS_NOT_ACKNOWLEDGED: > + case HV_STATUS_INVALID_VP_STATE: > + case HV_STATUS_INVALID_PARTITION_STATE: > + return -EBADFD; > + } > + return -ENOTRECOVERABLE; > +} > +EXPORT_SYMBOL_GPL(hv_status_to_errno); > + > /* > * See struct hv_deposit_memory. The first u64 is partition ID, the rest > * are GPAs. 
> @@ -94,7 +118,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 > num_pages) > local_irq_restore(flags); > if (!hv_result_success(status)) { > pr_err("Failed to deposit pages: %lld\n", status); > - ret = hv_result(status); > + ret = hv_status_to_errno(status); > goto err_free_allocations; > } > > @@ -150,7 +174,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 > apic_id) > if (!hv_result_success(status)) { > pr_err("%s: cpu %u apic ID %u, %lld\n", > __func__, > lp_index, apic_id, status); > - ret = hv_result(status); > + ret = hv_status_to_errno(status); > } > break; > } > @@ -200,7 +224,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 > vp_index, u32 flags) > if (!hv_result_success(status)) { > pr_err("%s: vcpu %u, lp %u, %lld\n", __func__, > vp_index, flags, status); > - ret = hv_result(status); > + ret = hv_status_to_errno(status); > } > break; > } > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h > index 67ff0d637e55..c6eb01f3864d 100644 > --- a/arch/x86/include/asm/mshyperv.h > +++ b/arch/x86/include/asm/mshyperv.h > @@ -169,6 +169,7 @@ int hyperv_flush_guest_mapping_range(u64 as, > int hyperv_fill_flush_guest_mapping_list( > struct hv_guest_mapping_flush_list *flush, > u64 start_gfn, u64 end_gfn); > +int hv_status_to_errno(u64 hv_status); > > extern bool hv_root_partition; > > diff --git a/include/asm-generic/hyperv-tlfs.h > b/include/asm-generic/hyperv-tlfs.h > index 515c3fb06ab3..fe6d41d0b114 100644 > --- a/include/asm-generic/hyperv-tlfs.h > +++ b/include/asm-generic/hyperv-tlfs.h > @@ -189,16 +189,28 @@ enum HV_GENERIC_SET_FORMAT { > #define HV_HYPERCALL_REP_START_MASK GENMASK_ULL(59, 48) > > /* hypercall status code */ > -#define HV_STATUS_SUCCESS0 > -#define HV_STATUS_INVALID_HYPERCALL_CODE 2 > -#define HV_STATUS_INVALID_HYPERCALL_INPUT3 > -#define HV_STATUS_INVALID_ALIGNMENT 4 > -#define HV_STATUS_INVALID_PARAMETER 5 > -#define HV_STATUS_OPERATION_DENIED 8 > -#define 
HV_STATUS_INSUFFICIENT_MEMORY11 > -#define HV_STATUS_INVALID_PORT_ID17 > -#define HV_STATUS_INVALID_CONNECTION_ID 18 > -#define HV_STATUS_INSUFFICIENT_BUFFERS 19 > +#define HV_STATUS_SUCCESS0x0 > +#define HV_STATUS_INVALID_HYPERCALL_CODE 0x2 > +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x3 > +#define HV_STATUS_INVALID_ALIGNMENT 0x4 > +#define HV_STATUS_INVALID_PARAMETER 0x5 > +#
[PATCH v5] virtio-vsock: add description for datagram type
Add supports for datagram type for virtio-vsock. Datagram sockets are connectionless and unreliable. To avoid contention with stream and other sockets, add two more virtqueues and a new feature bit to identify if those two new queues exist or not. Also add descriptions for resource management of datagram, which does not use the existing credit update mechanism associated with stream sockets. Signed-off-by: Jiang Wang --- V2: addressed the comments for the previous version. V3: add description for the mergeable receive buffer. V4: add a feature bit for stream and reserver a bit for seqpacket. Fix mrg_rxbuf related sentences. V5: removed mergeable rx buffer part. It will go to a separate patch. Fixed comments about tx, rx, feature bit etc. virtio-vsock.tex | 71 +++- 1 file changed, 60 insertions(+), 11 deletions(-) diff --git a/virtio-vsock.tex b/virtio-vsock.tex index da7e641..26a62ac 100644 --- a/virtio-vsock.tex +++ b/virtio-vsock.tex @@ -9,14 +9,37 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID} \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues} \begin{description} -\item[0] rx -\item[1] tx +\item[0] stream rx +\item[1] stream tx +\item[2] datagram rx +\item[3] datagram tx +\item[4] event +\end{description} +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_F_DRGAM is set. Otherwise, it +only uses 3 queues, as the following. + +\begin{description} +\item[0] stream rx +\item[1] stream tx \item[2] event \end{description} +When behavior differs between stream and datagram rx/tx virtqueues +their full names are used. Common behavior is simply described in +terms of rx/tx virtqueues and applies to both stream and datagram +virtqueues. + \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits} -There are currently no feature bits defined for this device. +\begin{description} +\item[VIRTIO_VSOCK_F_STREAM (0)] Device has support for stream socket type. 
+\end{description} + +\begin{description} +\item[VIRTIO_VSOCK_F_DGRAM (2)] Device has support for datagram socket type. +\end{description} + +If no feature bits are defined, assume device only supports stream socket type. \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout} @@ -107,6 +130,9 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device / Device Operation / Virtqueue Flow Control} +Flow control applies to stream sockets; datagram sockets do not have +flow control. + The tx virtqueue carries packets initiated by applications and replies to received packets. The rx virtqueue carries packets initiated by the device and replies to previously transmitted packets. @@ -140,12 +166,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera consists of a (cid, port number) tuple. The header fields used for this are \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}. -Currently only stream sockets are supported. \field{type} is 1 for stream -socket types. +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream +socket types. \field{type} is 3 for dgram socket types. Stream sockets provide in-order, guaranteed, connection-oriented delivery without message boundaries. +Datagram sockets provide unordered, unreliable, connectionless messages +with message boundaries and a maximum length. + \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management} \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of stream sockets. 
The guest and the device publish how much buffer space is @@ -162,7 +191,7 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / u32 peer_free = peer_buf_alloc - (tx_cnt - peer_fwd_cnt); \end{lstlisting} -If there is insufficient buffer space, the sender waits until virtqueue buffers +For stream sockets, if there is insufficient buffer space, the sender waits until virtqueue buffers are returned and checks \field{buf_alloc} and \field{fwd_cnt} again. Sending the VIRTIO_VSOCK_OP_CREDIT_REQUEST packet queries how much buffer space is available. The reply to this query is a VIRTIO_VSOCK_OP_CREDIT_UPDATE packet. @@ -170,22 +199,33 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows communicating updates any time a change in buffer space occurs. +Unlike stream sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram buffer management +is split into two parts: sender side and receiver side. For the sender side, if the +v
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi, Not being very familiar with GPIO, I just have a few general comments and one on the config space layout On Thu, Jun 10, 2021 at 12:16:46PM +, Viresh Kumar via Stratos-dev wrote: > +static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, > +u8 txdata, u8 *rxdata) > +{ > + struct virtio_gpio_response *res = &vgpio->cres; > + struct virtio_gpio_request *req = &vgpio->creq; > + struct scatterlist *sgs[2], req_sg, res_sg; > + struct device *dev = &vgpio->vdev->dev; > + unsigned long time_left; > + unsigned int len; > + int ret; > + > + req->type = cpu_to_le16(type); > + req->gpio = cpu_to_le16(gpio); > + req->data = txdata; > + > + sg_init_one(&req_sg, req, sizeof(*req)); > + sg_init_one(&res_sg, res, sizeof(*res)); > + sgs[0] = &req_sg; > + sgs[1] = &res_sg; > + > + mutex_lock(&vgpio->lock); > + ret = virtqueue_add_sgs(vgpio->command_vq, sgs, 1, 1, res, GFP_KERNEL); > + if (ret) { > + dev_err(dev, "failed to add request to vq\n"); > + goto out; > + } > + > + reinit_completion(&vgpio->completion); > + virtqueue_kick(vgpio->command_vq); > + > + time_left = wait_for_completion_timeout(&vgpio->completion, HZ / 10); > + if (!time_left) { > + dev_err(dev, "virtio GPIO backend timeout\n"); > + return -ETIMEDOUT; mutex is still held > + } > + > + WARN_ON(res != virtqueue_get_buf(vgpio->command_vq, &len)); > + if (unlikely(res->status != VIRTIO_GPIO_STATUS_OK)) { > + dev_err(dev, "virtio GPIO request failed: %d\n", gpio); > + return -EINVAL; and here > + } > + > + if (rxdata) > + *rxdata = res->data; > + > +out: > + mutex_unlock(&vgpio->lock); > + > + return ret; > +} > + > +static int virtio_gpio_request(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_ACTIVATE, gpio, 0, NULL); > +} > + > +static void virtio_gpio_free(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + virtio_gpio_req(vgpio, 
VIRTIO_GPIO_REQ_DEACTIVATE, gpio, 0, NULL); > +} > + > +static int virtio_gpio_get_direction(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + u8 direction; > + int ret; > + > + ret = virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_GET_DIRECTION, gpio, 0, > + &direction); > + if (ret) > + return ret; > + > + return direction; > +} > + > +static int virtio_gpio_direction_input(struct gpio_chip *gc, unsigned int > gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_DIRECTION_IN, gpio, 0, > +NULL); > +} > + > +static int virtio_gpio_direction_output(struct gpio_chip *gc, unsigned int > gpio, > + int value) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_DIRECTION_OUT, gpio, (u8) (that dangling cast looks a bit odd to me) > +value, NULL); > +} > + > +static int virtio_gpio_get(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + u8 value; > + int ret; > + > + ret = virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_GET_VALUE, gpio, 0, > &value); > + if (ret) > + return ret; > + > + return value; > +} > + > +static void virtio_gpio_set(struct gpio_chip *gc, unsigned int gpio, int > value) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_SET_VALUE, gpio, value, NULL); > +} > + > +static void virtio_gpio_command(struct virtqueue *vq) > +{ > + struct virtio_gpio *vgpio = vq->vdev->priv; > + > + complete(&vgpio->completion); > +} > + > +static int virtio_gpio_alloc_vqs(struct virtio_device *vdev) > +{ > + struct virtio_gpio *vgpio = vdev->priv; > + const char * const names[] = { "command" }; > + vq_callback_t *cbs[] = { > + virtio_gpio_command, > + }; > + struct virtqueue *vqs[1] = {NULL}; > + int ret; > + > + ret = virtio_find_vqs(vdev, 1, vqs, cbs, names, NULL); > + if (ret) { > + 
dev_err(&vdev->dev, "failed to allocate vqs: %d\n", ret); > + return ret; > + } > + > + vgpio->command_vq = vqs[0]; > + > + /* Mark the device ready to perform operations from within probe() */ > + virtio_device_ready(vgpio->vdev); May fit better in the parent function > + return 0; > +} > + > +static void virtio_gpio_free_vqs(struct virtio_device *vdev) > +{ > + vdev->config->reset(vdev); > + vdev->config->del_vqs(vdev); > +} > + > +
Re: [Stratos-dev] [PATCH V3 1/3] gpio: Add virtio-gpio driver
On Thu, Jun 10, 2021 at 04:00:39PM +, Enrico Weigelt, metux IT consult via Stratos-dev wrote: > On 10.06.21 15:22, Arnd Bergmann wrote: > > > Can you give an example of how this would be hooked up to other drivers > > using those gpios. Can you give an example of how using the "gpio-keys" or > > "gpio-leds" drivers in combination with virtio-gpio looks like in the DT? > > Connecting between self-probing bus'es and DT is generally tricky. IMHO > we don't have any generic mechanism for that. DT does have a generic description of PCI endpoints, which virtio-iommu relies on to express the relation between IOMMU and endpoint nodes [1]. I think the problem here is similar: the client node needs a phandle to the GPIO controller which may use virtio-pci transport? Note that it mostly works if the device is on the root PCI bus. Behind a bridge the OS may change the device's bus number as needed, so the BDF reference in DT is only valid if the software providing the DT description (VMM or firmware) initializes bus numbers accordingly (and I don't remember if Linux supports this case well). Thanks, Jean [1] Documentation/devicetree/bindings/virtio/iommu.txt > > I've made a few attempts, but nothing practically useful, which would be > accepted by the corresponding maintainers, yet. We'd either need some > very special logic in DT probing or pseudo-bus'es for the mapping. > (DT wants to do those connections via phandle's, which in turn need the > referenced nodes to be present in the DT). > > > From what I can tell, both the mmio and pci variants of virtio can have > > their > > dev->of_node populated, but I don't see the logic in > > register_virtio_device() > > that looks up the of_node of the virtio_device that the of_gpio code then > > tries to refer to. > > Have you ever successfully bound a virtio device via DT ? > > > --mtx > > -- > --- > Note: unencrypted e-mails can easily be intercepted and manipulated! 
For confidential communication, please send your > GPG/PGP key. > --- > Enrico Weigelt, metux IT consult > Free software and Linux embedded engineering > i...@metux.net -- +49-151-27565287 > -- > Stratos-dev mailing list > stratos-...@op-lists.linaro.org > https://op-lists.linaro.org/mailman/listinfo/stratos-dev ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi Enrico, On Thu, 10 Jun 2021 at 21:24, Enrico Weigelt, metux IT consult wrote: > On 10.06.21 14:16, Viresh Kumar wrote: > > From: "Enrico Weigelt, metux IT consult" > > > > This patch adds a new driver for Virtio based GPIO devices. > > > > This allows a guest VM running Linux to access GPIO device provided by > > the host. It supports all basic operations for now, except interrupts > > for the GPIO lines. > > What exactly did you change here from my original driver ? I didn't write it from scratch and used your patch only to start with, and so you are still the author of this particular patch. This just followed the specification updates and so changes only the stuff that was different from your original specs. Apart from that, as you know, the irqs weren't working in your case as they needed to be implemented differently (patch 2 does that) here. > You already changed the spec heavily (w/o consulting me first), Isn't that what we are doing on the list? I believe that's why the lists exist, so people don't need to discuss in person, offline. I am happy to make changes in the spec if you want to suggest something and if something breaks it for you. > so I guess this driver hasn't so much in common w/ my original design. It actually has, as I said earlier; it is still based on your patch. > Note that I made my original design decisions for good reasons > (see virtio-dev list). I tried to follow your patches from December on the Linux kernel list and the ID allocation one on the virtio-comments list. I wasn't able to search for any other attempt at sending the specification, so may have missed something. I do understand that there were reasons why you (and I) chose a particular way of doing things and if there is a good reason for following that, then we can still modify the spec and fix things for everyone. 
We just need to discuss our reasoning on the list and get through with a specification which works for everyone, as this will become a standard interface for many in the future and it needs to be robust and efficient for everyone. > It already supports async notifications > (IOW: irqs), had an easy upgrade path for future additions, etc. I haven't changed the irq path, we still have a separate virtqueue (named "interrupt" vq) which handles just the interrupts now. And so it will be faster than what you initially suggested. In your original design you also received the responses for the requests on this virtqueue, which I have changed to get the response synchronously on the "command" virtqueue only. This is from one of your earlier replies: " I've been under the impression that queues only work in only one direction. (at least that's what my web research was telling). Could you please give an example how bi-directional transmission within the same queue could look like ? " It is actually possible and the right thing to do here as we aren't starting multiple requests together. The operation needs to be synchronous anyway and both request/response on the same channel work better there. Also that makes the interrupts reach faster, without additional delay due to responses. > Note #2: it's already implemented and running in the field (different > kernels, different hypervisors, ...) - it just lacked going through the > virtio committee's formal specification process, which blocked mainlining. I understand the pain you would be required to go through because of this, but this is how any open source community will work, like Linux. There will be changes in specification and code once you post it, and any solutions already working in the field won't really matter during the discussion. That is why it is always the right thing to _upstream first_, so one can avoid these problems later on. Though I understand that the real world needs to move faster than the community. 
But no one can really help with that. > Is there anything fundamentally wrong w/ my original design, why you > invented a completely new one ? Not much, and I haven't changed a lot either. Let me point out a few things which have changed in the specification since your earlier version (the code just followed the specification, that's it). - config structure - version information is replaced with virtio-features. - Irq support is added as a feature, as you suggested. - extra padding space (24 bytes) removed. If you see this patch, we can now read the entire config structure in a single go. Why should anyone be required to copy an extra 24 bytes? If we need more fields later, we can always do that with the help of another feature flag. So this is still extendable. - Virtqueues: we still have two virtqueues (command and interrupt), just responses are moved to the command virtqueue only, as that is more efficient. - Request/response: Request layout is the same; added a new layout for response as sharing the same layout is inefficient. - IRQ support: Initial specification had no interface for configuring irqs from the guest, I added that. That's all I
Re: Re: [RFC v1 0/6] virtio/vsock: introduce SOCK_DGRAM support
On Thu, Jun 10, 2021 at 2:52 AM Stefano Garzarella wrote: > > On Thu, Jun 10, 2021 at 03:46:55PM +0800, Jason Wang wrote: > > > >On 2021/6/10 3:23 PM, Stefano Garzarella wrote: > >>On Thu, Jun 10, 2021 at 12:02:35PM +0800, Jason Wang wrote: > >>> > >>>On 2021/6/10 11:43 AM, Jiang Wang . wrote: > On Wed, Jun 9, 2021 at 6:51 PM Jason Wang wrote: > > > >On 2021/6/10 7:24 AM, Jiang Wang wrote: > >>This patchset implements support of SOCK_DGRAM for virtio > >>transport. > >> > >>Datagram sockets are connectionless and unreliable. To avoid > >>unfair contention > >>with stream and other sockets, add two more virtqueues and > >>a new feature bit to indicate if those two new queues exist or not. > >> > >>Dgram does not use the existing credit update mechanism for > >>stream sockets. When sending from the guest/driver, packets are sent > >>synchronously, so the sender will get an error when the > >>virtqueue is full. > >>When sending from the host/device, packets are sent asynchronously > >>because the descriptor memory belongs to the corresponding QEMU > >>process. > > > >What's the use case for the datagram vsock? > > > One use case is for non-critical info logging from the guest > to the host, such as the performance data of some applications. > >>> > >>> > >>>Anything that prevents you from using the stream socket? > >>> > >>> > > It can also be used to replace UDP communications between > the guest and the host. > >>> > >>> > >>>Any advantage for VSOCK in this case? Is it for performance (I > >>>guess not since I don't expect vsock will be faster). > >> > >>I think the general advantage to using vsock is for the guest > >>agents that potentially don't need any configuration. > > > > > >Right, I wonder if we really need datagram considering the host to guest > >communication is reliable. 
> > > >(Note that I don't object it since vsock has already supported that, > >just wonder its use cases) > > Yep, it was the same concern I had :-) > Also because we're now adding SEQPACKET, which provides reliable > datagram support. > > But IIUC the use case is the logging where you don't need reliable > communication and you want to avoid keeping more open connections with > different guests. > > So the server in the host can be pretty simple and doesn't have to > handle connections. It just waits for datagrams on a port. Yes. With datagram sockets, the application code is simpler than with stream sockets. Also, it will be easier to port existing applications written for dgram, such as UDP or UNIX domain sockets with datagram types, to the vsock dgram sockets. Compared to UDP, the vsock dgram needs minimal configuration. When sending data from the guest to the host, the client in the guest knows the host CID will always be 2. For UDP, the host IP may change depending on the configuration. The advantage over UNIX domain sockets is more obvious. We have some applications talking to each other with UNIX domain sockets, but now the applications are running inside VMs, so we will need to use vsock, and those applications use datagram types, so it is natural and simpler if vsock has datagram types too. And we can also run applications written for VMware vsock dgram on QEMU directly. btw, SEQPACKET also supports datagrams, but the application code logic is similar to stream sockets and the server needs to maintain connections. > > > > > >> > >>> > >>>An obvious drawback is that it breaks the migration. Using UDP you > >>>can have very rich feature support from the kernel where vsock > >>>can't. > >>> > >> > >>Thanks for bringing this up! > >>What features does UDP support that datagram on vsock could not? > > > > > >E.g. the sendpage() and busy polling. And using UDP means qdiscs and > >eBPF can work. > > Thanks, I see! 
> > > > > > >> > >>> > > >>The virtio spec patch is here: > >>https://www.spinics.net/lists/linux-virtualization/msg50027.html > > > >Have a quick glance, I suggest to split mergeable rx buffer into an > >separate patch. > Sure. > > >But I think it's time to revisit the idea of unifying the > >virtio-net and > >virtio-vsock. Otherwise we're duplicating features and bugs. > For mergeable rxbuf related code, I think a set of common helper > functions can be used by both virtio-net and virtio-vsock. For other > parts, that may not be very beneficial. I will think about more. > > If there is a previous email discussion about this topic, could > you send me > some links? I did a quick web search but did not find any related > info. Thanks. > >>> > >>> > >>>We had a lot: > >>> > >>>[1] > >>>https://patchwork.kernel.org/project/kvm/patch/5bdff537.3050...@huawei.com/ > >>>[2] > >>>https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039798.html > >>>[3] https://www.lkml.org/lkml/2020/1/16/2043 > >>> Got it. I will check, thanks. >
Re: [PATCH 06/17] mshv: SynIC port and connection hypercalls
Vineeth Pillai writes: > Hyper-V enables inter-partition communication through the port and > connection constructs. More details about ports and connections are in > TLFS chapter 11. > > Implement hypercalls related to ports and connections for enabling > inter-partition communication. > > Signed-off-by: Vineeth Pillai > --- > drivers/hv/hv_call.c | 161 + > drivers/hv/mshv.h | 12 ++ > include/asm-generic/hyperv-tlfs.h | 55 + > include/linux/hyperv.h | 9 -- > include/uapi/asm-generic/hyperv-tlfs.h | 76 > 5 files changed, 304 insertions(+), 9 deletions(-) > > diff --git a/drivers/hv/hv_call.c b/drivers/hv/hv_call.c > index 025d4e2b892f..57db3a8ac94a 100644 > --- a/drivers/hv/hv_call.c > +++ b/drivers/hv/hv_call.c > @@ -742,3 +742,164 @@ int hv_call_translate_virtual_address( > return hv_status_to_errno(status); > } > > + > +int > +hv_call_create_port(u64 port_partition_id, union hv_port_id port_id, > + u64 connection_partition_id, > + struct hv_port_info *port_info, > + u8 port_vtl, u8 min_connection_vtl, int node) > +{ > + struct hv_create_port *input; > + unsigned long flags; > + int ret = 0; > + int status; > + > + do { > + local_irq_save(flags); > + input = (struct hv_create_port *)(*this_cpu_ptr( > + hyperv_pcpu_input_arg)); > + memset(input, 0, sizeof(*input)); > + > + input->port_partition_id = port_partition_id; > + input->port_id = port_id; > + input->connection_partition_id = connection_partition_id; > + input->port_info = *port_info; > + input->port_vtl = port_vtl; > + input->min_connection_vtl = min_connection_vtl; > + input->proximity_domain_info = > + numa_node_to_proximity_domain_info(node); > + status = hv_do_hypercall(HVCALL_CREATE_PORT, input, > + NULL) & HV_HYPERCALL_RESULT_MASK; > + local_irq_restore(flags); > + if (status == HV_STATUS_SUCCESS) > + break; > + > + if (status != HV_STATUS_INSUFFICIENT_MEMORY) { > + pr_err("%s: %s\n", > +__func__, hv_status_to_string(status)); > + ret = -hv_status_to_errno(status); In Nuno's "x86/hyperv: convert hyperv 
statuses to linux error codes" patch, hv_status_to_errno() already returns negatives: +int hv_status_to_errno(u64 hv_status) +{ + switch (hv_result(hv_status)) { + case HV_STATUS_SUCCESS: + return 0; + case HV_STATUS_INVALID_PARAMETER: + case HV_STATUS_UNKNOWN_PROPERTY: + case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE: + case HV_STATUS_INVALID_VP_INDEX: + case HV_STATUS_INVALID_REGISTER_VALUE: + case HV_STATUS_INVALID_LP_INDEX: + return -EINVAL; + case HV_STATUS_ACCESS_DENIED: + case HV_STATUS_OPERATION_DENIED: + return -EACCES; + case HV_STATUS_NOT_ACKNOWLEDGED: + case HV_STATUS_INVALID_VP_STATE: + case HV_STATUS_INVALID_PARTITION_STATE: + return -EBADFD; + } + return -ENOTRECOVERABLE; +} +EXPORT_SYMBOL_GPL(hv_status_to_errno); + > + break; > + } > + ret = hv_call_deposit_pages(NUMA_NO_NODE, > + port_partition_id, 1); > + > + } while (!ret); > + > + return ret; > +} > + > +int > +hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id) > +{ > + union hv_delete_port input = { 0 }; > + unsigned long flags; > + int status; > + > + local_irq_save(flags); > + input.port_partition_id = port_partition_id; > + input.port_id = port_id; > + status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT, > + input.as_uint64[0], > + input.as_uint64[1]) & > + HV_HYPERCALL_RESULT_MASK; > + local_irq_restore(flags); > + > + if (status != HV_STATUS_SUCCESS) { > + pr_err("%s: %s\n", __func__, hv_status_to_string(status)); > + return -hv_status_to_errno(status); > + } > + > + return 0; > +} > + > +int > +hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id, > + u64 connection_partition_id, > + union hv_connection_id connection_id, > + struct hv_connection_info *connection_info, > + u8 connection_vtl, int node) > +{ > + struct hv_connect_port *input; > + unsigned long flags; > + int ret = 0, status; > + > + do { > + local_irq_save(flags); > + input = (struct hv_connect_port *)(*this_cpu_ptr( > + hyperv_pcpu_input
[PATCH V3 2/3] gpio: virtio: Add IRQ support
This patch adds IRQ support for the virtio GPIO driver. Note that this uses the irq_bus_lock/unlock() callbacks since the operations over virtio can sleep. Signed-off-by: Viresh Kumar --- drivers/gpio/gpio-virtio.c | 256 ++- include/uapi/linux/virtio_gpio.h | 15 ++ 2 files changed, 263 insertions(+), 8 deletions(-) diff --git a/drivers/gpio/gpio-virtio.c b/drivers/gpio/gpio-virtio.c index 1ba8ddf4693a..8bc4b1876961 100644 --- a/drivers/gpio/gpio-virtio.c +++ b/drivers/gpio/gpio-virtio.c @@ -20,6 +20,13 @@ #include #include +struct vgpio_line { + u8 irq_type; + bool irq_type_pending; + bool masked; + bool masked_pending; +}; + struct virtio_gpio { struct virtio_device *vdev; struct mutex lock; @@ -30,14 +37,20 @@ struct virtio_gpio { struct virtio_gpio_request creq; struct virtio_gpio_response cres; + struct irq_chip *ic; + struct vgpio_line *lines; + struct virtqueue *interrupt_vq; + struct virtio_gpio_request ireq; + /* This must be kept as the last entry here, hint: gpio_names[0] */ struct virtio_gpio_config config; }; #define gpio_chip_to_vgpio(chip) container_of(chip, struct virtio_gpio, gc) +#define irq_data_to_gpio_chip(d) (d->domain->host_data) -static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, - u8 txdata, u8 *rxdata) +static int virtio_gpio_req_unlocked(struct virtio_gpio *vgpio, u16 type, + u16 gpio, u8 txdata, u8 *rxdata) { struct virtio_gpio_response *res = &vgpio->cres; struct virtio_gpio_request *req = &vgpio->creq; @@ -56,11 +69,10 @@ static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, sgs[0] = &req_sg; sgs[1] = &res_sg; - mutex_lock(&vgpio->lock); ret = virtqueue_add_sgs(vgpio->command_vq, sgs, 1, 1, res, GFP_KERNEL); if (ret) { dev_err(dev, "failed to add request to vq\n"); - goto out; + return ret; } reinit_completion(&vgpio->completion); @@ -81,7 +93,16 @@ static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, if (rxdata) *rxdata = res->data; -out: + return ret; +} + +static int 
virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, + u8 txdata, u8 *rxdata) +{ + int ret; + + mutex_lock(&vgpio->lock); + ret = virtio_gpio_req_unlocked(vgpio, type, gpio, txdata, rxdata); mutex_unlock(&vgpio->lock); return ret; @@ -152,6 +173,183 @@ static void virtio_gpio_set(struct gpio_chip *gc, unsigned int gpio, int value) virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_SET_VALUE, gpio, value, NULL); } +/* Interrupt handling */ +static void vgpio_irq_mask(struct virtio_gpio *vgpio, u16 gpio) +{ + int ret; + + ret = virtio_gpio_req_unlocked(vgpio, VIRTIO_GPIO_REQ_IRQ_MASK, gpio, 0, + NULL); + if (ret) + dev_err(&vgpio->vdev->dev, "failed to mask irq: %d\n", ret); +} + +static void vgpio_irq_unmask(struct virtio_gpio *vgpio, u16 gpio) +{ + int ret; + + ret = virtio_gpio_req_unlocked(vgpio, VIRTIO_GPIO_REQ_IRQ_UNMASK, gpio, + 0, NULL); + if (ret) + dev_err(&vgpio->vdev->dev, "failed to unmask irq: %d\n", ret); +} + +static void vgpio_irq_set_type(struct virtio_gpio *vgpio, u16 gpio, u8 type) +{ + int ret; + + ret = virtio_gpio_req_unlocked(vgpio, VIRTIO_GPIO_REQ_IRQ_TYPE, gpio, + type, NULL); + if (ret) + dev_err(&vgpio->vdev->dev, "failed to set irq type: %d\n", ret); +} + +static void virtio_gpio_irq_mask(struct irq_data *d) +{ + struct gpio_chip *gc = irq_data_to_gpio_chip(d); + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + struct vgpio_line *line = &vgpio->lines[d->hwirq]; + + line->masked = true; + line->masked_pending = true; +} + +static void virtio_gpio_irq_unmask(struct irq_data *d) +{ + struct gpio_chip *gc = irq_data_to_gpio_chip(d); + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + struct vgpio_line *line = &vgpio->lines[d->hwirq]; + + line->masked = false; + line->masked_pending = true; +} + +static int virtio_gpio_irq_set_type(struct irq_data *d, unsigned int type) +{ + struct gpio_chip *gc = irq_data_to_gpio_chip(d); + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + struct vgpio_line *line = &vgpio->lines[d->hwirq]; + u8 
irq_type; + + switch (type) { + case IRQ_TYPE_NONE: + irq_type = VIRTIO_GPIO_IRQ_TYPE_NONE; + break; + case IRQ_TYPE_EDGE_RISING: + irq_type = VIRTIO_GPIO_IRQ_TYPE_EDGE_RISING; + break; + case IRQ_TY
[PATCH V3 1/3] gpio: Add virtio-gpio driver
From: "Enrico Weigelt, metux IT consult" This patch adds a new driver for Virtio based GPIO devices. This allows a guest VM running Linux to access GPIO device provided by the host. It supports all basic operations for now, except interrupts for the GPIO lines. Signed-off-by: "Enrico Weigelt, metux IT consult" Co-developed-by: Viresh Kumar Signed-off-by: Viresh Kumar --- drivers/gpio/Kconfig | 9 + drivers/gpio/Makefile| 1 + drivers/gpio/gpio-virtio.c | 326 +++ include/uapi/linux/virtio_gpio.h | 41 include/uapi/linux/virtio_ids.h | 1 + 5 files changed, 378 insertions(+) create mode 100644 drivers/gpio/gpio-virtio.c create mode 100644 include/uapi/linux/virtio_gpio.h diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig index e3607ec4c2e8..7f12fb65d130 100644 --- a/drivers/gpio/Kconfig +++ b/drivers/gpio/Kconfig @@ -1633,6 +1633,15 @@ config GPIO_MOCKUP tools/testing/selftests/gpio/gpio-mockup.sh. Reference the usage in it. +config GPIO_VIRTIO + tristate "VirtIO GPIO support" + depends on VIRTIO + help + Say Y here to enable guest support for virtio-based GPIO controllers. + + These virtual GPIOs can be routed to real GPIOs or attached to + simulators on the host (like QEMU). 
+ endmenu endif diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile index c58a90a3c3b1..ace004c80871 100644 --- a/drivers/gpio/Makefile +++ b/drivers/gpio/Makefile @@ -162,6 +162,7 @@ obj-$(CONFIG_GPIO_UCB1400) += gpio-ucb1400.o obj-$(CONFIG_GPIO_UNIPHIER)+= gpio-uniphier.o obj-$(CONFIG_GPIO_VF610) += gpio-vf610.o obj-$(CONFIG_GPIO_VIPERBOARD) += gpio-viperboard.o +obj-$(CONFIG_GPIO_VIRTIO) += gpio-virtio.o obj-$(CONFIG_GPIO_VISCONTI)+= gpio-visconti.o obj-$(CONFIG_GPIO_VR41XX) += gpio-vr41xx.o obj-$(CONFIG_GPIO_VX855) += gpio-vx855.o diff --git a/drivers/gpio/gpio-virtio.c b/drivers/gpio/gpio-virtio.c new file mode 100644 index ..1ba8ddf4693a --- /dev/null +++ b/drivers/gpio/gpio-virtio.c @@ -0,0 +1,326 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * GPIO driver for virtio-based virtual GPIO controllers + * + * Copyright (C) 2021 metux IT consult + * Enrico Weigelt, metux IT consult + * + * Copyright (C) 2021 Linaro. + * Viresh Kumar + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct virtio_gpio { + struct virtio_device *vdev; + struct mutex lock; + struct gpio_chip gc; + + struct completion completion; + struct virtqueue *command_vq; + struct virtio_gpio_request creq; + struct virtio_gpio_response cres; + + /* This must be kept as the last entry here, hint: gpio_names[0] */ + struct virtio_gpio_config config; +}; + +#define gpio_chip_to_vgpio(chip) container_of(chip, struct virtio_gpio, gc) + +static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, + u8 txdata, u8 *rxdata) +{ + struct virtio_gpio_response *res = &vgpio->cres; + struct virtio_gpio_request *req = &vgpio->creq; + struct scatterlist *sgs[2], req_sg, res_sg; + struct device *dev = &vgpio->vdev->dev; + unsigned long time_left; + unsigned int len; + int ret; + + req->type = cpu_to_le16(type); + req->gpio = cpu_to_le16(gpio); + req->data = txdata; + + sg_init_one(&req_sg, req, sizeof(*req)); + sg_init_one(&res_sg, 
res, sizeof(*res)); + sgs[0] = &req_sg; + sgs[1] = &res_sg; + + mutex_lock(&vgpio->lock); + ret = virtqueue_add_sgs(vgpio->command_vq, sgs, 1, 1, res, GFP_KERNEL); + if (ret) { + dev_err(dev, "failed to add request to vq\n"); + goto out; + } + + reinit_completion(&vgpio->completion); + virtqueue_kick(vgpio->command_vq); + + time_left = wait_for_completion_timeout(&vgpio->completion, HZ / 10); + if (!time_left) { + dev_err(dev, "virtio GPIO backend timeout\n"); + return -ETIMEDOUT; + } + + WARN_ON(res != virtqueue_get_buf(vgpio->command_vq, &len)); + if (unlikely(res->status != VIRTIO_GPIO_STATUS_OK)) { + dev_err(dev, "virtio GPIO request failed: %d\n", gpio); + return -EINVAL; + } + + if (rxdata) + *rxdata = res->data; + +out: + mutex_unlock(&vgpio->lock); + + return ret; +} + +static int virtio_gpio_request(struct gpio_chip *gc, unsigned int gpio) +{ + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_ACTIVATE, gpio, 0, NULL); +} + +static void virtio_gpio_free(struct gpio_chip *gc, unsigned int gpio) +{ + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + + virtio_gpio_req(
[PATCH V3 0/3] gpio: Add virtio based driver
Hello, This adds a virtio-based GPIO driver based on the proposed specification [1]. The first two versions [2] were sent by Enrico earlier and here is a newer version. I have retained the authorship of the work done by Enrico (1st patch) to make sure we don't lose his credits. I have tested all basic operations of the patchset (with a Qemu guest) with the libgpiod utility. I have also tested the handling of interrupts on the guest side. All works as expected. I will now be working towards a Rust-based hypervisor-agnostic backend to provide a generic solution for that. This should be merged only after the specifications are merged; I will keep everyone posted on the same. I am not adding a version history here as a lot of changes have been made, from protocol to code, and this really needs a full review from scratch. -- Viresh [1] https://lists.oasis-open.org/archives/virtio-comment/202106/msg00022.html [2] https://lore.kernel.org/linux-gpio/20201203191135.21576-2-i...@metux.net/ Enrico Weigelt, metux IT consult (1): gpio: Add virtio-gpio driver Viresh Kumar (2): gpio: virtio: Add IRQ support MAINTAINERS: Add entry for Virtio-gpio MAINTAINERS | 7 + drivers/gpio/Kconfig | 9 + drivers/gpio/Makefile| 1 + drivers/gpio/gpio-virtio.c | 566 +++ include/uapi/linux/virtio_gpio.h | 56 +++ include/uapi/linux/virtio_ids.h | 1 + 6 files changed, 640 insertions(+) create mode 100644 drivers/gpio/gpio-virtio.c create mode 100644 include/uapi/linux/virtio_gpio.h -- 2.31.1.272.g89b43f80a514
Re: [PATCH v4 3/6] x86/sev-es: Split up runtime #VC handler for correct state tracking
Hi Peter, On Thu, Jun 10, 2021 at 12:19:43PM +0200, Peter Zijlstra wrote: > On Thu, Jun 10, 2021 at 11:11:38AM +0200, Joerg Roedel wrote: > > > +static void vc_handle_from_kernel(struct pt_regs *regs, unsigned long > > error_code) > > static noinstr ... Right, I forgot that, will update the patch and add the correct noinstr annotations. > > + if (user_mode(regs)) > > + vc_handle_from_user(regs, error_code); > > + else > > + vc_handle_from_kernel(regs, error_code); > > } > > #DB and MCE use idtentry_mce_db and split out in asm. When I look at > idtentry_vc, it appears to me that VC_SAFE_STACK already implies > from-user, or am I reading that wrong? VC_SAFE_STACK does not imply from-user. It means that the #VC handler asm code was able to switch away from the IST stack to either the task-stack (if from-user or syscall gap) or to the previous kernel stack. There is a check in vc_switch_off_ist() that shows which stacks are considered safe. If it can not switch to a safe stack the VC entry code switches to the fall-back stack and a special handler function is called, which for now just panics the system. > How about you don't do that and have exc_ call your new from_kernel > function, then we know that safe_stack_ is always from-user. Then also > maybe do: > > s/VS_SAFE_STACK/VC_USER/ > s/safe_stack_/noist_/ > > to match all the others (#DB/MCE). So #VC is different from #DB and #MCE in that it switches stacks even when coming from kernel mode, so that the #VC handler can be nested. What I can do is to call the from_user function directly from asm in the .Lfrom_user_mode_switch_stack path. That will avoid having another from_user check in C code. > DEFINE_IDTENTRY_VC(exc_vc) > { > if (unlikely(on_vc_fallback_stack(regs))) { > instrumentation_begin(); > panic("boohooo\n"); > instrumentation_end(); The on_vc_fallback_stack() path is for now only calling panic(), because it can't be hit when the hypervisor is behaving correctly. 
In the future it is not clear yet if that path needs to be extended for SNP page validation exceptions, which can basically happen anywhere. The implementation of SNP should make sure that all memory touched during entry (while on unsafe stacks) is always validated, but not sure yet if that holds when live-migration of SNP guests is added to the picture. There is the possibility that this doesn't fit in the above branch, but it can also be moved to a separate function if needed. > } > > vc_from_kernel(); > } > > DEFINE_IDTENTRY_VC_USER(exc_vc) > { > vc_from_user(); > } > > Which is, I'm thinking, much simpler, no? Okay, I am going to try this out. Thanks for your feedback. Regards, Joerg ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 3/6] x86/sev-es: Split up runtime #VC handler for correct state tracking
Bah, I suppose the trouble is that this SEV crap requires PARAVIRT? I should really get around to fixing noinstr validation with PARAVIRT on :-( On Thu, Jun 10, 2021 at 11:11:38AM +0200, Joerg Roedel wrote: > +static void vc_handle_from_kernel(struct pt_regs *regs, unsigned long > error_code) static noinstr ... > +{ > + irqentry_state_t irq_state = irqentry_nmi_enter(regs); > > + instrumentation_begin(); > > + if (!vc_raw_handle_exception(regs, error_code)) { > /* Show some debug info */ > show_regs(regs); > > @@ -1434,7 +1400,59 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) > panic("Returned from Terminate-Request to Hypervisor\n"); > } > > + instrumentation_end(); > + irqentry_nmi_exit(regs, irq_state); > +} > + > +static void vc_handle_from_user(struct pt_regs *regs, unsigned long > error_code) static noinstr ... > +{ > + irqentry_state_t irq_state = irqentry_enter(regs); > + > + instrumentation_begin(); > + > + if (!vc_raw_handle_exception(regs, error_code)) { > + /* > + * Do not kill the machine if user-space triggered the > + * exception. Send SIGBUS instead and let user-space deal with > + * it. > + */ > + force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0); > + } > + > + instrumentation_end(); > + irqentry_exit(regs, irq_state); > +} + linebreak > +/* > + * Main #VC exception handler. It is called when the entry code was able to > + * switch off the IST to a safe kernel stack. > + * > + * With the current implementation it is always possible to switch to a safe > + * stack because #VC exceptions only happen at known places, like intercepted > + * instructions or accesses to MMIO areas/IO ports. They can also happen with > + * code instrumentation when the hypervisor intercepts #DB, but the critical > + * paths are forbidden to be instrumented, so #DB exceptions currently also > + * only happen in safe places. 
> + */ > +DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) > +{ > + /* > + * Handle #DB before calling into !noinstr code to avoid recursive #DB. > + */ > + if (error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB) { > + vc_handle_trap_db(regs); > + return; > + } > + > + /* > + * This is invoked through an interrupt gate, so IRQs are disabled. The > + * code below might walk page-tables for user or kernel addresses, so > + * keep the IRQs disabled to protect us against concurrent TLB flushes. > + */ > + > + if (user_mode(regs)) > + vc_handle_from_user(regs, error_code); > + else > + vc_handle_from_kernel(regs, error_code); > } #DB and MCE use idtentry_mce_db and split out in asm. When I look at idtentry_vc, it appears to me that VC_SAFE_STACK already implies from-user, or am I reading that wrong? Ah, it appears you're muddling things up again by then also calling safe_stack_ from exc_. How about you don't do that and have exc_ call your new from_kernel function, then we know that safe_stack_ is always from-user. Then also maybe do: s/VS_SAFE_STACK/VC_USER/ s/safe_stack_/noist_/ to match all the others (#DB/MCE). Also, AFAICT, you don't actually need DEFINE_IDTENTRY_VC_IST, it doesn't have an ASM counterpart. So then you end up with something like: DEFINE_IDTENTRY_VC(exc_vc) { if (unlikely(on_vc_fallback_stack(regs))) { instrumentation_begin(); panic("boohooo\n"); instrumentation_end(); } vc_from_kernel(); } DEFINE_IDTENTRY_VC_USER(exc_vc) { vc_from_user(); } Which is, I'm thinking, much simpler, no? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC v1 0/6] virtio/vsock: introduce SOCK_DGRAM support
On Thu, Jun 10, 2021 at 03:46:55PM +0800, Jason Wang wrote: On 2021/6/10 3:23 PM, Stefano Garzarella wrote: On Thu, Jun 10, 2021 at 12:02:35PM +0800, Jason Wang wrote: On 2021/6/10 11:43 AM, Jiang Wang wrote: On Wed, Jun 9, 2021 at 6:51 PM Jason Wang wrote: On 2021/6/10 7:24 AM, Jiang Wang wrote: This patchset implements support of SOCK_DGRAM for virtio transport. Datagram sockets are connectionless and unreliable. To avoid unfair contention with stream and other sockets, add two more virtqueues and a new feature bit to indicate if those two new queues exist or not. Dgram does not use the existing credit update mechanism for stream sockets. When sending from the guest/driver, packets are sent synchronously, so the sender will get an error when the virtqueue is full. When sending from the host/device, packets are sent asynchronously because the descriptor memory belongs to the corresponding QEMU process. What's the use case for the datagram vsock? One use case is for non-critical info logging from the guest to the host, such as the performance data of some applications. Anything that prevents you from using the stream socket? It can also be used to replace UDP communications between the guest and the host. Any advantage for VSOCK in this case? Is it for performance (I guess not since I don't expect vsock will be faster). I think the general advantage of using vsock is for the guest agents that potentially don't need any configuration. Right, I wonder if we really need datagram considering the host-to-guest communication is reliable. (Note that I don't object to it since vsock has already supported that, just wondering about its use cases) Yep, it was the same concern I had :-) Also because we're now adding SEQPACKET, which provides reliable datagram support. But IIUC the use case is logging, where you don't need reliable communication and you want to avoid keeping more open connections with different guests.
So the server in the host can be pretty simple and doesn't have to handle connections. It just waits for datagrams on a port. An obvious drawback is that it breaks the migration. Using UDP you can have very rich feature support from the kernel that vsock can't. Thanks for bringing this up! What features does UDP support that datagram on vsock could not? E.g. the sendpage() and busy polling. And using UDP means qdiscs and eBPF can work. Thanks, I see! The virtio spec patch is here: https://www.spinics.net/lists/linux-virtualization/msg50027.html Having had a quick glance, I suggest splitting the mergeable rx buffer into a separate patch. Sure. But I think it's time to revisit the idea of unifying virtio-net and virtio-vsock. Otherwise we're duplicating features and bugs. For mergeable-rxbuf-related code, I think a set of common helper functions can be used by both virtio-net and virtio-vsock. For other parts, that may not be very beneficial. I will think about it more. If there is a previous email discussion about this topic, could you send me some links? I did a quick web search but did not find any related info. Thanks. We had a lot: [1] https://patchwork.kernel.org/project/kvm/patch/5bdff537.3050...@huawei.com/ [2] https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039798.html [3] https://www.lkml.org/lkml/2020/1/16/2043 When I tried it, the biggest problems that blocked me were all the features strictly related to the TCP/IP stack and ethernet devices that the vsock device doesn't know how to handle: TSO, GSO, checksums, MAC, napi, xdp, min ethernet frame size, MTU, etc. It depends on which level we want to share: 1) sharing code 2) sharing devices 3) making vsock a protocol that is understood by the network core. We can start from 1); the low-level tx/rx logic can be shared between virtio-net and vhost-net. For 2) we probably need some work on the spec, probably with a new feature bit to indicate that it's a vsock device, not an ethernet device.
Then if it is probed as a vsock device we won't let packets be delivered to the TCP/IP stack. For 3), it would be even harder and I'm not sure it's worth doing. So in my opinion unifying them is not so simple, because vsock is not really an ethernet device, but simply a socket. We can start from sharing code. Yep, I agree, and maybe the mergeable buffer is a good starting point for sharing code! @Jiang, do you want to take a look at this possibility? Thanks, Stefano
Re: [PATCH 00/19] Microsoft Hypervisor root partition ioctl interface
Nuno Das Neves writes: > This patch series provides a userspace interface for creating and running > guest > virtual machines while running on the Microsoft Hypervisor [0]. > > Since managing guest machines can only be done when Linux is the root > partition, > this series depends on Wei Liu's patch series merged in 5.12: > https://lore.kernel.org/linux-hyperv/20210203150435.27941-1-wei@kernel.org/ > > The first two patches provide some helpers for converting hypervisor status > codes to linux error codes, and printing hypervisor status codes to dmesg for > debugging. > > Hyper-V related headers asm-generic/hyperv-tlfs.h and x86/asm/hyperv-tlfs.h > are > split into uapi and non-uapi. The uapi versions contain structures used in > both > the ioctl interface and the kernel. > > The mshv API is introduced in drivers/hv/mshv_main.c. As each interface is > introduced, documentation is added in Documentation/virt/mshv/api.rst. > The API is file-descriptor based, like KVM. The entry point is /dev/mshv. > > /dev/mshv ioctls: > MSHV_CHECK_EXTENSION > MSHV_CREATE_PARTITION > > Partition (vm) ioctls: > MSHV_MAP_GUEST_MEMORY, MSHV_UNMAP_GUEST_MEMORY > MSHV_INSTALL_INTERCEPT > MSHV_ASSERT_INTERRUPT > MSHV_GET_PARTITION_PROPERTY, MSHV_SET_PARTITION_PROPERTY > MSHV_CREATE_VP > > Vp (vcpu) ioctls: > MSHV_GET_VP_REGISTERS, MSHV_SET_VP_REGISTERS > MSHV_RUN_VP > MSHV_GET_VP_STATE, MSHV_SET_VP_STATE > MSHV_TRANSLATE_GVA > mmap() (register page) > > [0] Hyper-V is more well-known, but it really refers to the whole stack > including the hypervisor and other components that run in Windows kernel > and userspace. > > Changes since RFC: > 1. Moved code from virt/mshv to drivers/hv > 2. Split hypercall helper functions and synic code to hv_call.c and hv_synic.c > 3. MSHV_REQUEST_VERSION ioctl replaced with MSHV_CHECK_EXTENSION > 3. Numerous suggestions, fixes, style changes, etc. from Michael Kelley, Vitaly >Kuznetsov, Wei Liu, and Vineeth Pillai > 4.
Added patch to enable hypervisor enlightenments on partition creation > 5. Added Wei Liu's patch for GVA to GPA translation > Thank you for addressing these! One nitpick though: could you please run your next submission through 'scripts/checkpatch.pl'? It reports a number of issues here, mostly minor but still worth addressing, i.e.: $ scripts/checkpatch.pl *.patch ... --- 0002-asm-generic-hyperv-convert-hyperv-statuses-to-string.patch --- ERROR: Macros with complex values should be enclosed in parentheses #95: FILE: include/asm-generic/hyperv-tlfs.h:192: +#define __HV_STATUS_DEF(OP) \ + OP(HV_STATUS_SUCCESS, 0x0) \ ... ERROR: Macros with complex values should be enclosed in parentheses #119: FILE: include/asm-generic/hyperv-tlfs.h:216: +#define __HV_MAKE_HV_STATUS_ENUM(NAME, VAL) NAME = (VAL), ERROR: Macros with multiple statements should be enclosed in a do - while loop #120: FILE: include/asm-generic/hyperv-tlfs.h:217: +#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME); WARNING: Macros with flow control statements should be avoided #120: FILE: include/asm-generic/hyperv-tlfs.h:217: +#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME); WARNING: macros should not use a trailing semicolon #120: FILE: include/asm-generic/hyperv-tlfs.h:217: +#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME); total: 3 errors, 2 warnings, 108 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. 0002-asm-generic-hyperv-convert-hyperv-statuses-to-string.patch has style problems, please review. ... --- 0004-drivers-hv-check-extension-ioctl.patch --- WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? #36: new file mode 100644 WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP #131: FILE: drivers/hv/mshv_main.c:52: + return -ENOTSUPP; total: 0 errors, 2 warnings, 137 lines checked ... 
WARNING: Improper SPDX comment style for 'drivers/hv/hv_call.c', please use '//' instead #94: FILE: drivers/hv/hv_call.c:1: +/* SPDX-License-Identifier: GPL-2.0-only */ WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 #94: FILE: drivers/hv/hv_call.c:1: +/* SPDX-License-Identifier: GPL-2.0-only */ ERROR: "(foo*)" should be "(foo *)" #178: FILE: drivers/hv/hv_call.c:85: + *(u64*)&input); ERROR: "(foo*)" should be "(foo *)" #201: FILE: drivers/hv/hv_call.c:108: + *(u64*)&input); ERROR: "(foo*)" should be "(foo *)" #215: FILE: drivers/hv/hv_call.c:122: + status = hv_do_fast_hypercall8(HVCALL_DELETE_PARTITION, *(u64*)&input); total: 3 errors, 3 warnings, 330 lines checked ...
[PATCH v4 3/6] x86/sev-es: Split up runtime #VC handler for correct state tracking
From: Joerg Roedel Split up the #VC handler code into a from-user and a from-kernel part. This allows clean and correct state tracking, as the #VC handler needs to enter NMI-state when raised from kernel mode and plain IRQ state when raised from user-mode. Fixes: 62441a1fb532 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler") Suggested-by: Peter Zijlstra Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c | 118 -- 1 file changed, 68 insertions(+), 50 deletions(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 2a922d1b03c8..475bbc1b3547 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -1326,43 +1326,14 @@ static __always_inline bool on_vc_fallback_stack(struct pt_regs *regs) return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2)); } -/* - * Main #VC exception handler. It is called when the entry code was able to - * switch off the IST to a safe kernel stack. - * - * With the current implementation it is always possible to switch to a safe - * stack because #VC exceptions only happen at known places, like intercepted - * instructions or accesses to MMIO areas/IO ports. They can also happen with - * code instrumentation when the hypervisor intercepts #DB, but the critical - * paths are forbidden to be instrumented, so #DB exceptions currently also - * only happen in safe places. - */ -DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) +static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code) { - irqentry_state_t irq_state; struct ghcb_state state; struct es_em_ctxt ctxt; enum es_result result; unsigned long flags; struct ghcb *ghcb; - - /* -* Handle #DB before calling into !noinstr code to avoid recursive #DB. 
-*/ - if (error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB) { - vc_handle_trap_db(regs); - return; - } - - irq_state = irqentry_nmi_enter(regs); - lockdep_assert_irqs_disabled(); - instrumentation_begin(); - - /* -* This is invoked through an interrupt gate, so IRQs are disabled. The -* code below might walk page-tables for user or kernel addresses, so -* keep the IRQs disabled to protect us against concurrent TLB flushes. -*/ + bool ret = true; ghcb = sev_es_get_ghcb(&state, &flags); @@ -1382,15 +1353,18 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) case ES_UNSUPPORTED: pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n", error_code, regs->ip); - goto fail; + ret = false; + break; case ES_VMM_ERROR: pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n", error_code, regs->ip); - goto fail; + ret = false; + break; case ES_DECODE_FAILED: pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n", error_code, regs->ip); - goto fail; + ret = false; + break; case ES_EXCEPTION: vc_forward_exception(&ctxt); break; @@ -1406,24 +1380,16 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) BUG(); } -out: - instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + return ret; +} - return; +static void vc_handle_from_kernel(struct pt_regs *regs, unsigned long error_code) +{ + irqentry_state_t irq_state = irqentry_nmi_enter(regs); -fail: - if (user_mode(regs)) { - /* -* Do not kill the machine if user-space triggered the -* exception. Send SIGBUS instead and let user-space deal with -* it. 
-*/ - force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0); - } else { - pr_emerg("PANIC: Unhandled #VC exception in kernel space (result=%d)\n", -result); + instrumentation_begin(); + if (!vc_raw_handle_exception(regs, error_code)) { /* Show some debug info */ show_regs(regs); @@ -1434,7 +1400,59 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) panic("Returned from Terminate-Request to Hypervisor\n"); } - goto out; + instrumentation_end(); + irqentry_nmi_exit(regs, irq_state); +} + +static void vc_handle_from_user(struct pt_regs *regs, unsigned long error_code) +{ + irqentry_state_t irq_state = irqentry_enter(regs); + + instrumentation_begin(); + + if (!vc_raw_handle_exception(regs, error_code)) { + /* +
[PATCH v4 4/6] x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip()
From: Joerg Roedel In theory 0 is a valid value for the instruction pointer, so don't use it as the error return value from insn_get_effective_ip(). Signed-off-by: Joerg Roedel --- arch/x86/lib/insn-eval.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index a67afd74232c..4eecb9c7c6a0 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -1417,7 +1417,7 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) } } -static unsigned long insn_get_effective_ip(struct pt_regs *regs) +static int insn_get_effective_ip(struct pt_regs *regs, unsigned long *ip) { unsigned long seg_base = 0; @@ -1430,10 +1430,12 @@ static unsigned long insn_get_effective_ip(struct pt_regs *regs) if (!user_64bit_mode(regs)) { seg_base = insn_get_seg_base(regs, INAT_SEG_REG_CS); if (seg_base == -1L) - return 0; + return -EINVAL; } - return seg_base + regs->ip; + *ip = seg_base + regs->ip; + + return 0; } /** @@ -1455,8 +1457,7 @@ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) unsigned long ip; int not_copied; - ip = insn_get_effective_ip(regs); - if (!ip) + if (insn_get_effective_ip(regs, &ip)) return 0; not_copied = copy_from_user(buf, (void __user *)ip, MAX_INSN_SIZE); @@ -1484,8 +1485,7 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs, unsigned char buf[MAX_IN unsigned long ip; int not_copied; - ip = insn_get_effective_ip(regs); - if (!ip) + if (insn_get_effective_ip(regs, &ip)) return 0; not_copied = __copy_from_user_inatomic(buf, (void __user *)ip, MAX_INSN_SIZE); -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 5/6] x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]()
From: Joerg Roedel The error reporting from the insn_fetch_from_user*() functions is not very verbose. Extend it to include information on whether the linear RIP could not be calculated or whether the memory access faulted. This will be used in the SEV-ES code to propagate the correct exception depending on what went wrong during instruction fetch. Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c| 8 arch/x86/kernel/umip.c | 10 -- arch/x86/lib/insn-eval.c | 8 ++-- 3 files changed, 14 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 475bbc1b3547..a7045a5a94ca 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -267,17 +267,17 @@ static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt, static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt) { char buffer[MAX_INSN_SIZE]; - int res; + int insn_bytes; - res = insn_fetch_from_user_inatomic(ctxt->regs, buffer); - if (!res) { + insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer); + if (insn_bytes <= 0) { ctxt->fi.vector = X86_TRAP_PF; ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER; ctxt->fi.cr2= ctxt->regs->ip; return ES_EXCEPTION; } - if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, res)) + if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes)) return ES_DECODE_FAILED; if (ctxt->insn.immediate.got) diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index 8daa70b0d2da..337178809c89 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -346,14 +346,12 @@ bool fixup_umip_exception(struct pt_regs *regs) if (!regs) return false; - nr_copied = insn_fetch_from_user(regs, buf); - /* -* The insn_fetch_from_user above could have failed if user code -* is protected by a memory protection key. Give up on emulation -* in such a case. Should we issue a page fault? +* Give up on emulation if fetching the instruction failed. Should we +* issue a page fault or a #GP? 
*/ - if (!nr_copied) + nr_copied = insn_fetch_from_user(regs, buf); + if (nr_copied <= 0) return false; if (!insn_decode_from_regs(&insn, regs, buf, nr_copied)) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 4eecb9c7c6a0..1b5cdf8b7a4e 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -1451,6 +1451,8 @@ static int insn_get_effective_ip(struct pt_regs *regs, unsigned long *ip) * Number of instruction bytes copied. * * 0 if nothing was copied. + * + * -EINVAL if the linear address of the instruction could not be calculated */ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) { @@ -1458,7 +1460,7 @@ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) int not_copied; if (insn_get_effective_ip(regs, &ip)) - return 0; + return -EINVAL; not_copied = copy_from_user(buf, (void __user *)ip, MAX_INSN_SIZE); @@ -1479,6 +1481,8 @@ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) * Number of instruction bytes copied. * * 0 if nothing was copied. + * + * -EINVAL if the linear address of the instruction could not be calculated */ int insn_fetch_from_user_inatomic(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) { @@ -1486,7 +1490,7 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs, unsigned char buf[MAX_IN int not_copied; if (insn_get_effective_ip(regs, &ip)) - return 0; + return -EINVAL; not_copied = __copy_from_user_inatomic(buf, (void __user *)ip, MAX_INSN_SIZE); -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 6/6] x86/sev-es: Propagate #GP if getting linear instruction address failed
From: Joerg Roedel When an instruction is fetched from user-space, segmentation needs to be taken into account. This means that getting the linear address of an instruction can fail. Hardware would raise a #GP exception in that case, but the #VC exception handler would emulate it as a page-fault. The insn_fetch_from_user*() functions now provide the relevant information in case of an failure. Use that and propagate a #GP when the linear address of an instruction to fetch could not be calculated. Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index a7045a5a94ca..80c0d8385def 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -270,11 +270,18 @@ static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt) int insn_bytes; insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer); - if (insn_bytes <= 0) { + if (insn_bytes == 0) { + /* Nothing could be copied */ ctxt->fi.vector = X86_TRAP_PF; ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER; ctxt->fi.cr2= ctxt->regs->ip; return ES_EXCEPTION; + } else if (insn_bytes == -EINVAL) { + /* Effective RIP could not be calculated */ + ctxt->fi.vector = X86_TRAP_GP; + ctxt->fi.error_code = 0; + ctxt->fi.cr2= 0; + return ES_EXCEPTION; } if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes)) -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 2/6] x86/sev-es: Disable IRQs while GHCB is active
From: Joerg Roedel The #VC handler only cares about IRQs being disabled while the GHCB is active, as it must not be interrupted by something which could cause another #VC while it holds the GHCB (NMI is the exception for which the backup GHCB is there). Make sure nothing interrupts the code path while the GHCB is active by disabling IRQs in sev_es_get_ghcb() and restoring the previous irq state in sev_es_put_ghcb(). Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c | 39 +-- 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 4fd997bbf059..2a922d1b03c8 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -192,14 +192,23 @@ void noinstr __sev_es_ist_exit(void) this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], *(unsigned long *)ist); } -static __always_inline struct ghcb *sev_es_get_ghcb(struct ghcb_state *state) +static __always_inline struct ghcb *sev_es_get_ghcb(struct ghcb_state *state, + unsigned long *flags) { struct sev_es_runtime_data *data; struct ghcb *ghcb; + /* +* Nothing shall interrupt this code path while holding the per-cpu +* GHCB. The backup GHCB is only for NMIs interrupting this path. 
+*/ + local_irq_save(*flags); + data = this_cpu_read(runtime_data); ghcb = &data->ghcb_page; + + if (unlikely(data->ghcb_active)) { /* GHCB is already in use - save its contents */ @@ -479,7 +488,8 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt /* Include code shared with pre-decompression boot stage */ #include "sev-shared.c" -static __always_inline void sev_es_put_ghcb(struct ghcb_state *state) +static __always_inline void sev_es_put_ghcb(struct ghcb_state *state, + unsigned long flags) { struct sev_es_runtime_data *data; struct ghcb *ghcb; @@ -500,14 +510,17 @@ static __always_inline void sev_es_put_ghcb(struct ghcb_state *state) vc_ghcb_invalidate(ghcb); data->ghcb_active = false; } + + local_irq_restore(flags); } void noinstr __sev_es_nmi_complete(void) { struct ghcb_state state; + unsigned long flags; struct ghcb *ghcb; - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); vc_ghcb_invalidate(ghcb); ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE); @@ -517,7 +530,7 @@ void noinstr __sev_es_nmi_complete(void) sev_es_wr_ghcb_msr(__pa_nodebug(ghcb)); VMGEXIT(); - sev_es_put_ghcb(&state); + sev_es_put_ghcb(&state, flags); } static u64 get_jump_table_addr(void) @@ -527,9 +540,7 @@ static u64 get_jump_table_addr(void) struct ghcb *ghcb; u64 ret = 0; - local_irq_save(flags); - - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); vc_ghcb_invalidate(ghcb); ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_JUMP_TABLE); @@ -543,9 +554,7 @@ static u64 get_jump_table_addr(void) ghcb_sw_exit_info_2_is_valid(ghcb)) ret = ghcb->save.sw_exit_info_2; - sev_es_put_ghcb(&state); - - local_irq_restore(flags); + sev_es_put_ghcb(&state, flags); return ret; } @@ -666,9 +675,10 @@ static bool __init sev_es_setup_ghcb(void) static void sev_es_ap_hlt_loop(void) { struct ghcb_state state; + unsigned long flags; struct ghcb *ghcb; - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); while 
(true) { vc_ghcb_invalidate(ghcb); @@ -685,7 +695,7 @@ static void sev_es_ap_hlt_loop(void) break; } - sev_es_put_ghcb(&state); + sev_es_put_ghcb(&state, flags); } /* @@ -1333,6 +1343,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) struct ghcb_state state; struct es_em_ctxt ctxt; enum es_result result; + unsigned long flags; struct ghcb *ghcb; /* @@ -1353,7 +1364,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) * keep the IRQs disabled to protect us against concurrent TLB flushes. */ - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); vc_ghcb_invalidate(ghcb); result = vc_init_em_ctxt(&ctxt, regs, error_code); @@ -1361,7 +1372,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) if (result == ES_OK) result = vc_handle_exitcode(&ctxt, ghcb, error_code); - sev_es_put_ghcb(&state); + sev_es_put_ghcb(&state, flags); /* Done - now check the result */ switch (result) { -- 2.31.1 ___ Virtualization mailing
[PATCH v4 0/6] x86/sev-es: Fixes for SEV-ES Guest Support
From: Joerg Roedel Hi, here is the next revision of my pending fixes for Linux' SEV-ES support. Changes to the previous version are: - Removed first patch which is now in tip/x86/urgent already - Removed patch "x86/sev-es: Run #VC handler in plain IRQ state" and replaced it with "x86/sev-es: Split up runtime #VC handler for correct state tracking" as per suggestion from PeterZ Changes are based on tip/x86/urgent. Please review. Thanks, Joerg Joerg Roedel (6): x86/sev-es: Fix error message in runtime #VC handler x86/sev-es: Disable IRQs while GHCB is active x86/sev-es: Split up runtime #VC handler for correct state tracking x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip() x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]() x86/sev-es: Propagate #GP if getting linear instruction address failed arch/x86/kernel/sev.c| 174 +++ arch/x86/kernel/umip.c | 10 +-- arch/x86/lib/insn-eval.c | 22 +++-- 3 files changed, 122 insertions(+), 84 deletions(-) base-commit: efa165504943f2128d50f63de0c02faf6dcceb0d -- 2.31.1
[PATCH v4 1/6] x86/sev-es: Fix error message in runtime #VC handler
From: Joerg Roedel

The runtime #VC handler is not "early" anymore. Fix the copy&paste
error and remove that word from the error message.

Signed-off-by: Joerg Roedel
---
 arch/x86/kernel/sev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 651b81cd648e..4fd997bbf059 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -1369,7 +1369,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication)
 		vc_finish_insn(&ctxt);
 		break;
 	case ES_UNSUPPORTED:
-		pr_err_ratelimited("Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+		pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
 				   error_code, regs->ip);
 		goto fail;
 	case ES_VMM_ERROR:
-- 
2.31.1
[PATCH net-next v5 11/15] virtio-net: move to virtio_net.h
Move some structure definitions and inline functions into the virtio_net.h file. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/virtio_net.c | 225 +-- drivers/net/virtio/virtio_net.h | 230 2 files changed, 232 insertions(+), 223 deletions(-) create mode 100644 drivers/net/virtio/virtio_net.h diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 953739860563..395ec1f18331 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -4,24 +4,8 @@ * Copyright 2007 Rusty Russell IBM Corporation */ //#define DEBUG -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include + +#include "virtio_net.h" static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); @@ -44,15 +28,6 @@ module_param(napi_tx, bool, 0644); #define VIRTIO_XDP_TX BIT(0) #define VIRTIO_XDP_REDIR BIT(1) -#define VIRTIO_XDP_FLAGBIT(0) - -/* RX packet size EWMA. The average packet size is used to determine the packet - * buffer size when refilling RX rings. As the entire RX ring may be refilled - * at once, the weight is chosen so that the EWMA will be insensitive to short- - * term, transient changes in packet size. 
- */ -DECLARE_EWMA(pkt_len, 0, 64) - #define VIRTNET_DRIVER_VERSION "1.0.0" static const unsigned long guest_offloads[] = { @@ -68,35 +43,6 @@ static const unsigned long guest_offloads[] = { (1ULL << VIRTIO_NET_F_GUEST_ECN) | \ (1ULL << VIRTIO_NET_F_GUEST_UFO)) -struct virtnet_stat_desc { - char desc[ETH_GSTRING_LEN]; - size_t offset; -}; - -struct virtnet_sq_stats { - struct u64_stats_sync syncp; - u64 packets; - u64 bytes; - u64 xdp_tx; - u64 xdp_tx_drops; - u64 kicks; -}; - -struct virtnet_rq_stats { - struct u64_stats_sync syncp; - u64 packets; - u64 bytes; - u64 drops; - u64 xdp_packets; - u64 xdp_tx; - u64 xdp_redirects; - u64 xdp_drops; - u64 kicks; -}; - -#define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m) -#define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m) - static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = { { "packets",VIRTNET_SQ_STAT(packets) }, { "bytes", VIRTNET_SQ_STAT(bytes) }, @@ -119,54 +65,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = { #define VIRTNET_SQ_STATS_LEN ARRAY_SIZE(virtnet_sq_stats_desc) #define VIRTNET_RQ_STATS_LEN ARRAY_SIZE(virtnet_rq_stats_desc) -/* Internal representation of a send virtqueue */ -struct send_queue { - /* Virtqueue associated with this send _queue */ - struct virtqueue *vq; - - /* TX: fragments + linear part + virtio header */ - struct scatterlist sg[MAX_SKB_FRAGS + 2]; - - /* Name of the send queue: output.$index */ - char name[40]; - - struct virtnet_sq_stats stats; - - struct napi_struct napi; -}; - -/* Internal representation of a receive virtqueue */ -struct receive_queue { - /* Virtqueue associated with this receive_queue */ - struct virtqueue *vq; - - struct napi_struct napi; - - struct bpf_prog __rcu *xdp_prog; - - struct virtnet_rq_stats stats; - - /* Chain pages by the private ptr. */ - struct page *pages; - - /* Average packet length for mergeable receive buffers. 
*/ - struct ewma_pkt_len mrg_avg_pkt_len; - - /* Page frag for packet buffer allocation. */ - struct page_frag alloc_frag; - - /* RX: fragments + linear part + virtio header */ - struct scatterlist sg[MAX_SKB_FRAGS + 2]; - - /* Min single buffer size for mergeable buffers case. */ - unsigned int min_buf_len; - - /* Name of this receive queue: input.$index */ - char name[40]; - - struct xdp_rxq_info xdp_rxq; -}; - /* Control VQ buffers: protected by the rtnl lock */ struct control_buf { struct virtio_net_ctrl_hdr hdr; @@ -178,67 +76,6 @@ struct control_buf { __virtio64 offloads; }; -struct virtnet_info { - struct virtio_device *vdev; - struct virtqueue *cvq; - struct net_device *dev; - struct send_queue *sq; - struct receive_queue *rq; - unsigned int status; - - /* Max # of queue pairs supported by the device */ - u16 max_queue_pairs; - - /* # of queue pairs currently used by the driver */ - u16 curr_queue_pairs; - - /* # of XDP queue pairs currently used by the driver */ - u16 xdp_queue_pairs; - - /* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */ - bool xdp_enabled; - - /* I like... big packets and I cannot lie! */ - bo
[PATCH net-next v5 03/15] virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
virtio-net does not use DMA addresses directly, so set the new
priv_flags bit IFF_NOT_USE_DMA_ADDR.

Signed-off-by: Xuan Zhuo
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0416a7e00914..6c1233f0ab3e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3064,7 +3064,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
-			   IFF_TX_SKB_NO_LINEAR;
+			   IFF_TX_SKB_NO_LINEAR | IFF_NOT_USE_DMA_ADDR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
-- 
2.31.0
[PATCH net-next v5 07/15] virtio-net: standalone virtnet_alloc_frag function
This logic is used by small and merge when adding buf, and the subsequent patch will also use this logic, so it is separated as an independent function. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 29 - 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d791543a8dd8..3fd87bf2b2ad 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -264,6 +264,22 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); } +static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len, + int gfp) +{ + struct page_frag *alloc_frag = &rq->alloc_frag; + char *buf; + + if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) + return NULL; + + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; + get_page(alloc_frag->page); + alloc_frag->offset += len; + + return buf; +} + static void __free_old_xmit(struct send_queue *sq, bool in_napi, struct virtnet_sq_stats *stats) { @@ -1190,7 +1206,6 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, gfp_t gfp) { - struct page_frag *alloc_frag = &rq->alloc_frag; char *buf; unsigned int xdp_headroom = virtnet_get_headroom(vi); void *ctx = (void *)(unsigned long)xdp_headroom; @@ -1199,12 +1214,10 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, len = SKB_DATA_ALIGN(len) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) + buf = virtnet_alloc_frag(rq, len, gfp); + if (unlikely(!buf)) return -ENOMEM; - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; - get_page(alloc_frag->page); - alloc_frag->offset += len; sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom, vi->hdr_len + GOOD_PACKET_LEN); err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, 
ctx, gfp); @@ -1295,13 +1308,11 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, * disabled GSO for XDP, it won't be a big issue. */ len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); - if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) + buf = virtnet_alloc_frag(rq, len + room, gfp); + if (unlikely(!buf)) return -ENOMEM; - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; buf += headroom; /* advance address leaving hole at front of pkt */ - get_page(alloc_frag->page); - alloc_frag->offset += len + room; hole = alloc_frag->size - alloc_frag->offset; if (hole < len + room) { /* To avoid internal fragmentation, if there is very likely not -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
After testing, the performance of calling kick for every packet is not
stable, and kicking only once after all packets are sent is also not
good. So add a module parameter that specifies how many packets are
sent before calling a kick. 8 is a relatively stable value with the
best performance.

Here is the pps of the test of xsk_kick_thr under different values
(from 1 to 64):

thr  PPS             thr  PPS             thr  PPS
1    2924116.74247   23   3683263.04348   45   2777907.22963
2    3441010.57191   24   3078880.13043   46   2781376.21739
3    3636728.72378   25   2859219.57656   47   2777271.91304
4    3637518.61468   26   2851557.9593    48   2800320.56575
5    3651738.16251   27   2834783.54408   49   2813039.87599
6    3652176.69231   28   2847012.41472   50   3445143.01839
7    3665415.80602   29   2860633.91304   51   3666918.01281
8    3665045.16555   30   2857903.5786    52   3059929.2709
9    3671023.2401    31   2835589.98963   53   2831515.21739
10   3669532.23274   32   2862827.88706   54   3451804.07204
11   3666160.37749   33   2871855.96696   55   3654975.92385
12   3674951.44813   34   3434456.44816   56   3676198.3188
13   3667447.57331   35   3656918.54177   57   3684740.85619
14   3018846.0503    36   3596921.16722   58   3060958.8594
15   2792773.84505   37   3603460.63903   59   2828874.57191
16   3430596.3602    38   3595410.87666   60   3459926.11027
17   3660525.85806   39   3604250.17819   61   3685444.47599
18   3045627.69054   40   3596542.28428   62   3049959.0809
19   2841542.94177   41   3600705.16054   63   2806280.04013
20   2830475.97348   42   3019833.71191   64   3448494.3913
21   2845655.55789   43   2752951.93264
22   3450389.84365   44   2753107.27164

It can be seen that when the value of xsk_kick_thr is relatively small
the performance is not good, and when its value is greater than 13 the
performance becomes more irregular and unstable. The values from 3 to
13 look similar, so I chose 8 as the default.

The test environment is qemu + vhost-net. I modified vhost-net to drop
the packets sent by the vm directly, so that the cpu of the vm can run
higher.
By default, the processes in the vm and the cpu of softirqd are too low, and there is no obvious difference in the test data. During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was run for 300s, the pps of every second was recorded, and the average of the pps was finally taken. The vhost process cpu on the host has also reached 100%. Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li --- drivers/net/virtio/virtio_net.c | 1 + drivers/net/virtio/xsk.c| 18 -- drivers/net/virtio/xsk.h| 2 ++ 3 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 9503133e71f0..dfe509939b45 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true; module_param(csum, bool, 0444); module_param(gso, bool, 0444); module_param(napi_tx, bool, 0644); +module_param(xsk_kick_thr, int, 0644); /* FIXME: MTU in config. */ #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN) diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 3973c82d1ad2..2f3ba6ab4798 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -5,6 +5,8 @@ #include "virtio_net.h" +int xsk_kick_thr = 8; + static struct virtio_net_hdr_mrg_rxbuf xsk_hdr; static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head) @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq, struct xdp_desc desc; int err, packet = 0; int ret = -EAGAIN; + int need_kick = 0; while (budget-- > 0) { if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) { @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq, } ++packet; + ++need_kick; + if (need_kick > xsk_kick_thr) { + if (virtqueue_kick_prepare(sq->vq) && + virtqueue_notify(sq->vq)) + ++stats->kicks; + + need_kick = 0; + } } if (packet) { - if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) - ++stats->kicks; + if 
(need_kick) { + if (virtqueue_kick_prepare(sq->vq) && + virtqueue_notify(sq->vq)) + ++stats->kicks; + } *done += packet; stats->xdp_tx += packet; diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h index fe22cf78d505..4f0f4f9cf23b 100644 --- a/drivers/net/virtio/xsk.h +++ b/drivers/net/virtio/xsk.h @@ -7,6 +7,8 @@ #define VIRTNET_XSK_BUFF_CTX ((void *)(unsigned long)~0L) +extern int xsk_kick_thr; + /* When xsk di
[PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
Compared to the case of xsk tx, the case of xsk zc rx is more complicated. When we process the buf received by vq, we may encounter ordinary buffers, or xsk buffers. What makes the situation more complicated is that in the case of mergeable, when num_buffer > 1, we may still encounter the case where xsk buffer is mixed with ordinary buffer. Another thing that makes the situation more complicated is that when we get an xsk buffer from vq, the xsk bound to this xsk buffer may have been unbound. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/virtio_net.c | 238 ++- drivers/net/virtio/virtio_net.h | 27 +++ drivers/net/virtio/xsk.c| 396 +++- drivers/net/virtio/xsk.h| 75 ++ 4 files changed, 678 insertions(+), 58 deletions(-) diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 40d7751f1c5f..9503133e71f0 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -125,11 +125,6 @@ static int rxq2vq(int rxq) return rxq * 2; } -static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb) -{ - return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb; -} - /* * private is used to chain pages for big packets, put the whole * most recent used list in the beginning for reuse @@ -458,6 +453,68 @@ static unsigned int virtnet_get_headroom(struct virtnet_info *vi) return vi->xdp_enabled ? 
VIRTIO_XDP_HEADROOM : 0; } +/* return value: + * 1: XDP_PASS should handle to build skb + * -1: xdp err, should handle to free the buf and return NULL + * 0: buf has been consumed by xdp + */ +int virtnet_run_xdp(struct net_device *dev, + struct bpf_prog *xdp_prog, + struct xdp_buff *xdp, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct xdp_frame *xdpf; + int act, err; + + act = bpf_prog_run_xdp(xdp_prog, xdp); + stats->xdp_packets++; + + switch (act) { + case XDP_PASS: + return 1; + + case XDP_TX: + stats->xdp_tx++; + xdpf = xdp_convert_buff_to_frame(xdp); + if (unlikely(!xdpf)) + goto err_xdp; + err = virtnet_xdp_xmit(dev, 1, &xdpf, 0); + if (unlikely(!err)) { + xdp_return_frame_rx_napi(xdpf); + } else if (unlikely(err < 0)) { + trace_xdp_exception(dev, xdp_prog, act); + goto err_xdp; + } + *xdp_xmit |= VIRTIO_XDP_TX; + return 0; + + case XDP_REDIRECT: + stats->xdp_redirects++; + err = xdp_do_redirect(dev, xdp, xdp_prog); + if (err) + goto err_xdp; + + *xdp_xmit |= VIRTIO_XDP_REDIR; + return 0; + + default: + bpf_warn_invalid_xdp_action(act); + fallthrough; + + case XDP_ABORTED: + trace_xdp_exception(dev, xdp_prog, act); + fallthrough; + + case XDP_DROP: + break; + } + +err_xdp: + stats->xdp_drops++; + return -1; +} + /* We copy the packet for XDP in the following cases: * * 1) Packet is scattered across multiple rx buffers. @@ -491,27 +548,40 @@ static struct page *xdp_linearize_page(struct receive_queue *rq, int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); unsigned int buflen; void *buf; + void *ctx; int off; - buf = virtqueue_get_buf(rq->vq, &buflen); + buf = virtqueue_get_buf_ctx(rq->vq, &buflen, &ctx); if (unlikely(!buf)) goto err_buf; - p = virt_to_head_page(buf); - off = buf - page_address(p); - /* guard against a misconfigured or uncooperative backend that * is sending packet larger than the MTU. 
*/ if ((page_off + buflen + tailroom) > PAGE_SIZE) { - put_page(p); + virtnet_rx_put_buf(buf, ctx); goto err_buf; } - memcpy(page_address(page) + page_off, - page_address(p) + off, buflen); + if (is_xsk_ctx(ctx)) { + struct virtnet_xsk_ctx_rx *xsk_ctx; + + xsk_ctx = (struct virtnet_xsk_ctx_rx *)buf; + + virtnet_xsk_ctx_rx_copy(xsk_ctx, + page_address(page) + page_off, + buflen, true); + + virtnet_xsk_ctx_rx_put(xsk_ctx); + } else { + p = virt_to_head_page(buf); + off = buf -
[PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
Calling virtqueue_napi_schedule() in wakeup makes napi run on the
current cpu. If the application is not busy there is no problem, but if
the application itself is busy it causes a lot of scheduling. If the
application is continuously sending packets, the constant scheduling
between the application and napi makes transmission uneven, with an
obvious delay (visible with tcpdump). When a channel is pushed to 100%
(vhost reaches 100%), the cpu where the application runs also reaches
100%.

This patch instead sends a small amount of data directly in wakeup. The
purpose is to trigger the tx interrupt: the tx interrupt is raised on
the cpu of its affinity and in turn starts the napi machinery, so napi
can continue to consume the xsk tx queue. With two cpus running, cpu0
running the application and cpu1 executing napi to consume data, and a
channel pushed to 100% in the same way, the utilization of cpu0 is
12.7% and the utilization of cpu1 is 2.9%.
Signed-off-by: Xuan Zhuo --- drivers/net/virtio/xsk.c | 28 +++- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 36cda2dcf8e7..3973c82d1ad2 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) { struct virtnet_info *vi = netdev_priv(dev); struct xsk_buff_pool *pool; + struct netdev_queue *txq; struct send_queue *sq; if (!netif_running(dev)) @@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) rcu_read_lock(); pool = rcu_dereference(sq->xsk.pool); - if (pool) { - local_bh_disable(); - virtqueue_napi_schedule(&sq->napi, sq->vq); - local_bh_enable(); - } + if (!pool) + goto end; + + if (napi_if_scheduled_mark_missed(&sq->napi)) + goto end; + + txq = netdev_get_tx_queue(dev, qid); + + __netif_tx_lock_bh(txq); + + /* Send part of the packet directly to reduce the delay in sending the +* packet, and this can actively trigger the tx interrupts. +* +* If no packet is sent out, the ring of the device is full. In this +* case, we will still get a tx interrupt response. Then we will deal +* with the subsequent packet sending work. +*/ + virtnet_xsk_run(sq, pool, sq->napi.weight, false); + + __netif_tx_unlock_bh(txq); + +end: rcu_read_unlock(); return 0; } -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
AF_XDP (xdp socket, xsk) is a high-performance packet receiving and
sending technology. This patch implements the binding and unbinding
operations between an xsk and a virtio-net queue for xsk zero copy
xmit.

The xsk zero copy xmit depends on tx napi, because the actual sending
of data is done in the tx napi context. If tx napi does not run, the
data in the xsk tx queue will not be sent. So if tx napi is not
enabled, binding an xsk reports an error, and while an xsk is active it
prevents ethtool from modifying tx napi.

When reclaiming a ptr, a new type of ptr is added, distinguished by the
last two bits of the ptr:

	00: skb
	01: xdp frame
	10: xsk xmit ptr

All sent xsk packets share the virtio-net header in xsk_hdr. If xsk
needs to support csum and other features later, consider assigning an
xsk hdr separately for each sent packet.

Unlike other physical network cards, which can reinitialize a channel
when binding an xsk, virtio does not support resetting an individual
channel; only the entire device can be reset, and directly resetting
the entire device does not seem appropriate. So the situation becomes a
bit more complicated: we have to consider how to deal with the buffers
still referenced in the vq after the xsk is unbound.

To handle this, a ring-size array of struct virtnet_xsk_ctx is added
when an xsk is bound. Each xsk buffer added to the vq corresponds to
one ctx, which records the page where the xsk buffer is located and
takes a page reference. When the buffer is recycled, the page reference
is dropped. When the xsk has been unbound and all related xsk buffers
have been recycled, all ctx are released.
Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li --- drivers/net/virtio/Makefile | 1 + drivers/net/virtio/virtio_net.c | 20 +- drivers/net/virtio/virtio_net.h | 37 +++- drivers/net/virtio/xsk.c| 346 drivers/net/virtio/xsk.h| 99 + 5 files changed, 497 insertions(+), 6 deletions(-) create mode 100644 drivers/net/virtio/xsk.c create mode 100644 drivers/net/virtio/xsk.h diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index ccc80f40f33a..db79d2e7925f 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -4,3 +4,4 @@ # obj-$(CONFIG_VIRTIO_NET) += virtio_net.o +obj-$(CONFIG_VIRTIO_NET) += xsk.o diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 395ec1f18331..40d7751f1c5f 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); + work_done += virtnet_poll_xsk(sq, budget); free_old_xmit(sq, true); __netif_tx_unlock(txq); @@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev, if (napi_weight ^ vi->sq[0].napi.weight) { if (dev->flags & IFF_UP) return -EBUSY; - for (i = 0; i < vi->max_queue_pairs; i++) + + for (i = 0; i < vi->max_queue_pairs; i++) { + /* xsk xmit depend on the tx napi. So if xsk is active, +* prevent modifications to tx napi. 
+*/ + if (rtnl_dereference(vi->sq[i].xsk.pool)) + continue; + vi->sq[i].napi.weight = napi_weight; + } } return 0; @@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return virtnet_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_XSK_POOL: + return virtnet_xsk_pool_setup(dev, xdp); default: return -EINVAL; } @@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = { .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid, .ndo_bpf= virtnet_xdp, .ndo_xdp_xmit = virtnet_xdp_xmit, + .ndo_xsk_wakeup = virtnet_xsk_wakeup, .ndo_features_check = passthru_features_check, .ndo_get_phys_port_name = virtnet_get_phys_port_name, .ndo_set_features = virtnet_set_features, @@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi) for (i = 0; i < vi->max_queue_pairs; i++) { struct virtqueue *vq = vi->sq[i].vq; while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) { - if (!is_xdp_frame(buf)) + if (is_skb_ptr(buf)) dev_kfree_skb(buf); - else + else if (is_xdp_frame(buf)) xdp_return_frame(ptr_to_xdp(buf)); + else +
[PATCH net-next v5 05/15] virtio: support virtqueue_detach_unused_buf_ctx
Supports returning ctx while recycling unused buf, which helps to release buf in different ways for different bufs. Signed-off-by: Xuan Zhuo --- drivers/virtio/virtio_ring.c | 22 +++--- include/linux/virtio.h | 2 ++ 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 71e16b53e9c1..a3d7ec1c9ea7 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -815,7 +815,8 @@ static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq) return true; } -static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) +static void *virtqueue_detach_unused_buf_ctx_split(struct virtqueue *_vq, + void **ctx) { struct vring_virtqueue *vq = to_vvq(_vq); unsigned int i; @@ -828,7 +829,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) continue; /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; - detach_buf_split(vq, i, NULL); + detach_buf_split(vq, i, ctx); vq->split.avail_idx_shadow--; vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow); @@ -1526,7 +1527,8 @@ static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq) return true; } -static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq) +static void *virtqueue_detach_unused_buf_ctx_packed(struct virtqueue *_vq, + void **ctx) { struct vring_virtqueue *vq = to_vvq(_vq); unsigned int i; @@ -1539,7 +1541,7 @@ static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq) continue; /* detach_buf clears data, so grab it now. */ buf = vq->packed.desc_state[i].data; - detach_buf_packed(vq, i, NULL); + detach_buf_packed(vq, i, ctx); END_USE(vq); return buf; } @@ -2018,12 +2020,18 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed); * This is not valid on an active queue; it is useful only for device * shutdown. 
*/ -void *virtqueue_detach_unused_buf(struct virtqueue *_vq) +void *virtqueue_detach_unused_buf_ctx(struct virtqueue *_vq, void **ctx) { struct vring_virtqueue *vq = to_vvq(_vq); - return vq->packed_ring ? virtqueue_detach_unused_buf_packed(_vq) : -virtqueue_detach_unused_buf_split(_vq); + return vq->packed_ring ? + virtqueue_detach_unused_buf_ctx_packed(_vq, ctx) : + virtqueue_detach_unused_buf_ctx_split(_vq, ctx); +} + +void *virtqueue_detach_unused_buf(struct virtqueue *_vq) +{ + return virtqueue_detach_unused_buf_ctx(_vq, NULL); } EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf); diff --git a/include/linux/virtio.h b/include/linux/virtio.h index b1894e0323fa..8aada4d29e04 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -78,6 +78,8 @@ bool virtqueue_poll(struct virtqueue *vq, unsigned); bool virtqueue_enable_cb_delayed(struct virtqueue *vq); +void *virtqueue_detach_unused_buf_ctx(struct virtqueue *vq, void **ctx); + void *virtqueue_detach_unused_buf(struct virtqueue *vq); unsigned int virtqueue_get_vring_size(struct virtqueue *vq); -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 10/15] virtio-net: independent directory
Create a separate directory for virtio-net. AF_XDP support will be added later, and a separate xsk.c file will be added. Signed-off-by: Xuan Zhuo --- MAINTAINERS | 2 +- drivers/net/Kconfig | 8 +--- drivers/net/Makefile | 2 +- drivers/net/virtio/Kconfig| 11 +++ drivers/net/virtio/Makefile | 6 ++ drivers/net/{ => virtio}/virtio_net.c | 0 6 files changed, 20 insertions(+), 9 deletions(-) create mode 100644 drivers/net/virtio/Kconfig create mode 100644 drivers/net/virtio/Makefile rename drivers/net/{ => virtio}/virtio_net.c (100%) diff --git a/MAINTAINERS b/MAINTAINERS index e69c1991ec3b..2041267f19f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19344,7 +19344,7 @@ S: Maintained F: Documentation/devicetree/bindings/virtio/ F: drivers/block/virtio_blk.c F: drivers/crypto/virtio/ -F: drivers/net/virtio_net.c +F: drivers/net/virtio/ F: drivers/vdpa/ F: drivers/virtio/ F: include/linux/vdpa.h diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 4da68ba8448f..2297fe4183ae 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -392,13 +392,7 @@ config VETH When one end receives the packet it appears on its pair and vice versa. -config VIRTIO_NET - tristate "Virtio network driver" - depends on VIRTIO - select NET_FAILOVER - help - This is the virtual network driver for virtio. It can be used with - QEMU based VMMs (like KVM or Xen). Say Y or M. 
+source "drivers/net/virtio/Kconfig" config NLMON tristate "Virtual netlink monitoring device" diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 7ffd2d03efaf..c4c7419e0398 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -28,7 +28,7 @@ obj-$(CONFIG_NET_TEAM) += team/ obj-$(CONFIG_TUN) += tun.o obj-$(CONFIG_TAP) += tap.o obj-$(CONFIG_VETH) += veth.o -obj-$(CONFIG_VIRTIO_NET) += virtio_net.o +obj-$(CONFIG_VIRTIO_NET) += virtio/ obj-$(CONFIG_VXLAN) += vxlan.o obj-$(CONFIG_GENEVE) += geneve.o obj-$(CONFIG_BAREUDP) += bareudp.o diff --git a/drivers/net/virtio/Kconfig b/drivers/net/virtio/Kconfig new file mode 100644 index ..9bc2a2fc6c3e --- /dev/null +++ b/drivers/net/virtio/Kconfig @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# virtio-net device configuration +# +config VIRTIO_NET + tristate "Virtio network driver" + depends on VIRTIO + select NET_FAILOVER + help + This is the virtual network driver for virtio. It can be used with + QEMU based VMMs (like KVM or Xen). Say Y or M. diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile new file mode 100644 index ..ccc80f40f33a --- /dev/null +++ b/drivers/net/virtio/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for the virtio network device drivers. +# + +obj-$(CONFIG_VIRTIO_NET) += virtio_net.o diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio/virtio_net.c similarity index 100% rename from drivers/net/virtio_net.c rename to drivers/net/virtio/virtio_net.c -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 08/15] virtio-net: split the receive_mergeable function
receive_mergeable() is too complicated, so this function is split here. One is to make the function more readable. On the other hand, the two independent functions will be called separately in subsequent patches. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 181 --- 1 file changed, 111 insertions(+), 70 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 3fd87bf2b2ad..989aba600e63 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -733,6 +733,109 @@ static struct page *xdp_linearize_page(struct receive_queue *rq, return NULL; } +static void merge_drop_follow_bufs(struct net_device *dev, + struct receive_queue *rq, + u16 num_buf, + struct virtnet_rq_stats *stats) +{ + struct page *page; + unsigned int len; + void *buf; + + while (num_buf-- > 1) { + buf = virtqueue_get_buf(rq->vq, &len); + if (unlikely(!buf)) { + pr_debug("%s: rx error: %d buffers missing\n", +dev->name, num_buf); + dev->stats.rx_length_errors++; + break; + } + stats->bytes += len; + page = virt_to_head_page(buf); + put_page(page); + } +} + +static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev, +struct virtnet_info *vi, +struct receive_queue *rq, +struct sk_buff *head_skb, +u16 num_buf, +struct virtnet_rq_stats *stats) +{ + struct sk_buff *curr_skb; + unsigned int truesize; + unsigned int len, num; + struct page *page; + void *buf, *ctx; + int offset; + + curr_skb = head_skb; + num = num_buf; + + while (--num_buf) { + int num_skb_frags; + + buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx); + if (unlikely(!buf)) { + pr_debug("%s: rx error: %d buffers out of %d missing\n", +dev->name, num_buf, num); + dev->stats.rx_length_errors++; + goto err_buf; + } + + stats->bytes += len; + page = virt_to_head_page(buf); + + truesize = mergeable_ctx_to_truesize(ctx); + if (unlikely(len > truesize)) { + pr_debug("%s: rx error: len %u exceeds truesize %lu\n", +dev->name, len, (unsigned long)ctx); + dev->stats.rx_length_errors++; + goto 
err_skb; + } + + num_skb_frags = skb_shinfo(curr_skb)->nr_frags; + if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) { + struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC); + + if (unlikely(!nskb)) + goto err_skb; + if (curr_skb == head_skb) + skb_shinfo(curr_skb)->frag_list = nskb; + else + curr_skb->next = nskb; + curr_skb = nskb; + head_skb->truesize += nskb->truesize; + num_skb_frags = 0; + } + if (curr_skb != head_skb) { + head_skb->data_len += len; + head_skb->len += len; + head_skb->truesize += truesize; + } + offset = buf - page_address(page); + if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) { + put_page(page); + skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1, +len, truesize); + } else { + skb_add_rx_frag(curr_skb, num_skb_frags, page, + offset, len, truesize); + } + } + + return head_skb; + +err_skb: + put_page(page); + merge_drop_follow_bufs(dev, rq, num_buf, stats); +err_buf: + stats->drops++; + dev_kfree_skb(head_skb); + return NULL; +} + static struct sk_buff *receive_small(struct net_device *dev, struct virtnet_info *vi, struct receive_queue *rq, @@ -909,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers); struct page *page = virt_to_head_page(buf); int offset = buf - p
[PATCH net-next v5 06/15] virtio-net: unify the code for recycling the xmit ptr
Now there are two types of "skb" and "xdp frame" during recycling old xmit. There are two completely similar and independent implementations. This is inconvenient for the subsequent addition of new types. So extract a function from this piece of code and call this function uniformly to recover old xmit ptr. Rename free_old_xmit_skbs() to free_old_xmit(). Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 86 ++-- 1 file changed, 38 insertions(+), 48 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 6c1233f0ab3e..d791543a8dd8 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -264,6 +264,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); } +static void __free_old_xmit(struct send_queue *sq, bool in_napi, + struct virtnet_sq_stats *stats) +{ + unsigned int len; + void *ptr; + + while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { + if (!is_xdp_frame(ptr)) { + struct sk_buff *skb = ptr; + + pr_debug("Sent skb %p\n", skb); + + stats->bytes += skb->len; + napi_consume_skb(skb, in_napi); + } else { + struct xdp_frame *frame = ptr_to_xdp(ptr); + + stats->bytes += frame->len; + xdp_return_frame(frame); + } + stats->packets++; + } +} + /* Converting between virtqueue no. and kernel tx/rx queue no. * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq */ @@ -572,15 +596,12 @@ static int virtnet_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames, u32 flags) { struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_sq_stats stats = {}; struct receive_queue *rq = vi->rq; struct bpf_prog *xdp_prog; struct send_queue *sq; - unsigned int len; - int packets = 0; - int bytes = 0; int nxmit = 0; int kicks = 0; - void *ptr; int ret; int i; @@ -599,20 +620,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, } /* Free up any pending old buffers before queueing new ones. 
*/ - while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(is_xdp_frame(ptr))) { - struct xdp_frame *frame = ptr_to_xdp(ptr); - - bytes += frame->len; - xdp_return_frame(frame); - } else { - struct sk_buff *skb = ptr; - - bytes += skb->len; - napi_consume_skb(skb, false); - } - packets++; - } + __free_old_xmit(sq, false, &stats); for (i = 0; i < n; i++) { struct xdp_frame *xdpf = frames[i]; @@ -629,8 +637,8 @@ static int virtnet_xdp_xmit(struct net_device *dev, } out: u64_stats_update_begin(&sq->stats.syncp); - sq->stats.bytes += bytes; - sq->stats.packets += packets; + sq->stats.bytes += stats.bytes; + sq->stats.packets += stats.packets; sq->stats.xdp_tx += n; sq->stats.xdp_tx_drops += n - nxmit; sq->stats.kicks += kicks; @@ -1459,39 +1467,21 @@ static int virtnet_receive(struct receive_queue *rq, int budget, return stats.packets; } -static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi) +static void free_old_xmit(struct send_queue *sq, bool in_napi) { - unsigned int len; - unsigned int packets = 0; - unsigned int bytes = 0; - void *ptr; + struct virtnet_sq_stats stats = {}; - while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(!is_xdp_frame(ptr))) { - struct sk_buff *skb = ptr; - - pr_debug("Sent skb %p\n", skb); - - bytes += skb->len; - napi_consume_skb(skb, in_napi); - } else { - struct xdp_frame *frame = ptr_to_xdp(ptr); - - bytes += frame->len; - xdp_return_frame(frame); - } - packets++; - } + __free_old_xmit(sq, in_napi, &stats); /* Avoid overhead when no packets have been processed * happens when called speculatively from start_xmit. */ - if (!packets) + if (!stats.packets) return; u64_stats_update_begin(&sq->stats.syncp); - sq->stats.bytes += bytes; - sq->stats.packets += packets; + sq->stats.bytes += stats.bytes; + sq->stats.packets += stats.packets; u64_stats_update_end(&sq->stats.syncp); } @@ -1516,7 +1506,7 @@ static void virtnet_pol
[PATCH net-next v5 09/15] virtio-net: virtnet_poll_tx support budget check
virtnet_poll_tx() now checks the work done against the budget, like other network drivers. When work < budget, napi_poll() in dev.c exits directly and virtqueue_napi_complete() is called to close napi. If closing napi fails, or there is still data to be processed, virtqueue_napi_complete() makes napi schedule again; this does not conflict with the logic of napi_poll(). When work == budget, virtnet_poll_tx() returns 'work_done' and napi_poll() in dev.c re-adds napi to the queue. The purpose of this patch is to support xsk xmit in virtnet_poll_tx() in a subsequent patch.

Signed-off-by: Xuan Zhuo
Acked-by: Jason Wang
---
 drivers/net/virtio_net.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 989aba600e63..953739860563 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1634,6 +1634,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	unsigned int index = vq2txq(sq->vq);
 	struct netdev_queue *txq;
+	int work_done = 0;
 
 	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
 		/* We don't need to enable cb for XDP */
@@ -1646,12 +1647,13 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
-	virtqueue_napi_complete(napi, sq->vq, 0);
+	if (work_done < budget)
+		virtqueue_napi_complete(napi, sq->vq, 0);
 
 	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
 		netif_tx_wake_queue(txq);
 
-	return 0;
+	return work_done;
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
-- 
2.31.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
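The budget contract described in this patch can be sketched outside the kernel. Everything below is illustrative (`struct fake_napi`, `fake_poll_tx()`, the pending-completion count) rather than the real NAPI API; it only shows the rule that a poll callback closes NAPI when it did strictly less work than its budget, and asks to be re-polled otherwise:

```c
#include <stdbool.h>

/* Toy stand-in for a NAPI context; not the kernel's struct napi_struct. */
struct fake_napi {
    bool completed;    /* napi was closed (napi_complete_done analogue)  */
    bool rescheduled;  /* core will poll us again (work_done == budget)  */
};

/* Pretend there are 'pending' tx completions to reap this round. */
static int fake_poll_tx(struct fake_napi *napi, int pending, int budget)
{
    int work_done = pending < budget ? pending : budget;

    /* Close NAPI only when we did strictly less work than allowed;
     * returning work_done == budget tells the core to re-add us. */
    if (work_done < budget)
        napi->completed = true;
    else
        napi->rescheduled = true;

    return work_done;
}
```

With pending < budget the callback completes NAPI; with pending >= budget it returns the full budget and stays scheduled, mirroring the two branches the commit message describes.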
[PATCH net-next v5 00/15] virtio-net: support xdp socket zero copy
XDP socket (xsk) is an excellent kernel-bypass network transmission framework, and its zero-copy feature needs support from the driver. Zero-copy performance is very good; mlx5 and Intel ixgbe already support it. This patch set allows virtio-net to support xsk's zero-copy xmit feature.

Compared with other drivers, the difficulty with virtio-net is that when we bind a channel to xsk, we cannot directly disable/enable the channel. So we have to consider the bufs that were already placed in the vq after the xsk is released by the upper layer. My solution is to add a ctx to each buf placed in the vq to record the page used, and take a reference on the page, so that when the upper xsk is released, these pages are still safe in the vq. We process these bufs when we recycle bufs or receive new data.

The rx case is more complicated, because we may encounter either an xsk buf or a normal buf. Especially in the mergeable case, we may receive multiple bufs, and xsk bufs and normal bufs may be mixed together.

v5: support rx
v4:
 1. add priv_flags IFF_NOT_USE_DMA_ADDR
 2.
more reasonable patch split Xuan Zhuo (15): netdevice: priv_flags extend to 64bit netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR virtio: support virtqueue_detach_unused_buf_ctx virtio-net: unify the code for recycling the xmit ptr virtio-net: standalone virtnet_aloc_frag function virtio-net: split the receive_mergeable function virtio-net: virtnet_poll_tx support budget check virtio-net: independent directory virtio-net: move to virtio_net.h virtio-net: support AF_XDP zc tx virtio-net: support AF_XDP zc rx virtio-net: xsk direct xmit inside xsk wakeup virtio-net: xsk zero copy xmit kick by threshold MAINTAINERS | 2 +- drivers/net/Kconfig | 8 +- drivers/net/Makefile | 2 +- drivers/net/virtio/Kconfig| 11 + drivers/net/virtio/Makefile | 7 + drivers/net/{ => virtio}/virtio_net.c | 670 +++--- drivers/net/virtio/virtio_net.h | 288 ++ drivers/net/virtio/xsk.c | 766 ++ drivers/net/virtio/xsk.h | 176 ++ drivers/virtio/virtio_ring.c | 22 +- include/linux/netdevice.h | 143 ++--- include/linux/virtio.h| 2 + net/8021q/vlanproc.c | 2 +- net/xdp/xsk_buff_pool.c | 2 +- 14 files changed, 1664 insertions(+), 437 deletions(-) create mode 100644 drivers/net/virtio/Kconfig create mode 100644 drivers/net/virtio/Makefile rename drivers/net/{ => virtio}/virtio_net.c (92%) create mode 100644 drivers/net/virtio/virtio_net.h create mode 100644 drivers/net/virtio/xsk.c create mode 100644 drivers/net/virtio/xsk.h -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
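The ctx-plus-page-reference idea from the cover letter can be sketched with a toy refcount. The names here (`fake_page`, `fake_vq_buf`, `fake_queue_buf`) are illustrative stand-ins for the kernel's `get_page()`/`put_page()` machinery, just showing why a buf left in the vq keeps its page alive after the xsk pool is torn down:

```c
/* Toy model: each buf queued in the vq holds an extra reference on its
 * page, so the page survives even after the xsk pool that allocated it
 * is released. Names are illustrative, not kernel API. */
struct fake_page {
    int refcount;
};

struct fake_vq_buf {
    struct fake_page *page;   /* ctx recording which page the buf uses */
};

static void fake_queue_buf(struct fake_vq_buf *buf, struct fake_page *page)
{
    page->refcount++;         /* like get_page(): the vq now pins the page */
    buf->page = page;
}

static void fake_recycle_buf(struct fake_vq_buf *buf)
{
    buf->page->refcount--;    /* like put_page(): drop the vq's pin */
    buf->page = (void *)0;
}
```

Releasing the pool drops its own reference, but the page's refcount only reaches zero once the vq buf is recycled, which is exactly the safety property the cover letter relies on.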
[PATCH net-next v5 02/15] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
Some devices, such as virtio-net, do not use DMA addresses directly. Upper-level frameworks such as XDP socket need to be aware of this, so add a new priv_flag, IFF_NOT_USE_DMA_ADDR.

Signed-off-by: Xuan Zhuo
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3202e055b305..2de0a0c455e5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1596,6 +1596,7 @@ typedef u64 netdev_priv_flags_t;
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
  * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
  *	skb_headlen(skb) == 0 (data starts from frag0)
+ * @IFF_NOT_USE_DMA_ADDR: driver does not use dma addr directly, such as virtio-net
  */
 enum {
 	IFF_802_1Q_VLAN_BIT,
@@ -1630,6 +1631,7 @@ enum {
 	IFF_L3MDEV_RX_HANDLER_BIT,
 	IFF_LIVE_RENAME_OK_BIT,
 	IFF_TX_SKB_NO_LINEAR_BIT,
+	IFF_NOT_USE_DMA_ADDR_BIT,
 };
 
 #define __IFF_BIT(bit)	((netdev_priv_flags_t)1 << (bit))
@@ -1667,6 +1669,7 @@ enum {
 #define IFF_L3MDEV_RX_HANDLER	__IFF(L3MDEV_RX_HANDLER)
 #define IFF_LIVE_RENAME_OK	__IFF(LIVE_RENAME_OK)
 #define IFF_TX_SKB_NO_LINEAR	__IFF(TX_SKB_NO_LINEAR)
+#define IFF_NOT_USE_DMA_ADDR	__IFF(NOT_USE_DMA_ADDR)
 
 /* Specifies the type of the struct net_device::ml_priv pointer */
 enum netdev_ml_priv_type {
-- 
2.31.0
[PATCH net-next v5 04/15] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
Some devices, such as virtio-net, do not use DMA addresses directly. These devices do not initialize DMA after completing the xsk setup, so skip the DMA check for them.

Signed-off-by: Xuan Zhuo
Reviewed-by: Dust Li
Acked-by: Magnus Karlsson
---
 net/xdp/xsk_buff_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 8de01aaac4a0..a7e434de0308 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -171,7 +171,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		goto err_unreg_pool;
 
-	if (!pool->dma_pages) {
+	if (!(netdev->priv_flags & IFF_NOT_USE_DMA_ADDR) && !pool->dma_pages) {
 		WARN(1, "Driver did not DMA map zero-copy buffers");
 		err = -EINVAL;
 		goto err_unreg_xsk;
-- 
2.31.0
[PATCH net-next v5 01/15] netdevice: priv_flags extend to 64bit
The size of priv_flags is 32 bits, and the number of flags currently available has reached 32. It is time to expand the size of priv_flags to 64 bits. Here the priv_flags is modified to 8 bytes, but the size of struct net_device has not changed, it is still 2176 bytes. It is because _tx is aligned based on the cache line. But there is a 4-byte hole left here. Since the fields before and after priv_flags are read mostly, I did not adjust the order of the fields here. Signed-off-by: Xuan Zhuo --- include/linux/netdevice.h | 140 -- net/8021q/vlanproc.c | 2 +- 2 files changed, 73 insertions(+), 69 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index be1dcceda5e4..3202e055b305 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1549,9 +1549,9 @@ struct net_device_ops { struct net_device_path *path); }; +typedef u64 netdev_priv_flags_t; + /** - * enum netdev_priv_flags - &struct net_device priv_flags - * * These are the &struct net_device, they are only set internally * by drivers and used in the kernel. 
These flags are invisible to * userspace; this means that the order of these flags can change @@ -1597,73 +1597,76 @@ struct net_device_ops { * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with * skb_headlen(skb) == 0 (data starts from frag0) */ -enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, - IFF_TX_SKB_NO_LINEAR= 1<<31, +enum { + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define 
IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define
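The `__IFF_BIT()` pattern in this patch matters because shifting a plain `int` constant by 31 already lands on the sign bit, and shifting by 32 is undefined behaviour; widening to the 64-bit flag type before shifting makes bits 32..63 usable. A standalone check (macro and type names here are illustrative mirrors of the patch's `__IFF_BIT`/`netdev_priv_flags_t`):

```c
#include <stdint.h>

typedef uint64_t fake_priv_flags_t;

/* Mirror of the patch's __IFF_BIT(): cast to the 64-bit flag type
 * before shifting, so the shift happens in 64-bit arithmetic. */
#define FAKE_IFF_BIT(bit) ((fake_priv_flags_t)1 << (bit))

static fake_priv_flags_t fake_flag(int bit)
{
    return FAKE_IFF_BIT(bit);
}
```

Had the cast been omitted, `1 << 32` would be undefined, which is exactly why the 33rd flag (IFF_NOT_USE_DMA_ADDR in the next patch) forces this 64-bit conversion first.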
Re: [PATCH 7/9] vhost: allow userspace to create workers
On Wed, Jun 09, 2021 at 04:03:55PM -0500, Mike Christie wrote:
> On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
> > My concern is that threads should probably be accounted against
> > RLIMIT_NPROC and max_threads rather than something indirect like 128 *
> > RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
> > vhost-user file descriptors open).
>
> Ah ok, I see what you want, I think.
>
> Ok, I think the options are:
>
> 0. Nothing. Just use the existing indirect/RLIMIT_NOFILE limit.
>
> 1. Do something like io_uring's create_io_thread/copy_process. If we call
> copy_process from the vhost ioctl context, then the userspace process that
> did the ioctl will have its process count incremented and checked against
> its rlimit.
>
> The drawbacks:
> - This gets a little more complicated than just calling copy_process though.
> We end up duplicating a lot of the kthread API.
> - We have to deal with new error cases like the parent exiting early.
> - I think all devs sharing a worker have to have the same owner. kthread_use_mm
> and kthread_unuse_mm to switch between mm's for different owners' devs seem to
> be causing lots of errors. I'm still looking into this one though.
>
> 2. It's not really what you want, but for unbound work io_uring has a check for
> RLIMIT_NPROC in the io_uring code. It does:
>
> wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> 			task_rlimit(current, RLIMIT_NPROC);
>
> then does:
>
> if (!ret && acct->nr_workers < acct->max_workers) {
>
> Drawbacks:
> In vhost.c, we could do something similar. It would make sure that vhost.c does
> not create more worker threads than the rlimit value, but we wouldn't be
> incrementing the userspace process's process count. The userspace process could
> then create RLIMIT_NPROC threads, and vhost.c could also create RLIMIT_NPROC
> threads, so we end up with 2 * RLIMIT_NPROC threads.

Yes, in that case we might as well go with Option 0, so I think this
option can be eliminated.

> 3. Change the kthread and copy_process code so we can pass in the thread
> (or its creds, or some struct that has the values that need to be checked)
> that needs to be checked and updated.
>
> Drawback:
> This might be considered too ugly for how special-case vhost is. For example, we
> need checks/code like the io_thread/PF_IO_WORKER code in copy_process for
> io_uring. I can see adding that for io_uring because it affects so many users,
> but I can see how vhost is not special enough.

This seems like the most general solution. If you try it and get
negative feedback then maybe the maintainers can help suggest how to
solve this problem :).

Stefan
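Option 2's io_uring-style cap can be reduced to a plain counter check. The sketch below is illustrative (`fake_acct`, `fake_try_create_worker()` are invented names); in the real code `max_workers` would be seeded from `task_rlimit(current, RLIMIT_NPROC)` and the failure path would return -EAGAIN:

```c
/* Toy version of io_uring's unbound-worker accounting discussed above. */
struct fake_acct {
    unsigned int nr_workers;
    unsigned int max_workers;  /* would come from RLIMIT_NPROC */
};

/* Take a worker slot, or fail when the cap is reached. Returns 0 on
 * success and -1 when no more workers may be created. */
static int fake_try_create_worker(struct fake_acct *acct)
{
    if (acct->nr_workers >= acct->max_workers)
        return -1;
    acct->nr_workers++;
    return 0;
}
```

This illustrates the drawback Mike points out: the cap bounds vhost's own workers, but nothing here charges the userspace process's NPROC accounting, so the two pools of threads add up.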
[PATCH v4 4/6] iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
Passing a 64-bit address width to iommu_setup_dma_ops() is valid on virtual platforms, but isn't currently possible. The overflow check in iommu_dma_init_domain() prevents this even when @dma_base isn't 0. Pass a limit address instead of a size, so callers don't have to fake a size to work around the check. Signed-off-by: Jean-Philippe Brucker --- include/linux/dma-iommu.h | 4 ++-- arch/arm64/mm/dma-mapping.c | 2 +- drivers/iommu/amd/iommu.c | 2 +- drivers/iommu/dma-iommu.c | 12 ++-- drivers/iommu/intel/iommu.c | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index 6e75a2d689b4..758ca4694257 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -19,7 +19,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base); void iommu_put_dma_cookie(struct iommu_domain *domain); /* Setup call for arch DMA mapping code */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size); +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit); /* The DMA API isn't _quite_ the whole story, though... 
*/ /*

@@ -50,7 +50,7 @@ struct msi_msg;
 struct device;
 
 static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base,
-				       u64 size)
+				       u64 dma_limit)
 {
 }
 
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 4bf1dd3eb041..7bd1d2199141 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -50,7 +50,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	dev->dma_coherent = coherent;
 	if (iommu)
-		iommu_setup_dma_ops(dev, dma_base, size);
+		iommu_setup_dma_ops(dev, dma_base, dma_base + size - 1);
 
 #ifdef CONFIG_XEN
 	if (xen_swiotlb_detect())
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3ac42bbdefc6..94b96d81fcfd 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1713,7 +1713,7 @@ static void amd_iommu_probe_finalize(struct device *dev)
 	/* Domains are initialized for this device - have a look what we ended up with */
 	domain = iommu_get_domain_for_dev(dev);
 	if (domain->type == IOMMU_DOMAIN_DMA)
-		iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, 0);
+		iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX);
 	else
 		set_dma_ops(dev, NULL);
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..c62e19bed302 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -319,16 +319,16 @@ static bool dev_is_untrusted(struct device *dev)
  * iommu_dma_init_domain - Initialise a DMA mapping domain
  * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
  * @base: IOVA at which the mappable address space starts
- * @size: Size of IOVA space
+ * @limit: Last address of the IOVA space
  * @dev: Device the domain is being initialised for
  *
- * @base and @size should be exact multiples of IOMMU page granularity to
+ * @base and @limit + 1 should be exact multiples of IOMMU page granularity to
  * avoid rounding surprises.
If necessary, we reserve the page at address 0 * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but * any change which could make prior IOVAs invalid will fail. */ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, - u64 size, struct device *dev) +dma_addr_t limit, struct device *dev) { struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; @@ -346,7 +346,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, /* Check the domain allows at least some access to the device... */ if (domain->geometry.force_aperture) { if (base > domain->geometry.aperture_end || - base + size <= domain->geometry.aperture_start) { + limit < domain->geometry.aperture_start) { pr_warn("specified DMA range outside IOMMU capability\n"); return -EFAULT; } @@ -1308,7 +1308,7 @@ static const struct dma_map_ops iommu_dma_ops = { * The IOMMU core code allocates the default DMA domain, which the underlying * IOMMU driver needs to support via the dma-iommu layer. */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) { struct iommu_domain *domain = iommu_get_domain_for_dev(dev); @@ -1320,7 +1320,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) * underlying IOMMU driver needs to support via the dma-iommu layer. */ if (domain->type == IOM
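The overflow this patch avoids is easy to demonstrate in isolation. The two helpers below are illustrative reductions of the old size-based and new limit-based aperture checks, not the kernel functions themselves: with a full 64-bit address width, `base + size` wraps to 0 and the size form wrongly rejects a valid window, while comparing against a last-address limit involves no arithmetic that can overflow.

```c
#include <stdint.h>
#include <stdbool.h>

/* Old style: pass a size. For a full 64-bit window, base + size wraps
 * to 0, so "base + size <= aperture_start" spuriously triggers. */
static bool aperture_ok_size(uint64_t base, uint64_t size,
                             uint64_t ap_start, uint64_t ap_end)
{
    return !(base > ap_end || base + size <= ap_start);
}

/* New style: pass the last valid address; nothing can overflow. */
static bool aperture_ok_limit(uint64_t base, uint64_t limit,
                              uint64_t ap_start, uint64_t ap_end)
{
    return !(base > ap_end || limit < ap_start);
}
```

This is exactly the check in iommu_dma_init_domain() that previously forced callers to fake a size; passing `limit` lets a 64-bit address width through while still rejecting ranges outside the aperture.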
[PATCH v4 3/6] ACPI: Add driver for the VIOT table
The ACPI Virtual I/O Translation Table describes topology of para-virtual platforms, similarly to vendor tables DMAR, IVRS and IORT. For now it describes the relation between virtio-iommu and the endpoints it manages. Three steps are needed to configure DMA of endpoints: (1) acpi_viot_init(): parse the VIOT table, find or create the fwnode associated to each vIOMMU device. (2) When probing the vIOMMU device, the driver registers its IOMMU ops within the IOMMU subsystem. This step doesn't require any intervention from the VIOT driver. (3) viot_iommu_configure(): before binding the endpoint to a driver, find the associated IOMMU ops. Register them, along with the endpoint ID, into the device's iommu_fwspec. If step (3) happens before step (2), it is deferred until the IOMMU is initialized, then retried. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/Kconfig | 3 + drivers/iommu/Kconfig | 1 + drivers/acpi/Makefile | 2 + include/linux/acpi_viot.h | 19 ++ drivers/acpi/bus.c| 2 + drivers/acpi/scan.c | 3 + drivers/acpi/viot.c | 364 ++ MAINTAINERS | 8 + 8 files changed, 402 insertions(+) create mode 100644 include/linux/acpi_viot.h create mode 100644 drivers/acpi/viot.c diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index eedec61e3476..3758c6940ed7 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -526,6 +526,9 @@ endif source "drivers/acpi/pmic/Kconfig" +config ACPI_VIOT + bool + endif # ACPI config X86_PM_TIMER diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 1f111b399bca..aff8a4830dd1 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -403,6 +403,7 @@ config VIRTIO_IOMMU depends on ARM64 select IOMMU_API select INTERVAL_TREE + select ACPI_VIOT if ACPI help Para-virtualised IOMMU driver with virtio. 
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 700b41adf2db..a6e644c48987 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -118,3 +118,5 @@ video-objs += acpi_video.o video_detect.o obj-y += dptf/ obj-$(CONFIG_ARM64)+= arm64/ + +obj-$(CONFIG_ACPI_VIOT)+= viot.o diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h new file mode 100644 index ..1eb8ee5b0e5f --- /dev/null +++ b/include/linux/acpi_viot.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef __ACPI_VIOT_H__ +#define __ACPI_VIOT_H__ + +#include + +#ifdef CONFIG_ACPI_VIOT +void __init acpi_viot_init(void); +int viot_iommu_configure(struct device *dev); +#else +static inline void acpi_viot_init(void) {} +static inline int viot_iommu_configure(struct device *dev) +{ + return -ENODEV; +} +#endif + +#endif /* __ACPI_VIOT_H__ */ diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index be7da23fad76..b835ca702ff0 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -27,6 +27,7 @@ #include #endif #include +#include #include #include #include @@ -1339,6 +1340,7 @@ static int __init acpi_init(void) pci_mmcfg_late_init(); acpi_iort_init(); acpi_scan_init(); + acpi_viot_init(); acpi_ec_init(); acpi_debugfs_init(); acpi_sleep_proc_init(); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 0c53c8533300..4fa684fdfda8 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -1556,6 +1557,8 @@ static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev, return ops; err = iort_iommu_configure_id(dev, id_in); + if (err && err != -EPROBE_DEFER) + err = viot_iommu_configure(dev); /* * If we have reason to believe the IOMMU driver missed the initial diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c new file mode 100644 index ..892cd9fa7b6d --- /dev/null +++ b/drivers/acpi/viot.c @@ -0,0 +1,364 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * 
Virtual I/O topology + * + * The Virtual I/O Translation Table (VIOT) describes the topology of + * para-virtual IOMMUs and the endpoints they manage. The OS uses it to + * initialize devices in the right order, preventing endpoints from issuing DMA + * before their IOMMU is ready. + * + * When binding a driver to a device, before calling the device driver's probe() + * method, the driver infrastructure calls dma_configure(). At that point the + * VIOT driver looks for an IOMMU associated to the device in the VIOT table. + * If an IOMMU exists and has been initialized, the VIOT driver initializes the + * device's IOMMU fwspec, allowing the DMA infrastructure to invoke the IOMMU + * ops when the device driver configures DMA mappings. If an IOMMU exists a
[PATCH v4 5/6] iommu/dma: Simplify calls to iommu_setup_dma_ops()
dma-iommu uses the address bounds described in domain->geometry during IOVA allocation. The address size parameters of iommu_setup_dma_ops() are useful for describing additional limits set by the platform firmware, but aren't needed for drivers that call this function from probe_finalize(). The base parameter can be zero because dma-iommu already removes the first IOVA page, and the limit parameter can be U64_MAX because it's only checked against the domain geometry. Simplify calls to iommu_setup_dma_ops(). Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/amd/iommu.c | 9 + drivers/iommu/dma-iommu.c | 4 +++- drivers/iommu/intel/iommu.c | 10 +- 3 files changed, 5 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 94b96d81fcfd..d3123bc05c08 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1708,14 +1708,7 @@ static struct iommu_device *amd_iommu_probe_device(struct device *dev) static void amd_iommu_probe_finalize(struct device *dev) { - struct iommu_domain *domain; - - /* Domains are initialized for this device - have a look what we ended up with */ - domain = iommu_get_domain_for_dev(dev); - if (domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void amd_iommu_release_device(struct device *dev) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index c62e19bed302..175f8eaeb5b3 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1322,7 +1322,9 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) if (domain->type == IOMMU_DOMAIN_DMA) { if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev)) goto out_err; - dev->dma_ops = &iommu_dma_ops; + set_dma_ops(dev, &iommu_dma_ops); + } else { + set_dma_ops(dev, NULL); } return; diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c 
index 85f18342603c..8d866940692a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5165,15 +5165,7 @@ static void intel_iommu_release_device(struct device *dev) static void intel_iommu_probe_finalize(struct device *dev) { - dma_addr_t base = IOVA_START_PFN << VTD_PAGE_SHIFT; - struct iommu_domain *domain = iommu_get_domain_for_dev(dev); - struct dmar_domain *dmar_domain = to_dmar_domain(domain); - - if (domain && domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, base, - __DOMAIN_MAX_ADDR(dmar_domain->gaw)); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void intel_iommu_get_resv_regions(struct device *device, -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 2/6] ACPI: Move IOMMU setup code out of IORT
Extract the code that sets up the IOMMU infrastructure from IORT, since it can be reused by VIOT. Move it one level up into a new acpi_iommu_configure_id() function, which calls the IORT parsing function, which in turn calls the acpi_iommu_fwspec_init() helper.

Signed-off-by: Jean-Philippe Brucker
---
 include/acpi/acpi_bus.h   |  3 ++
 include/linux/acpi_iort.h |  8 ++---
 drivers/acpi/arm64/iort.c | 75 +--
 drivers/acpi/scan.c       | 73 -
 4 files changed, 87 insertions(+), 72 deletions(-)

diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 3a82faac5767..41f092a269f6 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -588,6 +588,9 @@ struct acpi_pci_root {
 bool acpi_dma_supported(struct acpi_device *adev);
 enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev);
+int acpi_iommu_fwspec_init(struct device *dev, u32 id,
+			   struct fwnode_handle *fwnode,
+			   const struct iommu_ops *ops);
 int acpi_dma_get_range(struct device *dev, u64 *dma_addr, u64 *offset,
 		       u64 *size);
 int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr,
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index f7f054833afd..f1f0842a2cb2 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -35,8 +35,7 @@ void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
 int iort_dma_get_ranges(struct device *dev, u64 *size);
-const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-						const u32 *id_in);
+int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
 #else
@@ -50,9 +49,8 @@ static inline void acpi_configure_pmsi_domain(struct device *dev) { }
 /* IOMMU interface */
 static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
 { return -ENODEV; }
-static inline const struct iommu_ops *iort_iommu_configure_id(
-			struct device *dev, const u32 *id_in)
-{ return NULL; }
+static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
+{ return -ENODEV; }
 static inline int iort_iommu_msi_get_resv_regions(struct device *dev,
						  struct list_head *head)
 { return 0; }
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index a940be1cf2af..b5b021e064b6 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -806,23 +806,6 @@ static struct acpi_iort_node *iort_get_msi_resv_iommu(struct device *dev)
 	return NULL;
 }
-static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev)
-{
-	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-
-	return (fwspec && fwspec->ops) ? fwspec->ops : NULL;
-}
-
-static inline int iort_add_device_replay(struct device *dev)
-{
-	int err = 0;
-
-	if (dev->bus && !device_iommu_mapped(dev))
-		err = iommu_probe_device(dev);
-
-	return err;
-}
-
 /**
  * iort_iommu_msi_get_resv_regions - Reserved region driver helper
  * @dev: Device from iommu_get_resv_regions()
@@ -900,18 +883,6 @@ static inline bool iort_iommu_driver_enabled(u8 type)
 	}
 }
-static int arm_smmu_iort_xlate(struct device *dev, u32 streamid,
-			       struct fwnode_handle *fwnode,
-			       const struct iommu_ops *ops)
-{
-	int ret = iommu_fwspec_init(dev, fwnode, ops);
-
-	if (!ret)
-		ret = iommu_fwspec_add_ids(dev, &streamid, 1);
-
-	return ret;
-}
-
 static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node)
 {
 	struct acpi_iort_root_complex *pci_rc;
@@ -946,7 +917,7 @@ static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
 		return iort_iommu_driver_enabled(node->type) ?
		       -EPROBE_DEFER : -ENODEV;
-	return arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops);
+	return acpi_iommu_fwspec_init(dev, streamid, iort_fwnode, ops);
 }

 struct iort_pci_alias_info {
@@ -1020,24 +991,14 @@ static int iort_nc_iommu_map_id(struct device *dev,
  * @dev: device to configure
  * @id_in: optional input id const value pointer
  *
- * Returns: iommu_ops pointer on configuration success
- *          NULL on configuration failure
+ * Returns: 0 on success, <0 on failure
  */
-const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-						const u32 *id_in)
+int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
 	struct acpi_iort_node *node;
-	const struct iommu_ops *ops;
+	const struct iommu_ops *ops = NULL;
 	int err = -ENODEV;
-
[PATCH v4 6/6] iommu/virtio: Enable x86 support
With the VIOT support in place, x86 platforms can now use the virtio-iommu. Because the other x86 IOMMU drivers aren't yet ready to use the acpi_dma_setup() path, x86 doesn't implement arch_setup_dma_ops() at the moment. Similarly to Vt-d and AMD IOMMU, call iommu_setup_dma_ops() from probe_finalize().

Acked-by: Joerg Roedel
Acked-by: Michael S. Tsirkin
Signed-off-by: Jean-Philippe Brucker
---
 drivers/iommu/Kconfig        | 3 ++-
 drivers/iommu/dma-iommu.c    | 1 +
 drivers/iommu/virtio-iommu.c | 8
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index aff8a4830dd1..07b7c25cbed8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -400,8 +400,9 @@ config HYPERV_IOMMU
 config VIRTIO_IOMMU
 	tristate "Virtio IOMMU driver"
 	depends on VIRTIO
-	depends on ARM64
+	depends on (ARM64 || X86)
 	select IOMMU_API
+	select IOMMU_DMA
 	select INTERVAL_TREE
 	select ACPI_VIOT if ACPI
 	help
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 175f8eaeb5b3..46ed43c400cf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1332,6 +1332,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit)
 		pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
 			dev_name(dev));
 }
+EXPORT_SYMBOL_GPL(iommu_setup_dma_ops);

 static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
		phys_addr_t msi_addr, struct iommu_domain *domain)
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 218fe8560e8d..77aee1207ced 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1026,6 +1026,13 @@ static struct iommu_device *viommu_probe_device(struct device *dev)
 	return ERR_PTR(ret);
 }

+static void viommu_probe_finalize(struct device *dev)
+{
+#ifndef CONFIG_ARCH_HAS_SETUP_DMA_OPS
+	iommu_setup_dma_ops(dev, 0, U64_MAX);
+#endif
+}
+
 static void viommu_release_device(struct device *dev)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
@@ -1062,6 +1069,7 @@ static struct iommu_ops viommu_ops = {
 	.iova_to_phys		= viommu_iova_to_phys,
 	.iotlb_sync		= viommu_iotlb_sync,
 	.probe_device		= viommu_probe_device,
+	.probe_finalize		= viommu_probe_finalize,
 	.release_device		= viommu_release_device,
 	.device_group		= viommu_device_group,
 	.get_resv_regions	= viommu_get_resv_regions,
--
2.31.1
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 1/6] ACPI: arm64: Move DMA setup operations out of IORT
Extract generic DMA setup code out of IORT, so it can be reused by VIOT. Keep it in drivers/acpi/arm64 for now, since it could break x86 platforms that haven't run this code so far, if they have invalid tables.

Signed-off-by: Jean-Philippe Brucker
---
 drivers/acpi/arm64/Makefile |  1 +
 include/linux/acpi.h        |  3 +++
 include/linux/acpi_iort.h   |  6 ++---
 drivers/acpi/arm64/dma.c    | 50 ++
 drivers/acpi/arm64/iort.c   | 54 ++---
 drivers/acpi/scan.c         |  2 +-
 6 files changed, 66 insertions(+), 50 deletions(-)
 create mode 100644 drivers/acpi/arm64/dma.c

diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
index 6ff50f4ed947..66acbe77f46e 100644
--- a/drivers/acpi/arm64/Makefile
+++ b/drivers/acpi/arm64/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_ACPI_IORT)	+= iort.o
 obj-$(CONFIG_ACPI_GTDT)	+= gtdt.o
+obj-y			+= dma.o
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index c60745f657e9..7aaa9559cc19 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -259,9 +259,12 @@ void acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa);

 #ifdef CONFIG_ARM64
 void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa);
+void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size);
 #else
 static inline void
 acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa) { }
+static inline void
+acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) { }
 #endif

 int acpi_numa_memory_affinity_init (struct acpi_srat_mem_affinity *ma);
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 1a12baa58e40..f7f054833afd 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -34,7 +34,7 @@ struct irq_domain *iort_get_device_domain(struct device *dev, u32 id,
 void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
-void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
+int iort_dma_get_ranges(struct device *dev, u64 *size);
 const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
						const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
@@ -48,8 +48,8 @@ static inline struct irq_domain *iort_get_device_domain(
 { return NULL; }
 static inline void acpi_configure_pmsi_domain(struct device *dev) { }
 /* IOMMU interface */
-static inline void iort_dma_setup(struct device *dev, u64 *dma_addr,
-				  u64 *size) { }
+static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
+{ return -ENODEV; }
 static inline const struct iommu_ops *iort_iommu_configure_id(
			struct device *dev, const u32 *id_in)
 { return NULL; }
diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
new file mode 100644
index ..f16739ad3cc0
--- /dev/null
+++ b/drivers/acpi/arm64/dma.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+#include
+#include
+#include
+
+void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
+{
+	int ret;
+	u64 end, mask;
+	u64 dmaaddr = 0, size = 0, offset = 0;
+
+	/*
+	 * If @dev is expected to be DMA-capable then the bus code that created
+	 * it should have initialised its dma_mask pointer by this point. For
+	 * now, we'll continue the legacy behaviour of coercing it to the
+	 * coherent mask if not, but we'll no longer do so quietly.
+	 */
+	if (!dev->dma_mask) {
+		dev_warn(dev, "DMA mask not set\n");
+		dev->dma_mask = &dev->coherent_dma_mask;
+	}
+
+	if (dev->coherent_dma_mask)
+		size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
+	else
+		size = 1ULL << 32;
+
+	ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
+	if (ret == -ENODEV)
+		ret = iort_dma_get_ranges(dev, &size);
+	if (!ret) {
+		/*
+		 * Limit coherent and dma mask based on size retrieved from
+		 * firmware.
+		 */
+		end = dmaaddr + size - 1;
+		mask = DMA_BIT_MASK(ilog2(end) + 1);
+		dev->bus_dma_limit = end;
+		dev->coherent_dma_mask = min(dev->coherent_dma_mask, mask);
+		*dev->dma_mask = min(*dev->dma_mask, mask);
+	}
+
+	*dma_addr = dmaaddr;
+	*dma_size = size;
+
+	ret = dma_direct_set_offset(dev, dmaaddr + offset, dmaaddr, size);
+
+	dev_dbg(dev, "dma_offset(%#08llx)%s\n", offset, ret ? " failed!" : "");
+}
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3912a1f6058e..a940be1cf2af 100644
[PATCH v4 0/6] Add support for ACPI VIOT
Add a driver for the ACPI VIOT table, which provides topology information for para-virtual IOMMUs. Enable virtio-iommu on non-devicetree platforms, including x86.

Since v3 [1] I fixed a build bug for !CONFIG_IOMMU_API. Joerg offered to take this series through the IOMMU tree, which requires Acks for patches 1-3.

You can find a QEMU implementation at [2], with extra support for testing all VIOT nodes including MMIO-based endpoints and IOMMU. This series is at [3].

[1] https://lore.kernel.org/linux-iommu/2021060215.1077006-1-jean-phili...@linaro.org/
[2] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/acpi
[3] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/acpi

Jean-Philippe Brucker (6):
  ACPI: arm64: Move DMA setup operations out of IORT
  ACPI: Move IOMMU setup code out of IORT
  ACPI: Add driver for the VIOT table
  iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
  iommu/dma: Simplify calls to iommu_setup_dma_ops()
  iommu/virtio: Enable x86 support

 drivers/acpi/Kconfig         |   3 +
 drivers/iommu/Kconfig        |   4 +-
 drivers/acpi/Makefile        |   2 +
 drivers/acpi/arm64/Makefile  |   1 +
 include/acpi/acpi_bus.h      |   3 +
 include/linux/acpi.h         |   3 +
 include/linux/acpi_iort.h    |  14 +-
 include/linux/acpi_viot.h    |  19 ++
 include/linux/dma-iommu.h    |   4 +-
 arch/arm64/mm/dma-mapping.c  |   2 +-
 drivers/acpi/arm64/dma.c     |  50 +
 drivers/acpi/arm64/iort.c    | 129 ++---
 drivers/acpi/bus.c           |   2 +
 drivers/acpi/scan.c          |  78 +++-
 drivers/acpi/viot.c          | 364 +++
 drivers/iommu/amd/iommu.c    |   9 +-
 drivers/iommu/dma-iommu.c    |  17 +-
 drivers/iommu/intel/iommu.c  |  10 +-
 drivers/iommu/virtio-iommu.c |   8 +
 MAINTAINERS                  |   8 +
 20 files changed, 580 insertions(+), 150 deletions(-)
 create mode 100644 include/linux/acpi_viot.h
 create mode 100644 drivers/acpi/arm64/dma.c
 create mode 100644 drivers/acpi/viot.c

--
2.31.1
Re: [RFC v1 0/6] virtio/vsock: introduce SOCK_DGRAM support
On 2021/6/10 3:23 PM, Stefano Garzarella wrote:

On Thu, Jun 10, 2021 at 12:02:35PM +0800, Jason Wang wrote:

On 2021/6/10 11:43 AM, Jiang Wang wrote:

On Wed, Jun 9, 2021 at 6:51 PM Jason Wang wrote:

On 2021/6/10 7:24 AM, Jiang Wang wrote:

This patchset implements support of SOCK_DGRAM for virtio transport. Datagram sockets are connectionless and unreliable. To avoid unfair contention with stream and other sockets, add two more virtqueues and a new feature bit to indicate whether those two new queues exist. Dgram does not use the existing credit update mechanism for stream sockets. When sending from the guest/driver, packets are sent synchronously, so the sender gets an error when the virtqueue is full. When sending from the host/device, packets are sent asynchronously, because the descriptor memory belongs to the corresponding QEMU process.

What's the use case for the datagram vsock?

One use case is non-critical info logging from the guest to the host, such as performance data of some applications.

Anything that prevents you from using the stream socket?

It can also be used to replace UDP communications between the guest and the host.

Any advantage for VSOCK in this case? Is it for performance? (I guess not, since I don't expect vsock to be faster.)

I think the general advantage of using vsock is for guest agents that potentially don't need any configuration.

Right, I wonder if we really need datagram, considering host-to-guest communication is reliable. (Note that I don't object to it, since vsock already supports that; I just wonder about its use cases.) An obvious drawback is that it breaks migration. Using UDP you can have very rich feature support from the kernel, which vsock can't.

Thanks for bringing this up! What features does UDP support that datagram on vsock could not?

E.g. sendpage() and busy polling. And using UDP means qdiscs and eBPF can work.
The virtio spec patch is here: https://www.spinics.net/lists/linux-virtualization/msg50027.html

Having a quick glance, I suggest splitting mergeable rx buffer into a separate patch.

Sure.

But I think it's time to revisit the idea of unifying virtio-net and virtio-vsock. Otherwise we're duplicating features and bugs.

For mergeable rxbuf related code, I think a set of common helper functions can be used by both virtio-net and virtio-vsock. For other parts, that may not be very beneficial. I will think about it more. If there is a previous email discussion about this topic, could you send me some links? I did a quick web search but did not find any related info. Thanks.

We had a lot:

[1] https://patchwork.kernel.org/project/kvm/patch/5bdff537.3050...@huawei.com/
[2] https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039798.html
[3] https://www.lkml.org/lkml/2020/1/16/2043

When I tried it, the biggest problem that blocked me was all the features strictly tied to the TCP/IP stack and Ethernet devices that the vsock device doesn't know how to handle: TSO, GSO, checksums, MAC, NAPI, XDP, minimum Ethernet frame size, MTU, etc.

It depends on which level we want to share:

1) sharing code
2) sharing devices
3) making vsock a protocol that is understood by the network core

We can start from 1); the low-level tx/rx logic can be shared by both virtio-net and vhost-net. For 2) we probably need some work on the spec, probably with a new feature bit to indicate that it's a vsock device, not an Ethernet device. Then, if it is probed as a vsock device, we won't let packets be delivered into the TCP/IP stack. For 3), it would be even harder, and I'm not sure it's worth doing.

So in my opinion unifying them is not so simple, because vsock is not really an Ethernet device, but simply a socket.

We can start from sharing code.
But I fully agree that we shouldn't duplicate functionality and code, so maybe we could find those common parts and create helpers to be used by both.

Yes.

Thanks

Thanks,
Stefano